Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms
DOI:
https://doi.org/10.33557/journalisi.v1i2.18
Keywords:
document clustering, cosine similarity, k-main
Abstract
Clustering is a useful technique that organizes a large number of unordered text documents into a small number of meaningful and coherent clusters. Effective and efficient organization of documents is needed to support intuitive and informative tracking mechanisms. In this paper, we propose clustering documents using cosine similarity and k-main. The experimental results show that the accuracy of our method is 84.3%.
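The general idea of grouping documents by cosine similarity can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "k-main" denotes a k-means-style procedure, and the whitespace tokenizer, the smoothed TF-IDF weighting, the farthest-first seeding, and the function names (`tfidf_vectors`, `cosine_similarity`, `cluster_documents`) are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenized documents into sparse TF-IDF dicts (assumed weighting)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cluster_documents(vectors, k, max_iter=50):
    """k-means-style clustering with cosine similarity; requires 1 <= k <= len(vectors)."""
    # Deterministic farthest-first seeding (an assumption, not taken from the paper).
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        farthest = min(
            vectors,
            key=lambda v: max(cosine_similarity(v, c) for c in centroids),
        )
        centroids.append(dict(farthest))
    labels = None
    for _ in range(max_iter):
        # Assign every document to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine_similarity(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each centroid as the mean of its members; keep it if the cluster is empty.
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


docs = [
    "cat dog pet",
    "dog cat animal pet",
    "stock market trade",
    "market stock price trade",
]
labels = cluster_documents(tfidf_vectors(docs), k=2)
print(labels)
```

On this toy input the two pet-related documents share one label and the two market-related documents share the other. A real pipeline would normally add preprocessing such as stop-word removal and stemming before weighting terms.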