Semantic-Enhanced News Clustering Using TF-IDF and WordNet with K-Means
DOI:
https://doi.org/10.63158/journalisi.v7i4.1360
Keywords:
News Clustering, TF-IDF, Keyword Extraction, WordNet, Semantic Similarity, K-Means
Abstract
Text clustering of news articles falls under unsupervised learning, where models operate on unlabeled data unless partially annotated. K-Means clustering remains one of the most commonly applied algorithms because of its efficiency and simplicity. Likewise, TF-IDF is a widely used approach for generating document feature matrices through statistical term weighting. Although still relevant, TF-IDF cannot represent contextual meaning, which often prevents semantically related news articles from forming coherent clusters when they use different wording. This limitation is evidenced by the baseline experiment, in which TF-IDF obtained a silhouette score of 0.011 at the optimal cluster configuration (k = 5). To overcome this limitation, this study introduces semantic enrichment using WordNet to improve similarity representation based on keywords extracted through TF-IDF, evaluated on 1,000 documents sampled from 21,495 filtered records. The elbow method was applied to determine the optimal number of clusters. At the optimal k-value of 3, the proposed method achieved a silhouette score of 0.505, substantially outperforming the baseline TF-IDF representation despite using fewer clusters. These results demonstrate that incorporating semantic information can enhance statistical text representations and produce more contextually coherent news clusters. To manage computational cost, the model applies a first-POS strategy, in which only the first synset derived from POS tagging is considered. While this reduces processing complexity, it may limit the model's ability to fully capture polysemy.
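The effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `FIRST_SENSE` table is a made-up stand-in for a WordNet lookup (in NLTK, for instance, a comparable first-synset lookup could be `wordnet.synsets(word, pos=...)[0]`), the three toy documents are invented, and the smoothed IDF formula is one common variant chosen here as an assumption. The sketch only shows why mapping words to a shared first sense can raise the similarity of differently worded articles before any clustering is run.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: maps a word to the identifier
# of its first sense. The entries are invented for illustration; a real
# pipeline would query WordNet using each word's POS tag.
FIRST_SENSE = {
    "car": "car.n.01",
    "automobile": "car.n.01",
    "buy": "buy.v.01",
    "purchase": "buy.v.01",
}


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Relative term frequency times a smoothed IDF (assumed variant).
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enrich(doc):
    """Replace each token with its first-sense identifier when one is known."""
    return [FIRST_SENSE.get(token, token) for token in doc]


docs = [
    "dealers buy car".split(),
    "dealers purchase automobile".split(),
    "court delays ruling".split(),
]

plain = tfidf(docs)
semantic = tfidf([enrich(doc) for doc in docs])

sim_plain = cosine(plain[0], plain[1])           # only "dealers" is shared
sim_semantic = cosine(semantic[0], semantic[1])  # all three senses are shared
sim_unrelated = cosine(semantic[0], semantic[2]) # no shared terms

print(f"plain TF-IDF similarity:    {sim_plain:.3f}")
print(f"sense-enriched similarity:  {sim_semantic:.3f}")
print(f"unrelated pair similarity:  {sim_unrelated:.3f}")
```

Because only the first sense is used, a word such as "bank" would always map to a single meaning in this sketch, which mirrors the polysemy limitation the abstract acknowledges.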