Semantic-Enhanced News Clustering Using TF-IDF and WordNet with K-Means

Mohammad Yusuf Hidayat; Muhammad Ainul Yaqin; Zainal Abidin

doi:10.63158/journalisi.v7i4.1360

Authors

Mohammad Yusuf Hidayat, Universitas Islam Negeri Maulana Malik Ibrahim Malang, Indonesia
Muhammad Ainul Yaqin, Universitas Islam Negeri Maulana Malik Ibrahim Malang, Indonesia
Zainal Abidin, Universitas Islam Negeri Maulana Malik Ibrahim Malang, Indonesia

DOI:

https://doi.org/10.63158/journalisi.v7i4.1360

Keywords:

News Clustering, TF-IDF, Keyword Extraction, WordNet, Semantic Similarity, K-Means

Abstract

Clustering news articles is an unsupervised learning task in which models operate on unlabeled data, or at most on partially annotated data. K-Means remains one of the most commonly applied clustering algorithms because of its efficiency and simplicity, and TF-IDF is a widely used approach for building document feature matrices through statistical term weighting. TF-IDF cannot represent contextual meaning, however, so semantically related news articles that use different wording often fail to form coherent clusters. The baseline experiment illustrates this limitation: TF-IDF obtained a silhouette score of 0.011 at its optimal cluster configuration (k = 5). To overcome the limitation, this study introduces semantic enrichment with WordNet, which improves the similarity representation of keywords extracted through TF-IDF. The method was evaluated on 1,000 documents sampled from 21,495 filtered records, and the elbow method was applied to determine the optimal number of clusters. At the optimal k of 3, the proposed method achieved a silhouette score of 0.505, well above the TF-IDF baseline despite using fewer clusters. These results indicate that incorporating semantic information can enhance statistical text representations and produce more contextually coherent news clusters. To manage computational cost, the model applies a first-POS strategy, in which only the first synset derived from POS tagging is considered. While this reduces processing complexity, it may limit the model's ability to fully capture polysemy.
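
The silhouette scores quoted in the abstract are easier to interpret with the computation in view. The following is a minimal, standard-library sketch, not the authors' implementation: it builds TF-IDF vectors for a toy corpus and computes the mean silhouette coefficient, (b - a) / max(a, b), for a given cluster assignment. The function names, the toy documents, the smoothed IDF variant, and the use of cosine distance are all assumptions made for illustration; the abstract does not state which distance measure the study used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumption: smoothed IDF, ln((1 + N) / (1 + df)) + 1.
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity; inputs are assumed to be unit length."""
    return 1.0 - sum(w * v.get(term, 0.0) for term, w in u.items())


def silhouette_score(vectors, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members)
            / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy corpus: two finance headlines, two sports headlines.
docs = [
    "stock market falls",
    "stock market rises",
    "football team wins",
    "football team loses",
]
vectors = tfidf_vectors(docs)
good_score = silhouette_score(vectors, [0, 0, 1, 1])  # topical grouping
bad_score = silhouette_score(vectors, [0, 1, 0, 1])   # mixed grouping
print(good_score, bad_score)
```

A score near 1 means documents sit much closer to their own cluster than to the nearest other cluster, while a score near 0 means the clusters overlap. Because plain TF-IDF gives documents with no shared terms a similarity of zero, articles on the same topic that use different vocabulary look unrelated, which is the gap the WordNet enrichment is meant to close.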
