Enhancing News Similarity with Chunking Strategy and Hyperparameter Setting on Hybrid SBERT - Node2Vec Model
Abstract
The proliferation of online news necessitates accurate article-similarity systems to combat information overload, yet models based solely on semantic content often ignore crucial structural context such as news source and publication date. This research proposes and evaluates a hybrid embedding model that integrates semantic representations from Sentence-BERT (SBERT) with structural representations from Node2Vec. A series of quantitative experiments was conducted on the challenging, multilingual SPICED dataset to determine the optimal model configuration. Using Mean Squared Error (MSE) for evaluation, the results show that a per-paragraph chunking strategy yielded the best performance. This strategy's effectiveness was corroborated by the identical performance of the optimal fixed-size chunk (450 characters with a 64-character overlap), a size that aligns closely with the dataset's average paragraph length. Furthermore, a community-focused (BFS-like) Node2Vec configuration (p=1.0, q=2.0, l=60) was identified as optimal for the structural component. Notably, the final hybrid model (MSE = 0.1434) proved superior to both the purely semantic (MSE = 0.1449) and purely structural (MSE = 0.2512) models. This study concludes that the fusion of content and context provides the most comprehensive and accurate representation for news-similarity detection.
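The pipeline the abstract describes can be sketched minimally as follows. Only the chunk size (450), the overlap (64), and the idea of combining a semantic with a structural vector come from the abstract; the mean-pooling of per-chunk embeddings, the fusion-by-weighted-concatenation, and the `alpha` weight are illustrative assumptions, not details reported by the authors.

```python
import numpy as np

def chunk_text(text, size=450, overlap=64):
    """Split text into fixed-size character chunks with overlap.
    size=450 and overlap=64 mirror the optimal values in the abstract."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def hybrid_embedding(chunk_vectors, node_vector, alpha=0.5):
    """Fuse a semantic document vector (here: the mean of per-chunk SBERT
    embeddings) with a structural Node2Vec vector by weighted concatenation.
    alpha is a hypothetical fusion weight, not a value from the paper."""
    semantic = np.mean(chunk_vectors, axis=0)
    return np.concatenate([alpha * semantic, (1 - alpha) * node_vector])

def cosine_similarity(a, b):
    """Standard cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In a full implementation, `chunk_vectors` would come from an SBERT encoder applied to each chunk and `node_vector` from a Node2Vec model trained on the source/date graph; the similarity of two articles is then the cosine similarity of their fused vectors.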
References
J. Wang and Y. Dong, “Measurement of text similarity: A survey,” Inf., vol. 11, no. 9, pp. 1–17, 2020, doi: 10.3390/info11090421.
N. Pradhan, M. Gyanchandani, and R. Wadhvani, “A Review on Text Similarity Technique used in IR and its Application,” Int. J. Comput. Appl., vol. 120, no. 9, pp. 29–34, 2015, doi: 10.5120/21257-4109.
D. K. Wardy, I. K. G. D. Putra, and N. K. D. Rusjayanthi, “Clustering Artikel pada Portal Berita Online,” JITTER- J. Ilm. Teknol. dan Komput., vol. 3, no. 1, pp. 3–11, 2022.
K. Erk, “Vector Space Models of Word Meaning and Phrase Meaning: A Survey,” Linguist. Lang. Compass, vol. 6, no. 10, pp. 635–653, 2012, doi: 10.1002/lnco.362.
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” NAACL HLT 2019 - 2019 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. - Proc. Conf., vol. 1, pp. 4171–4186, 2019.
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese BERT-networks,” EMNLP-IJCNLP 2019 - 2019 Conf. Empir. Methods Nat. Lang. Process. 9th Int. Jt. Conf. Nat. Lang. Process. Proc. Conf., pp. 3982–3992, 2019, doi: 10.18653/v1/d19-1410.
H.-S. Sheu and S. Li, “Context-aware Graph Embedding for Session-based News Recommendation,” 2020.
A. Grover and J. Leskovec, “node2vec: Scalable Feature Learning for Networks,” Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 855–864, 2016, doi: 10.1145/2939672.2939754.
J. Wu, J. Sun, H. Sun, and G. Sun, “Performance Analysis of Graph Neural Network Frameworks,” Proc. - 2021 IEEE Int. Symp. Perform. Anal. Syst. Software, ISPASS 2021, pp. 118–127, 2021, doi: 10.1109/ISPASS51385.2021.00029.
L. Wu et al., “Word mover’s embedding: From word2vec to document embedding,” Proc. 2018 Conf. Empir. Methods Nat. Lang. Process. EMNLP 2018, pp. 4524–4534, 2018, doi: 10.18653/v1/d18-1482.
L. Meng and N. Masuda, “Analysis of node2vec random walks on networks: Node2vec random walks on networks,” Proc. R. Soc. A Math. Phys. Eng. Sci., vol. 476, no. 2243, 2020, doi: 10.1098/rspa.2020.0447.
E. Shushkevich, M. V. Loureiro, L. Mai, S. Derby, and T. K. Wijaya, “SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels,” 2024 Jt. Int. Conf. Comput. Linguist. Lang. Resour. Eval. Lr. 2024 - Main Conf. Proc., pp. 15181–15190, 2024.
X. Wang, X. He, Y. Cao, M. Liu, and T. S. Chua, “KGAT: Knowledge graph attention network for recommendation,” Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 950–958, 2019, doi: 10.1145/3292500.3330989.
P. Verma, “S2 Chunking: A Hybrid Framework for Document Segmentation Through Integrated Spatial and Semantic Analysis,” pp. 1–10, 2025, [Online]. Available: http://arxiv.org/abs/2501.05485
M. A. Goulis and C. Stegehuis, “Optimising node2vec in Dynamic Graphs Through Local Retraining,” 2024.
N. Dehak, R. Dehak, J. Glass, D. Reynolds, and P. Kenny, “Cosine similarity scoring without score normalization techniques,” Odyssey 2010 Speak. Lang. Recognit. Work., pp. 71–75, 2010.
J. C. Nacher and T. Akutsu, “Analysis on critical nodes in controlling complex networks using dominating sets,” Proc. - 2013 Int. Conf. Signal-Image Technol. Internet-Based Syst. SITIS 2013, pp. 649–654, 2013, doi: 10.1109/SITIS.2013.106.
B. J. Goode and D. Datta, “A Geometric Approach to Predicting Bounds of Downstream Model Performance,” Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 1596–1604, 2020, doi: 10.1145/3394486.3403210.
Z. Huang, D. Liang, P. Xu, and B. Xiang, “Multiplicative Position-aware Transformer Models for Language Understanding,” 2021, [Online]. Available: http://arxiv.org/abs/2109.12788
C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is All You Need in Speech Separation,” Oct. 2020, [Online]. Available: http://arxiv.org/abs/2010.13154
T. Mikolov, W. T. Yih, and G. Zweig, “Linguistic Regularities in Continuous Space Word Representations,” Proc. NAACL-HLT 2013, pp. 746–751, 2013.
L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” 33rd AAAI Conf. Artif. Intell. AAAI 2019, pp. 7370–7377, 2019, doi: 10.1609/aaai.v33i01.33017370.
V. Karpukhin et al., “Dense passage retrieval for open-domain question answering,” EMNLP 2020 - 2020 Conf. Empir. Methods Nat. Lang. Process. Proc. Conf., pp. 6769–6781, 2020, doi: 10.18653/v1/2020.emnlp-main.550.
H. Chen, S. F. Sultan, Y. Tian, M. Chen, and S. Skiena, “Fast and accurate network embeddings via very sparse random projection,” Int. Conf. Inf. Knowl. Manag. Proc., pp. 399–408, 2019, doi: 10.1145/3357384.3357879.
T. Chavan and S. Patil, “Named Entity Recognition (NER) for News Articles,” Int. J. Adv. Res. Eng. Technol., vol. 2, no. 1, pp. 103–112, 2024, doi: 10.34218/ijaird.2.1.2024.10.


Copyright (c) 2025 Journal of Information Systems and Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.
- I certify that I have read, understood, and agreed to the Journal of Information Systems and Informatics (Journal-ISI) submission guidelines, policies, and submission declaration. The submission uses the provided template.
- I certify that all authors have approved the publication of this manuscript and that there is no conflict of interest.
- I confirm that the manuscript is the authors' original work, has not been previously published, and is not under consideration for publication elsewhere.
- I confirm that all authors listed on the title page have contributed significantly to the work, have read the manuscript, attest to the validity and legitimacy of the data and its interpretation, and agree to its submission.
- I confirm that the submitted paper is not a copied or plagiarized version of any other published work.
- I declare that I shall not submit the paper for publication in any other journal or magazine until the journal editors have made a decision.
- If the paper is accepted by the journal for publication, I confirm that I will either publish the paper immediately or withdraw it according to the withdrawal policies.
- I agree that if the paper is published by this journal, I transfer copyright or assign exclusive rights (including commercial rights) to the publisher.