Enhancing News Similarity with Chunking Strategy and Hyperparameter Setting on Hybrid SBERT - Node2Vec Model
Abstract
The proliferation of online news necessitates accurate article-similarity systems to combat information overload, yet models based solely on semantic content often ignore crucial structural context such as news source and publication date. This research proposes and evaluates a hybrid embedding model that integrates semantic representations from Sentence-BERT (SBERT) with structural representations from Node2Vec. A series of quantitative experiments was conducted on the challenging, multilingual SPICED dataset to determine the optimal model configuration. Using Mean Squared Error (MSE) for evaluation, the results show that a per-paragraph chunking strategy yielded the best performance. This strategy's effectiveness was corroborated by the identical performance of the optimal fixed-size chunk (450 characters with a 64-character overlap), a size that aligns closely with the dataset's average paragraph length. Furthermore, a community-focused (BFS-like) Node2Vec configuration (p=1.0, q=2.0, l=60) was identified as optimal for the structural component. Notably, the final hybrid model (MSE = 0.1434) proved superior to both the purely semantic (MSE = 0.1449) and purely structural (MSE = 0.2512) models. This study concludes that the fusion of content and context provides the most comprehensive and accurate representation for news-similarity detection.
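The pipeline the abstract describes can be sketched minimally as follows. Only the chunk size (450), the overlap (64), and the idea of combining a semantic with a structural vector come from the abstract; the mean-pooling of per-chunk embeddings, the fusion-by-weighted-concatenation, and the `alpha` weight are illustrative assumptions, not details reported by the authors.

```python
import numpy as np

def chunk_text(text, size=450, overlap=64):
    """Split text into fixed-size character chunks with overlap.
    size=450 and overlap=64 mirror the optimal values in the abstract."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def hybrid_embedding(chunk_vectors, node_vector, alpha=0.5):
    """Fuse a semantic document vector (here: the mean of per-chunk SBERT
    embeddings) with a structural Node2Vec vector by weighted concatenation.
    alpha is a hypothetical fusion weight, not a value from the paper."""
    semantic = np.mean(chunk_vectors, axis=0)
    return np.concatenate([alpha * semantic, (1 - alpha) * node_vector])

def cosine_similarity(a, b):
    """Standard cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In a full implementation, `chunk_vectors` would come from an SBERT encoder applied to each chunk and `node_vector` from a Node2Vec model trained on the source/date graph; the similarity of two articles is then the cosine similarity of their fused vectors.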
References
J. Wang and Y. Dong, “Measurement of text similarity: A survey,” Inf., vol. 11, no. 9, pp. 1–17, 2020, doi: 10.3390/info11090421.
N. Pradhan, M. Gyanchandani, and R. Wadhvani, “A Review on Text Similarity Technique used in IR and its Application,” Int. J. Comput. Appl., vol. 120, no. 9, pp. 29–34, 2015, doi: 10.5120/21257-4109.
D. K. Wardy, I. K. G. D. Putra, and N. K. D. Rusjayanthi, “Clustering Artikel pada Portal Berita Online,” JITTER- J. Ilm. Teknol. dan Komput., vol. 3, no. 1, pp. 3–11, 2022.
K. Erk, “Vector Space Models of Word Meaning and Phrase Meaning: A Survey,” Linguist. Lang. Compass, vol. 6, no. 10, pp. 635–653, 2012, doi: 10.1002/lnco.362.
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” NAACL HLT 2019 - 2019 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. - Proc. Conf., vol. 1, pp. 4171–4186, 2019.
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese BERT-networks,” EMNLP-IJCNLP 2019 - 2019 Conf. Empir. Methods Nat. Lang. Process. 9th Int. Jt. Conf. Nat. Lang. Process. Proc. Conf., pp. 3982–3992, 2019, doi: 10.18653/v1/d19-1410.
H.-S. Sheu and S. Li, “Context-aware Graph Embedding for Session-based News Recommendation,” 2020.
A. Grover and J. Leskovec, “node2vec: Scalable Feature Learning for Networks,” Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 855–864, 2016, doi: 10.1145/2939672.2939754.
J. Wu, J. Sun, H. Sun, and G. Sun, “Performance Analysis of Graph Neural Network Frameworks,” Proc. - 2021 IEEE Int. Symp. Perform. Anal. Syst. Software, ISPASS 2021, pp. 118–127, 2021, doi: 10.1109/ISPASS51385.2021.00029.
L. Wu et al., “Word mover’s embedding: From word2vec to document embedding,” Proc. 2018 Conf. Empir. Methods Nat. Lang. Process. EMNLP 2018, pp. 4524–4534, 2018, doi: 10.18653/v1/d18-1482.
L. Meng and N. Masuda, “Analysis of node2vec random walks on networks: Node2vec random walks on networks,” Proc. R. Soc. A Math. Phys. Eng. Sci., vol. 476, no. 2243, 2020, doi: 10.1098/rspa.2020.0447.
E. Shushkevich, M. V. Loureiro, L. Mai, S. Derby, and T. K. Wijaya, “SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels,” 2024 Jt. Int. Conf. Comput. Linguist. Lang. Resour. Eval. Lr. 2024 - Main Conf. Proc., pp. 15181–15190, 2024.
X. Wang, X. He, Y. Cao, M. Liu, and T. S. Chua, “KGAT: Knowledge graph attention network for recommendation,” Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 950–958, 2019, doi: 10.1145/3292500.3330989.
P. Verma, “S2 Chunking: A Hybrid Framework for Document Segmentation Through Integrated Spatial and Semantic Analysis,” pp. 1–10, 2025, [Online]. Available: http://arxiv.org/abs/2501.05485
M. A. Goulis and C. Stegehuis, “Optimising node2vec in Dynamic Graphs Through Local Retraining,” 2024.
N. Dehak, R. Dehak, J. Glass, D. Reynolds, and P. Kenny, “Cosine similarity scoring without score normalization techniques,” Odyssey 2010 Speak. Lang. Recognit. Work., pp. 71–75, 2010.
J. C. Nacher and T. Akutsu, “Analysis on critical nodes in controlling complex networks using dominating sets,” Proc. - 2013 Int. Conf. Signal-Image Technol. Internet-Based Syst. SITIS 2013, pp. 649–654, 2013, doi: 10.1109/SITIS.2013.106.
B. J. Goode and D. Datta, “A Geometric Approach to Predicting Bounds of Downstream Model Performance,” Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 1596–1604, 2020, doi: 10.1145/3394486.3403210.
Z. Huang, D. Liang, P. Xu, and B. Xiang, “Multiplicative Position-aware Transformer Models for Language Understanding,” 2021, [Online]. Available: http://arxiv.org/abs/2109.12788
C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is All You Need in Speech Separation,” Oct. 2020, [Online]. Available: http://arxiv.org/abs/2010.13154
T. Mikolov, W. T. Yih, and G. Zweig, “Linguistic Regularities in Continuous Space Word Representations,” Proc. NAACL-HLT 2013, pp. 746–751, 2013.
L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” 33rd AAAI Conf. Artif. Intell. AAAI 2019, pp. 7370–7377, 2019, doi: 10.1609/aaai.v33i01.33017370.
V. Karpukhin et al., “Dense passage retrieval for open-domain question answering,” EMNLP 2020 - 2020 Conf. Empir. Methods Nat. Lang. Process. Proc. Conf., pp. 6769–6781, 2020, doi: 10.18653/v1/2020.emnlp-main.550.
H. Chen, S. F. Sultan, Y. Tian, M. Chen, and S. Skiena, “Fast and accurate network embeddings via very sparse random projection,” Int. Conf. Inf. Knowl. Manag. Proc., pp. 399–408, 2019, doi: 10.1145/3357384.3357879.
T. Chavan and S. Patil, “Named Entity Recognition (NER) for News Articles,” Int. J. Adv. Res. Eng. Technol., vol. 2, no. 1, pp. 103–112, 2024, doi: 10.34218/ijaird.2.1.2024.10.


Copyright (c) 2025 Journal of Information Systems and Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.
- I certify that I have read, understood, and agreed to the Journal of Information Systems and Informatics (Journal-ISI) submission guidelines, policies, and submission declaration. The submission uses the provided template.
- I certify that all authors have approved the publication of this manuscript and that there is no conflict of interest.
- I confirm that the manuscript is the authors' original work, has not been previously published, and is not under consideration for publication elsewhere.
- I confirm that all authors listed on the title page have contributed significantly to the work, have read the manuscript, attest to the validity and legitimacy of the data and its interpretation, and agree to its submission.
- I confirm that the submitted paper is not a copied or plagiarized version of any other published work.
- I declare that I shall not submit the paper for publication in any other journal or magazine until the journal editors have made a decision.
- If the paper is accepted by the journal for publication, I confirm that I will either publish the paper immediately or withdraw it according to the withdrawal policies.
- I agree that if the paper is published by this journal, I transfer copyright or assign exclusive rights (including commercial rights) to the publisher.