Research on calculation method of text similarity based on smooth inverse frequency

The Journal of China Universities of Posts and Telecommunications (English edition), 2020, Vol. 27, Issue (2): 56-64. doi: 10.19682/j.cnki.1005-8885.2020.1007

• Artificial Intelligence •

Yuan Ye, Yu Minmin, Liu Jiming

Key Laboratory of E-commerce and Modern Logistics, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

Received: 2019-12-30 Revised: 2020-05-09 Online: 2020-04-30 Published: 2020-07-07
Contact: Yuan Ye, E-mail: yuanye@cqupt.edu.cn
Supported by: This work was supported by Chongqing Education Committee (20SKGH059).

Abstract

To improve the accuracy of text similarity calculation, this paper presents a sentence-vector-based text similarity function, part of speech and word order-smooth inverse frequency (PO-SIF), which optimizes the classical SIF calculation method in two respects: part of speech and word order. The classical SIF algorithm calculates sentence similarity by obtaining a sentence vector through weighting and noise reduction. However, different methods of weighting or noise reduction affect the efficiency and the accuracy of the similarity calculation. In the proposed PO-SIF, the weight parameters of the SIF sentence vector are first updated by the part-of-speech subtraction factor, to determine the most crucial words. Furthermore, PO-SIF calculates the sentence vector similarity taking word order into account, which overcomes the drawback of similarity analysis that is mostly based on word frequency. The experimental results validate that the proposed PO-SIF improves the accuracy of text similarity calculation.

Key words:

word2vec, SIF, part-of-speech, word order similarity
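
For readers unfamiliar with the classical SIF baseline that the abstract builds on, the following is a minimal sketch of the weighting and noise-reduction steps as described by Arora et al. (2017): each word vector is weighted by a/(a + p(w)), the weighted vectors are averaged, and the projection onto the first singular vector of the sentence matrix is removed. The function names (`sif_embeddings`, `cosine_similarity`), the argument names, and the default `a = 1e-3` are illustrative assumptions, not taken from this paper. The sketch does not implement PO-SIF's part-of-speech subtraction factor or word-order term, since the abstract does not specify their formulas.

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Classical SIF sentence vectors (sketch; names are hypothetical).

    sentences:    list of token lists
    word_vectors: dict mapping token -> 1-D numpy array (e.g. from word2vec)
    word_prob:    dict mapping token -> estimated unigram probability p(w)
    a:            smoothing parameter (assumed default)
    """
    dim = len(next(iter(word_vectors.values())))
    rows = []
    for tokens in sentences:
        known = [t for t in tokens if t in word_vectors]
        if not known:
            # No known words: fall back to a zero vector.
            rows.append(np.zeros(dim))
            continue
        # Weighting step: frequent words receive smaller weights.
        weights = np.array([a / (a + word_prob.get(t, 0.0)) for t in known])
        vectors = np.array([word_vectors[t] for t in known])
        rows.append(weights @ vectors / len(known))
    X = np.vstack(rows)
    # Noise-reduction step: remove the projection of every sentence
    # vector onto the first singular direction of the sentence matrix.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    return X - np.outer(X @ u, u)

def cosine_similarity(x, y):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0
```

In this sketch, the similarity of two sentences is the cosine similarity of their rows in the returned matrix. Because the removed direction is estimated from the whole batch, the vectors depend on which sentences are embedded together.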

Yuan Ye, Yu Minmin, Liu Jiming. Research on calculation method of text similarity based on smooth inverse frequency[J]. The Journal of China Universities of Posts and Telecommunications, 2020, 27(2): 56-64.
