The Journal of China Universities of Posts and Telecommunications (English edition) ›› 2020, Vol. 27 ›› Issue (2): 56-64. doi: 10.19682/j.cnki.1005-8885.2020.1007

• Artificial Intelligence •

Research on calculation method of text similarity based on smooth inverse frequency

Yuan Ye, Yu Minmin, Liu Jiming   

  1. Key Laboratory of E-commerce and Modern Logistics, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2019-12-30 Revised: 2020-05-09 Online: 2020-04-30 Published: 2020-07-07
  • Corresponding author: Yuan Ye, E-mail: yuanye@cqupt.edu.cn
  • About the author: Yuan Ye, E-mail: yuanye@cqupt.edu.cn
  • Supported by:
    This work was supported by Chongqing Education Committee (20SKGH059).

Abstract: To improve the accuracy of text similarity calculation, this paper presents a sentence-vector-based text similarity function, part of speech and word order-smooth inverse frequency (PO-SIF), which optimizes the classical SIF calculation method in two respects: part of speech and word order. The classical SIF algorithm calculates sentence similarity by obtaining a sentence vector through weighting and noise reduction. However, different methods of weighting or noise reduction affect the efficiency and accuracy of the similarity calculation. In the proposed PO-SIF, the weight parameters of the SIF sentence vector are first updated by the part-of-speech subtraction factor to determine the most crucial words. Furthermore, PO-SIF calculates the sentence vector similarity taking word order into account, which overcomes the drawback of similarity analysis based mostly on word frequency. The experimental results validate that the proposed PO-SIF improves the accuracy of text similarity calculation.

Key words:

word2vec, SIF, part-of-speech, word order similarity
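
For orientation, the two steps of the classical SIF baseline mentioned in the abstract (weighting, then noise reduction) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the smoothing parameter `a = 1e-3`, the power-iteration routine, and the toy vectors standing in for word2vec embeddings are all hypothetical. The part-of-speech and word-order modifications of PO-SIF are not shown, because the abstract does not give their formulas.

```python
import math

def weighted_average(tokens, word_vectors, word_prob, a=1e-3):
    """Weighting step: average word vectors with weights a / (a + p(w)).

    word_prob maps a word to its (assumed) unigram probability; frequent
    words receive small weights, rare words receive weights close to 1.
    """
    dim = len(next(iter(word_vectors.values())))
    known = [t for t in tokens if t in word_vectors]
    vec = [0.0] * dim
    for t in known:
        w = a / (a + word_prob.get(t, 0.0))
        for i, x in enumerate(word_vectors[t]):
            vec[i] += w * x / len(known)
    return vec

def remove_first_component(emb, iters=100):
    """Noise-reduction step: subtract each vector's projection onto the
    dominant direction of the sentence-vector matrix.

    The direction is estimated by power iteration on E^T E, used here as a
    dependency-free stand-in for an SVD (an assumption of this sketch).
    """
    dim = len(emb[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(e[i] * u[i] for i in range(dim)) for e in emb]
        nxt = [sum(p * e[i] for p, e in zip(proj, emb)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    out = []
    for e in emb:
        p = sum(e[i] * u[i] for i in range(dim))
        out.append([e[i] - p * u[i] for i in range(dim)])
    return out

def cosine(x, y):
    """Cosine similarity between two sentence vectors (0.0 if either is zero)."""
    nx = math.sqrt(sum(v * v for v in x))
    ny = math.sqrt(sum(v * v for v in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0
    return sum(p * q for p, q in zip(x, y)) / (nx * ny)

# Toy usage with made-up 3-dimensional vectors and probabilities.
toy_vectors = {"cat": [1.0, 0.2, 0.0], "dog": [0.9, 0.3, 0.1],
               "the": [0.5, 0.5, 0.5], "car": [0.0, 1.0, 0.4]}
toy_prob = {"the": 0.05, "cat": 0.001, "dog": 0.001, "car": 0.002}
sentences = [["the", "cat"], ["the", "dog"], ["the", "car"]]
raw = [weighted_average(s, toy_vectors, toy_prob) for s in sentences]
sif = remove_first_component(raw)
similarity = cosine(sif[0], sif[1])
```

Removing the dominant direction is what the abstract calls reducing noise: it discards the component shared by nearly all sentence vectors, so that the remaining similarity reflects content words rather than common ones.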
