The Journal of China Universities of Posts and Telecommunications, 2012, Vol. 19, Issue (1): 101-111. doi: 10.1016/S1005-8885(11)60234-3

• Others •

Efficient representation of text with multiple perspectives

Yuan PING1, Yajian ZHOU1, Chao XUE1, Yixian YANG2

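  1. Beijing University of Posts and Telecommunications
  2. Information Security Center, Beijing University of Posts and Telecommunications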
  • Received: 2011-05-17 Revised: 2011-10-09 Online: 2012-02-28 Published: 2012-02-21
  • Contact: Yuan PING E-mail: pingyuan@bupt.edu.cn
  • Supported by:

    The authors would like to thank Lan M. for valuable suggestions and the handling editor and anonymous reviewers for helping to greatly improve the paper. This work was supported by the Hi-Tech Research and Development Program of China (2009AA01Z430), the National Natural Science Foundation of China (60972077, 60821001), the National S&T Major Program (2010ZX03003-003-01), the Fundamental Research Funds for the Central Universities (BUPT2011RC0210), and the Science and Technology on Electronic Control Laboratory.

Abstract:

An effective text representation scheme largely determines the performance of a text categorization system. However, traditional schemes that rely on term frequency (TF) and document frequency (DF) assume that terms are independent, so they capture too little information about a document and perform poorly. To overcome this limitation, we investigate the relationships between different terms with the same class tendency and how to measure the importance of a term that is repeated within a document. In this paper, a group of novel term weighting factors is proposed to enhance the category contribution of each term. Then, based on a novel strategy for generating passages from a document, we present two schemes for improving text representation: the weighted co-contributions of different terms corresponding to the class tendency, and the weighted co-contributions of each term in different passages. The first scheme works in a dimensionality reduction mode, while the second runs in the conventional way. Using a support vector machine (SVM) classifier, experiments on four benchmark corpora show that the proposed schemes consistently outperform the conventional methods in both efficiency and accuracy. Further analysis also confirms some promising directions for future work.

Key words:

text representation, support vector machine (SVM), class tendency, category contribution, passages

CLC number: