Acta Metallurgica Sinica (English letters) ›› 2012, Vol. 19 ›› Issue (1): 101-111. doi: 10.1016/S1005-8885(11)60234-3

• Others

Efficient representation of text with multiple perspectives

  

  • Received: 2011-05-17 Revised: 2011-10-09 Online: 2012-02-28 Published: 2012-02-21
  • Contact: Yuan PING, E-mail: pingyuan@bupt.edu.cn
  • Supported by:

    The authors would like to thank Lan M. for valuable suggestions and the handling editor and anonymous reviewers for helping to greatly improve the paper. This work was supported by the Hi-Tech Research and Development Program of China (2009AA01Z430), the National Natural Science Foundation of China (60972077, 60821001), the National S&T Major Program (2010ZX03003-003-01), the Fundamental Research Funds for the Central Universities (BUPT2011RC0210), and the Science and Technology on Electronic Control Laboratory.

Abstract:

An effective text representation scheme largely determines the performance of a text categorization system. However, traditional schemes, which rely on term frequency (TF) and document frequency (DF) under the assumption that terms are independent, capture too little information about a document and therefore perform poorly. To overcome this limitation, we investigate the relationships between different terms that share the same class tendency, as well as how to measure the importance of a term that is repeated within a document. In this paper, a group of novel term weighting factors is proposed to enhance the category contribution of each term. Then, based on a novel strategy for generating passages from a document, we present two schemes to improve text representation: the weighted co-contributions of different terms corresponding to the class tendency, and the weighted co-contributions of each term in different passages. The former scheme works in a dimensionality reduction mode, while the latter runs in the conventional way. Using the support vector machine (SVM) classifier, experiments on four benchmark corpora show that the proposed schemes achieve consistently better performance than the conventional methods in both efficiency and accuracy. Further analysis also confirms some promising directions for future work.

Key words:

text representation, support vector machine (SVM), class tendency, category contribution, passages
