Efficient representation of text with multiple perspectives

doi:10.1016/S1005-8885(11)60234-3

The Journal of China Universities of Posts and Telecommunications, 2012, Vol. 19, Issue (1): 101-111. doi: 10.1016/S1005-8885(11)60234-3

Received: 2011-05-17 Revised: 2011-10-09 Online: 2012-02-28 Published: 2012-02-21
Contact: Yuan PING E-mail: pingyuan@bupt.edu.cn
Supported by:
The authors would like to thank Lan M. for valuable suggestions and the handling editor and anonymous reviewers for helping to greatly improve the paper. This work was supported by the Hi-Tech Research and Development Program of China (2009AA01Z430), the National Natural Science Foundation of China (60972077, 60821001), the National S&T Major Program (2010ZX03003-003-01), the Fundamental Research Funds for the Central Universities (BUPT2011RC0210), and the Science and Technology on Electronic Control Laboratory.

Abstract:

An effective text representation scheme largely determines the performance of a text categorization system. However, traditional schemes, which assume that terms are independent and rely solely on term frequency (TF) and document frequency (DF), capture too little of a document's information and result in poor performance. To overcome this limitation, we investigate the relationships between different terms with the same class tendency and ways of measuring the importance of a term that is repeated in a document. We propose a group of novel term weighting factors that enhance the category contribution of each term. Then, based on a novel strategy for generating passages from a document, we present two schemes to improve text representation: the weighted co-contributions of different terms corresponding to the class tendency, and the weighted co-contributions of each term in different passages. The former scheme works in a dimensionality reduction mode, while the latter runs in the conventional way. Experiments on four benchmark corpora with a support vector machine (SVM) classifier show that the proposed schemes consistently outperform conventional methods in both efficiency and accuracy. Further analysis also confirms some promising directions for future work.
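For reference, the conventional TF/DF-based representation that the abstract contrasts against can be sketched as follows. This is a minimal illustration of standard TF-IDF weighting under the term-independence assumption, not the weighting factors or passage-based schemes proposed in the paper; the function name `tf_idf_vectors` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term frequency * log(N / document frequency).
    Hypothetical helper for illustration; not taken from the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Toy corpus (made-up data, whitespace-tokenized).
docs = [
    "svm text text classifier".split(),
    "text passage weighting".split(),
    "svm kernel margin".split(),
]
vectors = tf_idf_vectors(docs)
```

Each term here is weighted in isolation, without regard to other terms of the same class tendency or to where in the document its repetitions occur, which is the limitation the proposed schemes aim to address.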

Key words:

text representation, support vector machine (SVM), class tendency, category contribution, passages

CLC Number:

TP181

Copyright © 2020 The Journal of China Universities of Posts and Telecommunications