The Journal of China Universities of Posts and Telecommunications (English edition) ›› 2016, Vol. 23 ›› Issue (5): 40-46. doi: 10.1016/S1005-8885(16)60056-0

• Artificial Intelligence •

Mining microblog user interests based on TextRank with TF-IDF factor

Tu Shouzhong, Huang Minlie

  1. Tsinghua University
  • Received: 2016-04-29 Revised: 2016-09-29 Online: 2016-10-30 Published: 2016-10-26
  • Contact: Tu Shouzhong E-mail: aquares@163.com
  • Supported by:
    the National Natural Science Foundation of China (61272227).

Abstract: Modeling the interests of microblog users is valuable for both business and sociology. This paper presents a framework for mining and analyzing personal interests from microblog text, using a new algorithm that integrates term frequency-inverse document frequency (TF-IDF) with TextRank. First, we build a three-tier category system of user interests based on Wikipedia. To obtain the interest keywords, we preprocess the posts, comments and reposts in different categories and select the keywords that appear in both the category system and the microblogs. We then assign a weight to each category and calculate the weight of each keyword to obtain the TF-IDF factors. Finally, we score and rank each keyword using the TextRank algorithm with the TF-IDF factors. Experiments on real Sina microblog data demonstrate that our approach achieves significantly higher precision than existing methods.

Key words: microblog, interest feature, TF-IDF interest mining, TextRank
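
The abstract does not give the exact formula that combines the two scores, so the following is only a minimal Python sketch of one plausible reading. It assumes that the normalized TF-IDF factor of each keyword serves as the teleport (prior) term of a weighted TextRank iteration over a word co-occurrence graph. The function names, the `window` and `damping` parameters, and the smoothed IDF are illustrative assumptions, and the Wikipedia-based category system and category weights are omitted.

```python
import math
from collections import Counter, defaultdict


def tfidf_factors(user_tokens, corpus_docs):
    """TF-IDF of each term in one user's tokens against a background corpus."""
    tf = Counter(user_tokens)
    n_docs = len(corpus_docs)
    factors = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus_docs if term in doc)
        # Smoothed IDF (an assumption) so that every factor stays positive.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        factors[term] = (count / len(user_tokens)) * idf
    return factors


def cooccurrence_graph(tokens, window=3):
    """Undirected graph; edge weight counts co-occurrences within `window` tokens."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                graph[left][right] += 1.0
                graph[right][left] += 1.0
    return graph


def textrank_with_tfidf(tokens, corpus_docs, damping=0.85, iterations=50):
    """Rank keywords by TextRank, using TF-IDF as the teleport prior (assumption)."""
    factors = tfidf_factors(tokens, corpus_docs)
    graph = cooccurrence_graph(tokens)
    total = sum(factors.values())
    prior = {term: value / total for term, value in factors.items()}
    scores = dict(prior)
    for _ in range(iterations):
        new_scores = {}
        for node in prior:
            rank = 0.0
            for neighbor, weight in graph[node].items():
                out_weight = sum(graph[neighbor].values())
                rank += weight / out_weight * scores[neighbor]
            new_scores[node] = (1 - damping) * prior[node] + damping * rank
        scores = new_scores
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `textrank_with_tfidf(tokens, corpus_docs)` with one user's preprocessed keyword list and a list of background documents (each a set or list of terms) returns `(keyword, score)` pairs in descending order of score. Placing the TF-IDF factor in the teleport term keeps the graph propagation of plain TextRank unchanged while biasing the ranking toward terms that are distinctive for the user.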

CLC Number: