The Journal of China Universities of Posts and Telecommunications, 2019, Vol. 26, Issue (2): 1-8. doi: 10.19682/j.cnki.1005-8885.2019.1001

• Artificial Intelligence •

Sentence segmentation for classical Chinese based on LSTM with radical embedding

Han Xu, Wang Hongsu, Zhang Sanqian, Fu Qunchao, Liu Jun   

  1. School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
  3. The Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content, Institute of Scientific and Technical Information of China, Beijing 100038, China
  4. Institute of Quantitative Social Science, Harvard University, Cambridge, MA, USA
  5. Department of Statistics, Harvard University, Cambridge, MA, USA
  • Received: 2018-08-07 Revised: 2019-03-26 Online: 2019-04-30 Published: 2019-06-14
  • Corresponding author: Han Xu, E-mail: fionaxuhan@gmail.com
  • Supported by:
    This work was supported by the Fund of the Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content (ZD2018-07/05).

Abstract: A feature embedding below the character level, called radical embedding, is proposed and applied to a long short-term memory (LSTM) model for sentence segmentation of pre-modern Chinese texts. The dataset includes over 150 classical Chinese books from 3 different dynasties and covers different literary styles. The LSTM-conditional random fields (LSTM-CRF) model is a state-of-the-art method for sequence labeling problems. The model proposed here adds a radical embedding component, which leads to improved performance. Experimental results on the aforementioned Chinese books demonstrate better accuracy than earlier methods on sentence segmentation, especially on Tang epitaph texts (achieving an F1-score of 81.34%).

Key words: LSTM, radical embedding, sentence segmentation
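
As a rough illustration of how sentence segmentation can be cast as sequence labeling, the sketch below tags each character by whether a break follows it, pairs each character with its radical as a sub-character feature, and scores predicted breaks with F1. The two-tag scheme (`E`/`O`), the `BREAKS` punctuation set, the small `RADICAL_OF` table and all function names are assumptions made for this example; the abstract does not specify the tag set, the radical source, or the exact evaluation protocol, and no LSTM is trained here.

```python
# Assumed set of marks treated as sentence breaks (not taken from the paper).
BREAKS = set("。，；！？")

# Tiny hand-written character-to-radical table, for illustration only.
RADICAL_OF = {"學": "子", "時": "日", "習": "羽", "說": "言"}


def to_labels(punctuated):
    """Strip break marks and tag each remaining character:
    'E' if a break follows it, 'O' otherwise."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in BREAKS:
            if labels:  # ignore a break mark at the very start
                labels[-1] = "E"
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def with_radicals(chars, table=RADICAL_OF):
    """Pair each character with its radical; fall back to the
    character itself when it is missing from the table."""
    return [(ch, table.get(ch, ch)) for ch in chars]


def boundary_f1(gold, pred):
    """F1 over positions tagged 'E' (predicted vs. gold breaks)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == "E" and p == "E")
    if tp == 0:
        return 0.0
    precision = tp / pred.count("E")
    recall = tp / gold.count("E")
    return 2 * precision * recall / (precision + recall)


chars, gold = to_labels("學而時習之，不亦說乎。")
features = with_radicals(chars)
# A made-up prediction that finds the first break but misplaces the second.
pred = ["O", "O", "O", "O", "E", "O", "O", "E", "O"]
score = boundary_f1(gold, pred)
```

In a neural tagger, each (character, radical) pair would typically be mapped to two embedding vectors that are combined before the recurrent layer; how the paper combines them is not stated in the abstract.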
