The Journal of China Universities of Posts and Telecommunications, 2019, Vol. 26, Issue (2): 1-8. doi: 10.19682/j.cnki.1005-8885.2019.1001

• Artificial intelligence •

Sentence segmentation for classical Chinese based on LSTM with radical embedding

Han Xu, Wang Hongsu, Zhang Sanqian, Fu Qunchao, Liu Jun   

    1. School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
    2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
    3. The Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content, Institute of Scientific and Technical Information of China, Beijing 100038, China
    4. Institute of Quantitative Social Science, Harvard University, Cambridge, MA, USA
    5. Department of Statistics, Harvard University, Cambridge, MA, USA
  • Received:2018-08-07 Revised:2019-03-26 Online:2019-04-30 Published:2019-06-14
  • Contact: Corresponding author: Han Xu, E-mail: fionaxuhan@gmail.com
  • Supported by:
    This work was supported by the Fund of the key laboratory of rich-media knowledge organization and service of digital publishing content (ZD2018-07/05).

Abstract: A sub-character-level feature embedding, called radical embedding, is proposed and applied to a long short-term memory (LSTM) model for sentence segmentation of pre-modern Chinese texts. The dataset includes over 150 classical Chinese books from three different dynasties and covers a range of literary styles. The LSTM-conditional random field (LSTM-CRF) model is a state-of-the-art method for sequence labeling problems. The proposed model adds a radical embedding component to it, which leads to improved performance. Experimental results on the aforementioned Chinese books demonstrate better accuracy than earlier methods on sentence segmentation, especially on Tang dynasty epitaph texts (achieving an F1-score of 81.34%).

Key words: LSTM, radical embedding, sentence segmentation
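
The abstract treats sentence segmentation as a sequence labeling problem with radical features added to each character. The sketch below only illustrates that framing and is not the authors' implementation: the two-tag scheme (`E` for a character followed by a break, `O` otherwise), the punctuation set `BOUNDARY_MARKS`, the tiny `RADICALS` table, and the function names are all assumptions introduced here. A real system would feed the character and radical ids into embedding layers of an LSTM-CRF tagger, which is omitted.

```python
# Illustrative sketch only; names and tag scheme are assumptions, not from the paper.

# Hypothetical set of punctuation marks treated as sentence or clause breaks.
BOUNDARY_MARKS = set("。，；？！：、")

# Tiny illustrative character-to-radical table; a real system would use a full dictionary.
RADICALS = {"時": "日", "習": "羽", "說": "言"}


def to_labeled_sequence(punctuated):
    """Strip punctuation and tag each character: 'E' if a break follows it, else 'O'."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in BOUNDARY_MARKS:
            if labels:  # mark the preceding character as ending a segment
                labels[-1] = "E"
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def radical_features(chars):
    """Pair each character with its radical, or '<UNK>' when the table has no entry."""
    return [(ch, RADICALS.get(ch, "<UNK>")) for ch in chars]


def boundary_f1(gold, pred):
    """F1 over predicted break positions (the 'E' tags)."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g == "E" and p == "E")
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred.count("E")
    recall = true_pos / gold.count("E")
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    chars, gold = to_labeled_sequence("學而時習之，不亦說乎。")
    print(radical_features(chars))
    # A hypothetical prediction that finds one of the two breaks and adds one false break.
    pred = ["O", "O", "O", "O", "E", "O", "E", "O", "O"]
    print(boundary_f1(gold, pred))
```

Scoring break positions rather than every tag keeps the metric from being dominated by the far more frequent `O` tag; whether the paper computes its F1-score in exactly this way is not stated in the abstract.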
