Sentence segmentation for classical Chinese based on LSTM with radical embedding

doi:10.19682/j.cnki.1005-8885.2019.1001

The Journal of China Universities of Posts and Telecommunications, 2019, Vol. 26, Issue 2: 1-8.

Han Xu, Wang Hongsu, Zhang Sanqian, Fu Qunchao, Liu Jun

1. School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
3. The Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content, Institute of Scientific and Technical Information of China, Beijing 100038, China
4. Institute for Quantitative Social Science, Harvard University, Cambridge, MA, USA
5. Department of Statistics, Harvard University, Cambridge, MA, USA

Received: 2018-08-07 Revised: 2019-03-26 Online: 2019-04-30 Published: 2019-06-14
Corresponding author: Han Xu, E-mail: fionaxuhan@gmail.com
Supported by:
This work was supported by the Fund of the Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content (ZD2018-07/05).

Abstract

A lower-than-character-level feature embedding called radical embedding is proposed and applied to a long short-term memory (LSTM) model for sentence segmentation of pre-modern Chinese texts. The dataset includes over 150 classical Chinese books from 3 different dynasties and covers different literary styles. The LSTM-conditional random fields (LSTM-CRF) model is a state-of-the-art method for sequence labeling problems. The proposed model adds a radical embedding component, which leads to improved performance. Experimental results on the aforementioned Chinese books demonstrate better accuracy than earlier methods on sentence segmentation, especially on Tang dynasty epitaph texts (achieving an F1-score of 81.34%).
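As a rough illustration of how sentence segmentation can be framed as sequence labeling and scored with an F1 measure, the following Python sketch turns a punctuated passage into characters with per-character boundary tags. The tag names ("E" for a character followed by a break, "O" otherwise), the punctuation set, and both function names are assumptions made for this sketch; the abstract does not specify the tagging scheme or the exact evaluation procedure used in the paper.

```python
# Punctuation treated as a sentence break (an assumption for this sketch).
BOUNDARY_MARKS = set("。，；！？：、")


def to_labeled_sequence(punctuated):
    """Strip punctuation and tag each remaining character.

    A character gets the hypothetical tag "E" if a boundary mark followed it
    in the punctuated text, and "O" otherwise.
    """
    chars, labels = [], []
    for ch in punctuated:
        if ch in BOUNDARY_MARKS:
            if labels:  # ignore a mark that appears before any character
                labels[-1] = "E"
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def boundary_f1(gold, predicted):
    """F1-score over the "E" tags of two equally long tag sequences."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == "E" and p == "E")
    if true_pos == 0:
        return 0.0
    precision = true_pos / predicted.count("E")
    recall = true_pos / gold.count("E")
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    chars, gold = to_labeled_sequence("學而時習之，不亦說乎。")
    print("".join(chars), gold)
    # A made-up prediction with one spurious break, only to exercise the metric.
    predicted = ["O", "O", "E", "O", "E", "O", "O", "O", "E"]
    print(boundary_f1(gold, predicted))
```

A tagger such as an LSTM or LSTM-CRF would be trained to predict these tags from the unpunctuated characters; that model is not shown here.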

Key words: LSTM, radical embedding, sentence segmentation

CLC Number: TP311.5

Han Xu, Wang Hongsu, Zhang Sanqian, Fu Qunchao, Liu Jun. Sentence segmentation for classical Chinese based on LSTM with radical embedding[J]. The Journal of China Universities of Posts and Telecommunications, 2019, 26(2): 1-8.
