Speaker conversion using kernel non-negative matrix factorization

doi:10.1016/S1005-8885(17)60234-6

The Journal of China Universities of Posts and Telecommunications, 2017, Vol. 24, Issue (5): 60-67.

Signal processing

Xu Qinyu, Lu Guanming, Yan Jingjie, Li Haibo, Cheng Xiao

1. College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
2. Jiangsu Province Key Laboratory on Image Processing and Image Communication, Nanjing 210003, China

Received: 2017-01-23 Revised: 2017-09-29 Online: 2017-10-30 Published: 2017-12-18
Corresponding author: Lu Guanming, E-mail: lugm@njupt.edu.cn
Supported by:
This work was supported in part by the National Natural Science Foundation of China (61501249, 61071167, 41601601), the Key Research and Development Program of Jiangsu Province (BE2016775), the Natural Science Foundation of Jiangsu Province for Youth (BK20150855), Research Project of Science and Technology Department of Jiangsu Province (BY2015011-1), the Natural Science Foundation for Jiangsu Higher Education Institutions (15KJB510022), and the Nanjing University of Posts and Telecommunications Science Foundation (NY214143).

Abstract

Voice conversion (VC) based on the Gaussian mixture model (GMM) is the most classic and common method for converting a source spectrum to a target spectrum. However, this method is prone to over-fitting because of its frame-by-frame conversion. This paper presents VC with non-negative matrix factorization (NMF), which can keep the spectrum from over-fitting by adjusting the size of the dictionary (the set of basis vectors). To better realize the non-linear mapping, kernel NMF (KNMF) is adopted for spectrum mapping. In addition, to increase conversion accuracy, KNMF combined with GMM (GKNMF) is also introduced into VC. Finally, KNMF, GKNMF, GMM, principal component regression (PCR), PCR combined with GMM (GPCR), partial least squares regression (PLSR), NMF correlation-based frequency warping (NMF-CFW), and deep neural network (DNN) methods are compared with each other. The proposed GKNMF achieves better performance in both objective and subjective evaluations.
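As a rough illustration of the dictionary-based spectrum mapping described in the abstract, the sketch below uses plain (linear) NMF with multiplicative updates on synthetic data. It is not the authors' KNMF or GKNMF: the kernel step and the GMM combination are omitted, the function names (`train_joint_dictionary`, `nmf_activations`, `convert`) are hypothetical, and the random matrices merely stand in for time-aligned source and target spectra. The `rank` argument plays the role of the dictionary size mentioned above.

```python
import numpy as np


def nmf_activations(V, W, n_iter=200, eps=1e-9, seed=0):
    """Estimate H >= 0 such that V ~ W @ H, keeping the dictionary W fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost ||V - W H||^2.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H


def train_joint_dictionary(X_src, Y_tgt, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize stacked, frame-aligned source/target spectra with shared activations."""
    rng = np.random.default_rng(seed)
    V = np.vstack([X_src, Y_tgt])             # shape (2 * d, n_frames)
    W = rng.random((V.shape[0], rank)) + eps  # joint dictionary
    H = rng.random((rank, V.shape[1])) + eps  # shared activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    d = X_src.shape[0]
    return W[:d], W[d:]                       # source part, target part


def convert(X_new, W_src, W_tgt, n_iter=200):
    """Map source spectra to the target space through the shared activations."""
    H = nmf_activations(X_new, W_src, n_iter=n_iter)
    return W_tgt @ H


# Synthetic, non-negative "parallel" data (an assumption made for this demo only).
rng = np.random.default_rng(42)
d, rank, n_frames = 20, 5, 100
H_true = rng.random((rank, n_frames))
X = rng.random((d, rank)) @ H_true  # stand-in for source spectra
Y = rng.random((d, rank)) @ H_true  # stand-in for target spectra

W_src, W_tgt = train_joint_dictionary(X, Y, rank)
Y_hat = convert(X, W_src, W_tgt)
print("relative error:", np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y))
```

A smaller `rank` gives a more compact dictionary and a smoother mapping, while a larger one fits the training frames more closely; this is the trade-off the abstract refers to when it mentions adjusting the dictionary size.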

Key words: VC, kernel, NMF, spectrum mapping

CLC number: TN912.3

Xu Qinyu, Lu Guanming, Yan Jingjie, Li Haibo, Cheng Xiao. Speaker conversion using kernel non-negative matrix factorization[J]. JOURNAL OF CHINA UNIVERSITIES OF POSTS AND TELECOM, 2017, 24(5): 60-67.
