Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2020, Vol. 14 Issue (2) : 378-387    https://doi.org/10.1007/s11704-018-8030-z
RESEARCH ARTICLE
Sichuan dialect speech recognition with deep LSTM network
Wangyang YING1, Lei ZHANG1(), Hongli DENG1,2
1. Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, China
2. Education and Information Technology Center, China West Normal University, Nanchong 637002, China
Abstract

In speech recognition research, a separate recognition system must be built for each language because of the variety of languages. Dialect speech recognition is especially difficult: dialects contain many special words and colloquial features, and dialect speech data is very scarce. This paper constructs a speech recognition system for the Sichuan dialect by combining a hidden Markov model (HMM) with a deep long short-term memory (LSTM) network. Using this HMM-LSTM architecture, we created a Sichuan dialect dataset and implemented a speech recognition system for it. Unlike a deep neural network (DNN), which captures only a fixed-length window of context, the LSTM network can model long-range temporal dependencies. Moreover, to accurately recognize polyphonic characters and vocabulary with special pronunciations in the Sichuan dialect, we collected all the characters in the dataset together with their common phoneme sequences to form a lexicon. The resulting system achieves an 11.34% character error rate on the Sichuan dialect evaluation dataset, which, to the best of our knowledge, is the best performance reported for this corpus to date.
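The 11.34% figure reported above is the standard character error rate (CER): the Levenshtein edit distance between the recognized transcript and the reference, normalized by the reference length. A minimal sketch of how such a metric is computed (the function names here are our own illustration, not code from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] holds the distance for prefixes of hyp
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def character_error_rate(reference, hypothesis):
    """CER = (substitutions + deletions + insertions) / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted character over a three-character reference:
print(character_error_rate("四川话", "四川画"))  # 1/3 ≈ 0.333
```

For Chinese, the metric is computed over characters rather than words, which is why CER rather than word error rate is reported for this corpus.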

Keywords speech recognition      Sichuan dialect      HMM-DNN      HMM-LSTM      Sichuan dialect lexicon
Corresponding Author(s): Lei ZHANG   
Just Accepted Date: 09 November 2018   Online First Date: 17 September 2019    Issue Date: 16 October 2019
 Cite this article:   
Wangyang YING, Lei ZHANG, Hongli DENG. Sichuan dialect speech recognition with deep LSTM network[J]. Front. Comput. Sci., 2020, 14(2): 378-387.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-018-8030-z
https://academic.hep.com.cn/fcs/EN/Y2020/V14/I2/378