QLSTM-based Joint-Training for Noise Robust Hindi Speech Recognition

Mr. Ankit Kumar

doi:https://doi.org/10.17492/computology.v1i1.2102

Published Online: June 15, 2021

Author Details ( * ) denotes Corresponding author

1. * Ankit Kumar, Department of Computer Science & Information Technology, KIET Group of Institution, Ghaziabad, Uttar Pradesh, India (anketvit@gmail.com)

In recent years, speech recognition has benefited greatly from deep learning. Current technology has delivered substantial improvements; however, speech recognition still does not work well in noisy environments, and improving it under noisy conditions is a critical task. The goal of this work is to propose a high-accuracy, noise-robust Hindi speech recognition system. To this end, we apply a bidirectional Quaternion Long Short-Term Memory (QLSTM) neural network to train the speech enhancement and speech recognition models jointly. The role of the i-vector and of the Recurrent Neural Network (RNN) language model is also investigated. All experiments were carried out using a 2.5-hour Hindi speech dataset and the Kaldi and PyTorch-Kaldi toolkits. The proposed model reports a 2% Word Error Rate (WER) reduction over state-of-the-art (SOTA) techniques.
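Since the reported result is a Word Error Rate reduction, it may help to recall how WER is usually computed: the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring code used in the paper; the function name `word_error_rate` and the example sentences are hypothetical, chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution and one deletion over 4 words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.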

Keywords

Quaternion Neural Network; Joint-training; Hindi Speech Recognition; Noise-Robust ASR
