Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT.

PLoS One

AuDIaS - Audio, Data Intelligence and Speech, Universidad Autónoma de Madrid, Madrid, Spain.

Published: March 2019

Speech analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT), which implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common features in speech processing in recent decades. These features were particularly well suited to the previous Hidden Markov Model/Gaussian Mixture Model (HMM/GMM) state of the art in ASR. In particular, they produced highly uncorrelated features of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal-covariance GMMs, for dealing with the curse of dimensionality, and for the limited computing resources of a decade ago. Currently, most ASR systems use Deep Neural Networks (DNNs) instead of GMMs for modeling the acoustic features, which provides more flexibility in how the features are defined. In particular, acoustic features can be highly correlated and much larger in size, because DNNs are very powerful at processing high-dimensional inputs. Computing hardware has also evolved to the point where computational cost is a less relevant issue in speech processing. In this context we have decided to revisit the problem of time-frequency resolution in speech analysis, and in particular to check whether multi-resolution speech analysis (both in time and frequency) can help improve acoustic modeling using DNNs. Our experiments start with several Kaldi baseline systems for the well-known TIMIT corpus and modify them by adding multi-resolution speech representations, built by concatenating different spectra computed at different time-frequency resolutions, as well as different post-processed and speaker-adapted features computed at different time-frequency resolutions.
Our experiments show that using a multi-resolution speech representation tends to improve on the results obtained with the baseline single-resolution representation, which seems to confirm our main hypothesis. However, combining multi-resolution analysis with the highly post-processed and speaker-adapted features, which provide the best results in Kaldi for TIMIT, yields only very modest improvements.
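
The idea of concatenating spectra computed at different time-frequency resolutions can be sketched as follows. This is only a minimal illustration in Python with NumPy, not the authors' Kaldi pipeline: the window lengths (10, 25 and 50 ms), the 10 ms hop, the Hamming window, and the function names `log_power_spectrogram` and `multi_resolution_features` are all assumptions chosen for the example. It also omits the Mel filterbank and any post-processing or speaker adaptation.

```python
import numpy as np


def log_power_spectrogram(signal, sample_rate, win_ms, hop_ms=10.0):
    """Log power STFT using a Hamming window of win_ms milliseconds."""
    win_len = int(round(sample_rate * win_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    # Zero-pad so that frame i is centred on sample i * hop_len. This keeps
    # the frame count independent of the window length.
    left = win_len // 2
    padded = np.pad(signal, (left, win_len - left), mode="constant")
    n_frames = 1 + len(signal) // hop_len
    window = np.hamming(win_len)
    frames = np.stack(
        [padded[i * hop_len : i * hop_len + win_len] * window
         for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)


def multi_resolution_features(signal, sample_rate, win_sizes_ms=(10.0, 25.0, 50.0)):
    """Concatenate, frame by frame, spectra computed with several window lengths.

    Short windows give better time resolution; long windows give better
    frequency resolution. All share the same hop, so the frames align.
    """
    spectra = [log_power_spectrogram(signal, sample_rate, w) for w in win_sizes_ms]
    return np.concatenate(spectra, axis=1)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr                 # one second of audio
    tone = np.sin(2 * np.pi * 440.0 * t)   # synthetic test signal
    feats = multi_resolution_features(tone, sr)
    print(feats.shape)                     # (frames, total frequency bins)
```

Because every resolution uses the same hop, each row of the result is a single, larger feature vector for one frame, which is the kind of high-dimensional, correlated input the abstract argues DNNs can handle.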

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6179252
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0205355

Publication Analysis

Top Keywords

multi-resolution speech: 16
speech analysis: 16
speech: 11
analysis automatic: 8
automatic speech: 8
speech recognition: 8
deep neural: 8
neural networks: 8
asr systems: 8
time-frequency resolution: 8

Similar Publications

A temporal-spectral generative adversarial network based end-to-end packet loss concealment for wideband speech transmission.

J Acoust Soc Am

October 2021

Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China.

Packet loss concealment (PLC) aims to mitigate speech impairments caused by packet losses so as to improve speech perceptual quality. This paper proposes an end-to-end PLC algorithm with a time-frequency hybrid generative adversarial network, which incorporates a dilated residual convolution and the integration of a time-domain discriminator and frequency-domain discriminator into a convolutional encoder-decoder architecture. The dilated residual convolution is employed to aggregate the short-term and long-term context information of lost speech frames through two network receptive fields with different dilation rates, and the integrated time-frequency discriminators are proposed to learn multi-resolution time-frequency features from correctly received speech frames with both time-domain waveform and frequency-domain complex spectrums.

Continuous dimensional emotion recognition from speech helps robots or virtual agents capture the temporal dynamics of a speaker's emotional state in natural human-robot interactions. Temporal modulation cues obtained directly from the time-domain model of auditory perception can better reflect temporal dynamics than the acoustic features usually processed in the frequency domain. Feature extraction, which can reflect temporal dynamics of emotion from temporal modulation cues, is challenging because of the complexity and diversity of the auditory perception model.

Advancements in tele-medicine have led to the development of portable and cheap hand-held retinal imaging devices. However, the images obtained from these devices have low resolution (LR) and poor quality that may not be suitable for retinal disease diagnosis. Therefore, this paper proposes a novel framework for the super-resolution (SR) of the LR fundus images.

Unlabelled: A speech intelligibility prediction model is proposed that combines the auditory processing front end of the multi-resolution speech-based envelope power spectrum model [mr-sEPSM; Jørgensen, Ewert, and Dau (2013). J. Acoust.
