Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR systems, whether hybrid or end-to-end (E2E), still do not appear to make optimal use of this secondary information stream, as performance remains clearly diminished in noisy conditions for large-vocabulary tasks. We therefore propose a new fusion architecture, the decision fusion net (DFN), in which a broad range of time-variant reliability measures serves as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant performance improvements over previous AVSR systems on large-vocabulary data. The hybrid model with the proposed DFN integration component even outperforms dynamic stream-weighting, which is considered the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline, achieves a relative word-error-rate reduction of 43%, both demonstrating the efficacy of the proposed fusion architecture.
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9370936 | PMC |
| http://dx.doi.org/10.3390/s22155501 | DOI Listing |
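The abstract above describes the DFN only at a high level. As a rough illustration, the following is a minimal sketch, assuming a PyTorch setup, of how per-stream posteriors and time-variant reliability measures could be combined by a small fusion network; the layer sizes, feature dimensions, and two-layer MLP are our own assumptions, not the authors' implementation.

```python
# Minimal sketch of a DFN-style decision-fusion combiner.
# All dimensions, layer sizes, and the two-layer MLP are illustrative
# assumptions; they are not taken from the paper.
import torch
import torch.nn as nn

class DecisionFusionNet(nn.Module):
    def __init__(self, n_states: int, n_reliability: int, hidden: int = 512):
        super().__init__()
        # Input: audio posteriors + video posteriors + reliability measures
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_states + n_reliability, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_states),
        )

    def forward(self, audio_logp, video_logp, reliability):
        # audio_logp, video_logp: (batch, time, n_states) stream log-posteriors
        # reliability: (batch, time, n_reliability) time-variant reliability measures
        x = torch.cat([audio_logp, video_logp, reliability], dim=-1)
        return torch.log_softmax(self.mlp(x), dim=-1)  # fused log-posteriors

# Hypothetical usage: the fused log-posteriors would stand in for the
# single-stream posteriors during decoding in a hybrid system.
fusion = DecisionFusionNet(n_states=3000, n_reliability=8)
fused = fusion(torch.randn(2, 50, 3000), torch.randn(2, 50, 3000), torch.randn(2, 50, 8))
```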
eNeuro
January 2025
Neurophysiology of Everyday Life Group, Department of Psychology, Carl von Ossietzky Universität Oldenburg, Oldenburg 26129, Germany
A comprehensive analysis of everyday sound perception can be achieved using electroencephalography (EEG) with the concurrent acquisition of information about the environment. While extensive research has been dedicated to speech perception, the complexities of auditory perception within everyday environments, specifically the types of information and the key features to extract, remain less explored. Our study aims to systematically investigate the relevance of different feature categories: discrete sound-identity markers, general cognitive-state information, and acoustic representations, including discrete sound onsets, the envelope, and the mel-spectrogram.
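For concreteness, here is a minimal sketch, assuming librosa and SciPy, of how the acoustic representations named above (envelope, mel-spectrogram, discrete onsets) might be computed from a recording; the input file, sampling rate, and all parameters are illustrative assumptions rather than the study's actual pipeline.

```python
# Illustrative extraction of envelope, mel-spectrogram, and sound onsets.
# Library choices (librosa, scipy) and all parameters are assumptions.
import numpy as np
import librosa
from scipy.signal import hilbert

y, sr = librosa.load("recording.wav", sr=16000)  # hypothetical mono recording

# Broadband amplitude envelope via the analytic signal
envelope = np.abs(hilbert(y))

# Log-compressed mel-spectrogram, e.g. 64 bands at a 10 ms hop
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, hop_length=160)
log_mel = librosa.power_to_db(mel)

# Discrete sound onsets (frame indices of detected onsets)
onsets = librosa.onset.onset_detect(y=y, sr=sr, hop_length=160, units="frames")
```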
Digit Health
December 2024
Ostbayerische Technische Hochschule (OTH) Regensburg, Faculty of Health and Social Sciences; Nursing Science, Germany.
J Psycholinguist Res
November 2024
Department of Psychology, University of Milan-Bicocca, Piazza Dell'Ateneo Nuovo, 1, 20126, Milan, Italy.
To avoid misunderstandings, ironic speakers may accompany their ironic remarks with a particular intonation and specific facial expressions that signal that the message should not be taken at face value. The acoustic realization of the ironic tone of voice differs from language to language, whereas the ironic face manifests the speaker's negative stance and might thus have a universal basis. We conducted a study with 574 participants speaking six different languages (French, German, Dutch, English, Mandarin, and Italian, the control group) to verify whether they could recognize ironic remarks uttered in Italian in three different modalities: watching muted videos, listening to audio tracks, and when both cues were present.
Trends Hear
October 2024
Computational Neuroscience of Speech and Hearing, Department of Computational Linguistics, University of Zurich, Zurich, Switzerland.
Comprehending speech in noise (SiN) poses a challenge for older hearing-impaired listeners, requiring auditory and working memory resources. Visual speech cues provide additional sensory information supporting speech understanding, while the extent of such visual benefit is characterized by large variability, which might be accounted for by individual differences in working memory capacity (WMC). In the current study, we investigated behavioral and neurofunctional (i.
Speech-driven facial animation technology is generally categorized into two main types: 3D and 2D talking faces. Both have garnered considerable research attention in recent years. However, to our knowledge, research into 3D talking faces has not progressed as deeply as that of 2D talking faces, particularly in terms of lip-sync and perceptual mouth movements.