Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition.

Sensors (Basel)

Institute of Communication Acoustics, Ruhr University Bochum, 44801 Bochum, Germany.

Published: July 2022

Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small and medium vocabularies. However, current AVSR systems, whether hybrid or end-to-end (E2E), still do not appear to make optimal use of this secondary information stream, as performance in noisy conditions remains clearly diminished for large-vocabulary systems. We therefore propose a new fusion architecture: the decision fusion net (DFN). A broad range of time-variant reliability measures is used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline, achieves a relative word-error-rate reduction of 43%, both demonstrating the efficacy of the proposed fusion architecture.
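The core idea lends itself to a compact illustration: instead of combining the audio and video streams with scalar weights, a small network learns the fusion mapping itself, taking the frame-wise posteriors of both streams together with reliability indicators as input. The following PyTorch sketch is a hypothetical illustration; the class name `DecisionFusionNet`, the layer sizes, and the choice of reliability features are assumptions, not the paper's exact architecture.

```python
# Minimal sketch of reliability-conditioned decision fusion (an assumed
# configuration, not the authors' exact DFN): fuse per-frame audio and
# video posteriors, conditioned on time-variant reliability measures.
import torch
import torch.nn as nn

class DecisionFusionNet(nn.Module):
    def __init__(self, num_states: int, num_reliability: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * num_states + num_reliability, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_states),
            nn.LogSoftmax(dim=-1),  # fused log-posteriors per frame
        )

    def forward(self, audio_logp, video_logp, reliability):
        # All inputs are (batch, time, features); concatenate feature-wise.
        x = torch.cat([audio_logp, video_logp, reliability], dim=-1)
        return self.net(x)

# Toy usage: 10 frames, 500 states, 5 reliability measures per frame
# (e.g., estimated SNR or per-stream confidence; illustrative only).
dfn = DecisionFusionNet(num_states=500, num_reliability=5)
a = torch.log_softmax(torch.randn(1, 10, 500), dim=-1)  # audio stream
v = torch.log_softmax(torch.randn(1, 10, 500), dim=-1)  # video stream
r = torch.randn(1, 10, 5)                               # reliability cues
fused = dfn(a, v, r)  # (1, 10, 500) fused log-posteriors
```

In a hybrid setup, such fused log-posteriors could replace the single-stream acoustic scores passed to the decoder; the same fusion idea carries over to E2E models by fusing the streams' token posteriors instead.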


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9370936
DOI: http://dx.doi.org/10.3390/s22155501

Publication Analysis

Top Keywords: audio-visual speech (8), speech recognition (8), improve performance (8), large-vocabulary datasets (8), model proposed (8), proposed dfn (8), dynamic stream-weighting (8), achieves relative (8), reliability-based large-vocabulary (4), large-vocabulary audio-visual (4)

Similar Publications

A comprehensive analysis of everyday sound perception can be achieved using electroencephalography (EEG) with the concurrent acquisition of information about the environment. While extensive research has been dedicated to speech perception, the complexities of auditory perception within everyday environments, specifically the types of information and the key features to extract, remain less explored. Our study aims to systematically investigate the relevance of different feature categories: discrete sound-identity markers, general cognitive state information, and acoustic representations, including discrete sound onsets, the envelope, and the mel-spectrogram.
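For readers unfamiliar with these representations, the short sketch below computes a mel-spectrogram, an amplitude envelope (as per-frame RMS energy), and discrete sound onsets for a mono signal using librosa; the test signal and frame parameters are arbitrary assumptions, not the study's configuration.

```python
# Illustrative acoustic feature extraction (assumed parameters).
import numpy as np
import librosa

# Synthetic one-second 440 Hz tone; any mono waveform works here.
sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr).astype(np.float32)

# Mel-spectrogram: time-frequency energy on a perceptual frequency scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, hop_length=512)

# Envelope: per-frame RMS energy of the waveform.
envelope = librosa.feature.rms(y=y, hop_length=512)[0]

# Discrete sound onsets: frame indices where new acoustic events begin.
onsets = librosa.onset.onset_detect(y=y, sr=sr, hop_length=512, units="frames")

print(mel.shape, envelope.shape, onsets)
```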

Article Synopsis
  • The adoption of digital synchronous video communication for telecare and teletherapy has surged recently, fueled by COVID-19 and a broader trend toward digital healthcare in the past two decades.
  • A study involving 20 qualitative interviews with health professionals and patients from Germany, Austria, and Switzerland identified six main categories and 20 sub-categories that can influence the effectiveness of telesettings, highlighting the importance of motivation and digital skills.
  • The findings suggest a need for structured guidelines and training to support telesetting, emphasizing the adaptation of methodologies to incorporate audio-visual technology effectively.

Cross-Linguistic Recognition of Irony Through Visual and Acoustic Cues.

J Psycholinguist Res

November 2024

Department of Psychology, University of Milan-Bicocca, Piazza Dell'Ateneo Nuovo, 1, 20126, Milan, Italy.

To avoid misunderstandings, ironic speakers may accompany their ironic remarks with a particular intonation and specific facial expressions that signal that the message should not be taken at face value. The acoustic realization of the ironic tone of voice differs from language to language, whereas the ironic face manifests the speaker's negative stance and might thus have a universal basis. We conducted a study on 574 participants speaking six different languages (French, German, Dutch, English, Mandarin, and Italian, the latter serving as the control group) to verify whether they could recognize ironic remarks uttered in Italian in three different modalities: watching muted videos, listening to audio tracks, and watching videos with both cues present.


Comprehending speech in noise (SiN) poses a challenge for older hearing-impaired listeners, requiring auditory and working memory resources. Visual speech cues provide additional sensory information supporting speech understanding, while the extent of such visual benefit is characterized by large variability, which might be accounted for by individual differences in working memory capacity (WMC). In the current study, we investigated behavioral and neurofunctional (i.e., …)


Speech-driven facial animation technology is generally categorized into two main types: 3D and 2D talking faces. Both have garnered considerable research attention in recent years. However, to our knowledge, research into 3D talking faces has not progressed as deeply as that into 2D talking faces, particularly in terms of lip-sync and perceptual mouth movements.

