Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications.

Sensors (Basel)

Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea.

Published: October 2022

Speech is a commonly used interaction technique in edutainment systems and a key technology for smooth educational learning and user-system interaction. However, its application in real environments is limited by various ambient noise disruptions. In this study, a multimodal interaction system based on audio and visual information is proposed that makes speech-driven virtual aquarium systems robust to ambient noise. For audio-based speech recognition, the list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. The vectors derived from the API and from vision are then concatenated and classified. The signal-to-noise ratio of the proposed system was determined using data from four types of noise environments, and the system was tested for accuracy and efficiency against existing single-modality approaches to visual feature extraction and audio speech recognition. Its average recognition rate was 91.42% when only speech was used and improved by 6.7 percentage points, to 98.12%, when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly used, such as cafés, museums, music halls, and kiosks.
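As a rough illustration of the fusion step the abstract describes, the following PyTorch sketch concatenates an audio word vector with visual features and classifies the fused vector. All dimensions, layer sizes, and names are hypothetical; the paper's actual model is a composite end-to-end network, not this two-layer head.

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    # Classifies an utterance from concatenated audio and visual vectors.
    def __init__(self, audio_dim=300, visual_dim=512, num_classes=50):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, audio_vec, visual_vec):
        # Concatenate the two modality vectors, then classify the result.
        fused = torch.cat([audio_vec, visual_vec], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
audio_vec = torch.randn(1, 300)   # word vector for the speech API's hypothesis (hypothetical size)
visual_vec = torch.randn(1, 512)  # features from the lip-reading network (hypothetical size)
logits = model(audio_vec, visual_vec)  # shape: (1, 50), one score per candidate word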

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9609693
DOI: http://dx.doi.org/10.3390/s22207738

Publication Analysis

Top Keywords

speech recognition (20)
speech (9)
real environments (8)
audio visual (8)
recognition (6)
noise-robust multimodal (4)
multimodal audio-visual (4)
audio-visual speech (4)
recognition system (4)
system speech-based (4)

Similar Publications

Speech Technology for Automatic Recognition and Assessment of Dysarthric Speech: An Overview.

J Speech Lang Hear Res

January 2025

Centre for Language Studies, Radboud University, Nijmegen, the Netherlands.

Purpose: In this review article, we present an extensive overview of recent developments in dysarthric speech research. A key objective of speech technology research is to improve the quality of life of its users, as evidenced by current research trends focused on creating inclusive conversational interfaces that cater to pathological speech, of which dysarthric speech is an important example. Applications of speech technology to dysarthric speech demand a clear understanding of the acoustics of dysarthric speech as well as of speech technologies, including machine learning and deep neural networks for speech processing.

Automated analysis of spoken language differentiates multiple system atrophy from Parkinson's disease.

J Neurol

January 2025

Department of Circuit Theory, Faculty of Electrical Engineering, Czech Technical University in Prague, Technická 2, Praha 6, 16000, Prague, Czech Republic.

Background and Objectives: Patients with synucleinopathies such as multiple system atrophy (MSA) and Parkinson's disease (PD) frequently display speech and language abnormalities. We explore the diagnostic potential of automated linguistic analysis of natural spontaneous speech to differentiate MSA from PD.

Methods: Spontaneous speech from 39 participants with MSA, 39 drug-naive participants with PD, and 39 healthy controls matched for age and sex was transcribed and linguistically annotated using automatic speech recognition and natural language processing.
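A hedged sketch of what such an annotation stage can look like follows: a transcript (here a placeholder string standing in for real ASR output) is annotated with spaCy and reduced to one simple linguistic feature. The study's actual toolchain and feature set are not specified here.

import spacy
from collections import Counter

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Placeholder transcript standing in for real ASR output.
transcript = "well I I went to the the store and um I forgot what I wanted"

doc = nlp(transcript)
pos_counts = Counter(tok.pos_ for tok in doc if not tok.is_punct)
total = sum(pos_counts.values())

# One illustrative feature: ratio of content words (nouns, verbs, adjectives, adverbs),
# the kind of interpretable measure used to compare clinical groups.
content = sum(pos_counts[p] for p in ("NOUN", "VERB", "ADJ", "ADV"))
print({"tokens": total, "content_word_ratio": round(content / total, 3)})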

Background: The increasing bureaucratic burden in everyday clinical practice impairs doctor-patient communication (DPC). Effective use of digital technologies, such as automated semantic speech recognition (ASR) with automated extraction of diagnostically relevant information, can provide a solution.

Objective: The aim was to determine the extent to which ASR in conjunction with semantic information extraction for automated documentation of the doctor-patient dialogue (ADAPI) can be integrated into everyday clinical practice using the IVI routine as an example and whether patient care can be improved through process optimization.
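As a toy illustration of pairing ASR output with semantic information extraction, the snippet below pulls a few invented ophthalmology items out of a transcript with regular expressions. The ADAPI system's actual extraction is of course far more sophisticated; every term and pattern here is an invented example.

import re

transcript = (
    "Patient reports blurred vision in the left eye for two weeks. "
    "Visual acuity 20/40. Plan: intravitreal injection next Tuesday."
)

# Invented example patterns for diagnostically relevant items.
patterns = {
    "symptom":   r"blurred vision|eye pain|floaters",
    "finding":   r"visual acuity \d+/\d+",
    "procedure": r"intravitreal injection",
}

# Map each field to every match found in the transcript.
structured = {
    field: re.findall(regex, transcript, flags=re.IGNORECASE)
    for field, regex in patterns.items()
}
print(structured)
# {'symptom': ['blurred vision'], 'finding': ['Visual acuity 20/40'], 'procedure': ['intravitreal injection']}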

The classical view is that perceptual attunement to the native language, which emerges by 6-10 months, developmentally precedes phonological feature abstraction abilities. That assumption is challenged by findings from adults adopted into a new language environment at 3-5 months that imply they had already formed phonological feature abstractions about their birth language prior to 6 months. As phonological feature abstraction had not been directly tested in infants, we examined 4-6-month-olds' amodal abstraction of the labial versus coronal place of articulation distinction between consonants.

Artificial neural networks (ANNs) are applied in numerous fields, including image and speech recognition, natural language processing, and autonomous vehicles. Intrusion detection, the subject of this paper, also relies heavily on them, and different intrusion detection models have been constructed using ANNs.
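For a minimal sense of what an ANN-based intrusion detector looks like, here is a small scikit-learn MLP trained on synthetic connection features. Real work would use benchmark traffic datasets (e.g., NSL-KDD) and richer architectures; this is illustrative only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for per-connection features (duration, bytes, flags, ...).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Two hidden layers; labels: 0 = benign traffic, 1 = attack.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")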
