AI Article Synopsis

  • Audio-visual speech recognition (AVSR) improves speech recognition accuracy by incorporating visual information, especially in noisy environments, while hand gestures enhance human-computer interactions.
  • The study introduces two deep neural network models for AVSR and gesture recognition, focusing on advanced fine-tuning strategies and three methods for merging audio and visual data.
  • Testing on the LRW and AUTSL datasets yielded outstanding results, achieving 98.76% accuracy for AVSR and 98.56% for gesture recognition, showcasing the effectiveness of the proposed methods for mobile device applications.

Article Abstract

Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be a very important part of modern human-computer interaction systems. Currently, audio and video modalities are easily accessible via the sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora, LRW and AUTSL, and outperformed existing methods on both the audio-visual speech recognition and gesture recognition tasks. We achieved an AVSR accuracy of 98.76% on the LRW dataset and a gesture recognition rate of 98.56% on the AUTSL dataset. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures with the sensors of mobile devices.
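The abstract distinguishes three ways of merging the audio and visual streams: prediction-level, feature-level, and model-level fusion. The following is a minimal PyTorch sketch of what these three strategies can look like; the pooled encoder embeddings, dimensions, module names, and class count are illustrative assumptions, not the authors' implementation.

# Minimal sketch of prediction-level, feature-level, and model-level fusion.
# All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 500            # e.g. LRW word classes (assumption)
AUDIO_DIM, VIDEO_DIM = 256, 256

class PredictionLevelFusion(nn.Module):
    """Each modality is classified separately; class scores are averaged."""
    def __init__(self):
        super().__init__()
        self.audio_head = nn.Linear(AUDIO_DIM, NUM_CLASSES)
        self.video_head = nn.Linear(VIDEO_DIM, NUM_CLASSES)

    def forward(self, a_feat, v_feat):
        return 0.5 * (self.audio_head(a_feat) + self.video_head(v_feat))

class FeatureLevelFusion(nn.Module):
    """Modality embeddings are concatenated before a single joint classifier."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(AUDIO_DIM + VIDEO_DIM, NUM_CLASSES)

    def forward(self, a_feat, v_feat):
        return self.head(torch.cat([a_feat, v_feat], dim=-1))

class ModelLevelFusion(nn.Module):
    """Intermediate representations interact inside a shared trainable block."""
    def __init__(self):
        super().__init__()
        self.mixer = nn.Sequential(
            nn.Linear(AUDIO_DIM + VIDEO_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.head = nn.Linear(512, NUM_CLASSES)

    def forward(self, a_feat, v_feat):
        return self.head(self.mixer(torch.cat([a_feat, v_feat], dim=-1)))

if __name__ == "__main__":
    a = torch.randn(8, AUDIO_DIM)    # pooled acoustic embeddings
    v = torch.randn(8, VIDEO_DIM)    # pooled lip-region embeddings
    for fusion in (PredictionLevelFusion(), FeatureLevelFusion(), ModelLevelFusion()):
        print(type(fusion).__name__, fusion(a, v).shape)  # (8, NUM_CLASSES)

In this sketch, prediction-level fusion averages per-modality class scores, feature-level fusion concatenates embeddings before one classifier, and model-level fusion passes the joint representation through a shared trainable block before classification.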

Download full-text PDF

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9967234
DOI: http://dx.doi.org/10.3390/s23042284

Publication Analysis

Top Keywords

gesture recognition (28)
audio-visual speech (24)
speech recognition (16)
sensors mobile (12)
mobile devices (12)
recognition (11)
speech gesture (8)
main novelty (8)
recognition lies (8)
dataset equal (8)

Similar Publications

Automatic Sign Language Recognition (ASLR) systems offer smooth communication between hearing-impaired and normal-hearing individuals, enhancing educational opportunities for the hearing-impaired. However, they struggle with the "curse of dimensionality" caused by excessive features, resulting in prolonged training times and exhaustive computational demands. This paper proposes a technique that integrates machine learning and swarm intelligence to effectively address this issue.
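The entry above pairs a classifier with swarm intelligence to prune an oversized feature set. Since the paper's exact algorithm is not reproduced here, the sketch below uses binary particle swarm optimisation with a k-nearest-neighbour fitness score purely as an illustration of swarm-based feature selection; the stand-in dataset, swarm size, and coefficients are all assumptions.

# Illustrative binary PSO feature selection (not the paper's exact method).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)            # stand-in for gesture features
n_particles, n_features, n_iters = 20, X.shape[1], 15

def fitness(mask):
    # Accuracy of a 3-NN classifier on the selected features,
    # minus a small penalty for keeping many features.
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=3)
    acc = cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()
    return acc - 0.01 * mask.mean()

pos = rng.integers(0, 2, (n_particles, n_features)).astype(float)
vel = rng.normal(0, 0.1, (n_particles, n_features))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    # Sigmoid transfer turns velocities into bit-flip probabilities.
    pos = (rng.random(pos.shape) < 1 / (1 + np.exp(-vel))).astype(float)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", int(gbest.sum()), "fitness:", round(pbest_fit.max(), 3))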


Myoelectric control has emerged as a promising approach for a wide range of applications, including controlling limb prosthetics, teleoperating robots and enabling immersive interactions in the Metaverse. However, the accuracy and robustness of myoelectric control systems are often affected by various factors, including muscle fatigue, perspiration, drifts in electrode positions and changes in arm position. The latter has received less attention despite its significant impact on signal quality and decoding accuracy.


Advancements in sign language processing technology hinge on the availability of extensive, reliable datasets, comprehensive instructions, and adherence to ethical guidelines. To facilitate progress in gesture recognition and translation systems and to support the Azerbaijani sign language community, we present the Azerbaijani Sign Language Dataset (AzSLD). This comprehensive dataset was collected from a diverse group of sign language users, encompassing a range of linguistic parameters.


Neurodynamic observations indicate that the cerebral cortex evolved by self-organizing into functional networks. These networks, or distributed clusters of regions, display attention maps of varying degrees depending on the input. Traditionally, the study of network self-organization has relied predominantly on static data, overlooking the temporal information in dynamic neuromorphic data. This paper proposes a Temporal Self-Organizing (TSO) method for neuromorphic data processing using a spiking neural network.
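The entry above contrasts static data with temporal neuromorphic streams processed by a spiking neural network. As a rough illustration only (the TSO method itself is not reproduced here), the sketch below runs a leaky integrate-and-fire layer over an event sequence; the time constants, threshold, and weight scale are assumptions.

# Illustrative leaky integrate-and-fire (LIF) layer over a spike sequence;
# this is not the TSO method, just a generic spiking-network time step.
import numpy as np

rng = np.random.default_rng(1)
T, n_in, n_out = 100, 64, 16           # time steps, input / output neurons
tau, v_th = 0.9, 1.0                   # membrane decay and firing threshold (assumed)

weights = rng.normal(0, 0.3, (n_in, n_out))
events = (rng.random((T, n_in)) < 0.1).astype(float)   # sparse input spike train

v = np.zeros(n_out)                    # membrane potentials
out_spikes = np.zeros((T, n_out))
for t in range(T):
    v = tau * v + events[t] @ weights  # leak plus integrated input current
    fired = v >= v_th
    out_spikes[t] = fired
    v[fired] = 0.0                     # reset neurons that fired

print("output spike rate per neuron:", out_spikes.mean(axis=0).round(2))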


A real-time approach for surgical activity recognition and prediction based on transformer models in robot-assisted surgery.

Int J Comput Assist Radiol Surg

January 2025

Advanced Medical Devices Laboratory, Kyushu University, Nishi-ku, Fukuoka, 819-0382, Japan.

Purpose: This paper presents a deep learning approach to recognize and predict surgical activity in robot-assisted minimally invasive surgery (RAMIS). Our primary objective is to deploy the developed model for implementing a real-time surgical risk monitoring system within the realm of RAMIS.

Methods: We propose a modified Transformer model whose architecture comprises no positional encoding, five fully connected layers, one encoder, and three decoders.
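A rough PyTorch sketch of the layout described in the Methods (one encoder layer, three decoder layers, five fully connected layers, no positional encoding) is given below; the feature dimensions, class count, and placement of the fully connected stack are assumptions, not the authors' implementation.

# Hypothetical sketch of the described modified Transformer.
import torch
import torch.nn as nn

D_MODEL, N_HEAD, N_CLASSES = 128, 4, 10    # assumed sizes

class SurgicalActivityTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # One encoder layer, three decoder layers, no positional encoding added.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True), num_layers=1)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, N_HEAD, batch_first=True), num_layers=3)
        # Five fully connected layers mapping decoder output to activity classes.
        self.fc = nn.Sequential(
            nn.Linear(D_MODEL, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, N_CLASSES),
        )

    def forward(self, src, tgt):
        memory = self.encoder(src)
        return self.fc(self.decoder(tgt, memory))

model = SurgicalActivityTransformer()
src = torch.randn(2, 50, D_MODEL)            # e.g. 50 frames of kinematic features
tgt = torch.randn(2, 10, D_MODEL)            # decoding horizon for prediction
print(model(src, tgt).shape)                 # torch.Size([2, 10, N_CLASSES])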

