AI Article Synopsis

  • Audio-visual speech recognition (AVSR) improves speech recognition accuracy by incorporating visual information, especially in noisy environments, while hand gestures enhance human-computer interactions.
  • The study introduces two deep neural network models for AVSR and gesture recognition, focusing on advanced fine-tuning strategies and three methods for merging audio and visual data.
  • Testing on the LRW and AUTSL datasets yielded outstanding results, achieving 98.76% accuracy for AVSR and 98.56% for gesture recognition, showcasing the effectiveness of the proposed methods for mobile device applications.

Article Abstract

Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be a very important part of modern human-computer interaction systems. Currently, audio and video modalities are easily accessible via the sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora, LRW and AUTSL, and outperformed existing methods on both the audio-visual speech recognition and gesture recognition tasks. We achieved an AVSR accuracy of 98.76% on the LRW dataset and a gesture recognition rate of 98.56% on the AUTSL dataset. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures with the sensors of mobile devices.
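The abstract distinguishes three ways of merging the audio and visual streams: prediction-level, feature-level, and model-level fusion. The following is a minimal PyTorch sketch of what these three strategies can look like; the pooled encoder embeddings, dimensions, module names, and class count are illustrative assumptions, not the authors' implementation.

# Minimal sketch of prediction-level, feature-level, and model-level fusion.
# All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 500            # e.g. LRW word classes (assumption)
AUDIO_DIM, VIDEO_DIM = 256, 256

class PredictionLevelFusion(nn.Module):
    """Each modality is classified separately; class scores are averaged."""
    def __init__(self):
        super().__init__()
        self.audio_head = nn.Linear(AUDIO_DIM, NUM_CLASSES)
        self.video_head = nn.Linear(VIDEO_DIM, NUM_CLASSES)

    def forward(self, a_feat, v_feat):
        return 0.5 * (self.audio_head(a_feat) + self.video_head(v_feat))

class FeatureLevelFusion(nn.Module):
    """Modality embeddings are concatenated before a single joint classifier."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(AUDIO_DIM + VIDEO_DIM, NUM_CLASSES)

    def forward(self, a_feat, v_feat):
        return self.head(torch.cat([a_feat, v_feat], dim=-1))

class ModelLevelFusion(nn.Module):
    """Intermediate representations interact inside a shared trainable block."""
    def __init__(self):
        super().__init__()
        self.mixer = nn.Sequential(
            nn.Linear(AUDIO_DIM + VIDEO_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.head = nn.Linear(512, NUM_CLASSES)

    def forward(self, a_feat, v_feat):
        return self.head(self.mixer(torch.cat([a_feat, v_feat], dim=-1)))

if __name__ == "__main__":
    a = torch.randn(8, AUDIO_DIM)    # pooled acoustic embeddings
    v = torch.randn(8, VIDEO_DIM)    # pooled lip-region embeddings
    for fusion in (PredictionLevelFusion(), FeatureLevelFusion(), ModelLevelFusion()):
        print(type(fusion).__name__, fusion(a, v).shape)  # (8, NUM_CLASSES)

In this sketch, prediction-level fusion averages per-modality class scores, feature-level fusion concatenates embeddings before one classifier, and model-level fusion passes the joint representation through a shared trainable block before classification.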

Download full-text PDF

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9967234
DOI: http://dx.doi.org/10.3390/s23042284

Publication Analysis

Top Keywords

gesture recognition (28)
audio-visual speech (24)
speech recognition (16)
sensors mobile (12)
mobile devices (12)
recognition (11)
speech gesture (8)
main novelty (8)
recognition lies (8)
dataset equal (8)

Similar Publications

Automatic Sign Language Recognition (ASLR) systems offer smooth communication between hearing-impaired and normal-hearing individuals, enhancing educational opportunities for the hearing-impaired. However, they struggle with the "curse of dimensionality" caused by excessive features, resulting in prolonged training times and exhaustive computational demands. This paper proposes a technique that integrates machine learning and swarm intelligence to effectively address this issue.
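The entry above pairs a classifier with swarm intelligence to prune an oversized feature set. Since the paper's exact algorithm is not reproduced here, the sketch below uses binary particle swarm optimisation with a k-nearest-neighbour fitness score purely as an illustration of swarm-based feature selection; the stand-in dataset, swarm size, and coefficients are all assumptions.

# Illustrative binary PSO feature selection (not the paper's exact method).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)            # stand-in for gesture features
n_particles, n_features, n_iters = 20, X.shape[1], 15

def fitness(mask):
    # Accuracy of a 3-NN classifier on the selected features,
    # minus a small penalty for keeping many features.
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=3)
    acc = cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()
    return acc - 0.01 * mask.mean()

pos = rng.integers(0, 2, (n_particles, n_features)).astype(float)
vel = rng.normal(0, 0.1, (n_particles, n_features))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    # Sigmoid transfer turns velocities into bit-flip probabilities.
    pos = (rng.random(pos.shape) < 1 / (1 + np.exp(-vel))).astype(float)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", int(gbest.sum()), "fitness:", round(pbest_fit.max(), 3))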


Myoelectric control has emerged as a promising approach for a wide range of applications, including controlling limb prosthetics, teleoperating robots and enabling immersive interactions in the Metaverse. However, the accuracy and robustness of myoelectric control systems are often affected by various factors, including muscle fatigue, perspiration, drifts in electrode positions and changes in arm position. The latter has received less attention despite its significant impact on signal quality and decoding accuracy.


Advancements in sign language processing technology hinge on the availability of extensive, reliable datasets, comprehensive instructions, and adherence to ethical guidelines. To facilitate progress in gesture recognition and translation systems and to support the Azerbaijani sign language community, we present the Azerbaijani Sign Language Dataset (AzSLD). This comprehensive dataset was collected from a diverse group of sign language users, encompassing a range of linguistic parameters.


Neurodynamic observations indicate that the cerebral cortex evolved by self-organizing into functional networks. These networks, or distributed clusters of regions, display attention maps of varying degrees depending on the input. Traditionally, the study of network self-organization has relied predominantly on static data, overlooking the temporal information in dynamic neuromorphic data. This paper proposes a Temporal Self-Organizing (TSO) method for neuromorphic data processing using a spiking neural network.
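The entry above contrasts static data with temporal neuromorphic streams processed by a spiking neural network. As a rough illustration only (the TSO method itself is not reproduced here), the sketch below runs a leaky integrate-and-fire layer over an event sequence; the time constants, threshold, and weight scale are assumptions.

# Illustrative leaky integrate-and-fire (LIF) layer over a spike sequence;
# this is not the TSO method, just a generic spiking-network time step.
import numpy as np

rng = np.random.default_rng(1)
T, n_in, n_out = 100, 64, 16           # time steps, input / output neurons
tau, v_th = 0.9, 1.0                   # membrane decay and firing threshold (assumed)

weights = rng.normal(0, 0.3, (n_in, n_out))
events = (rng.random((T, n_in)) < 0.1).astype(float)   # sparse input spike train

v = np.zeros(n_out)                    # membrane potentials
out_spikes = np.zeros((T, n_out))
for t in range(T):
    v = tau * v + events[t] @ weights  # leak plus integrated input current
    fired = v >= v_th
    out_spikes[t] = fired
    v[fired] = 0.0                     # reset neurons that fired

print("output spike rate per neuron:", out_spikes.mean(axis=0).round(2))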


A real-time approach for surgical activity recognition and prediction based on transformer models in robot-assisted surgery.

Int J Comput Assist Radiol Surg

January 2025

Advanced Medical Devices Laboratory, Kyushu University, Nishi-ku, Fukuoka, 819-0382, Japan.

Purpose: This paper presents a deep learning approach to recognize and predict surgical activity in robot-assisted minimally invasive surgery (RAMIS). Our primary objective is to deploy the developed model for implementing a real-time surgical risk monitoring system within the realm of RAMIS.

Methods: We propose a modified Transformer model whose architecture comprises no positional encoding, five fully connected layers, one encoder, and three decoders.
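A rough PyTorch sketch of the layout described in the Methods (one encoder layer, three decoder layers, five fully connected layers, no positional encoding) is given below; the feature dimensions, class count, and placement of the fully connected stack are assumptions, not the authors' implementation.

# Hypothetical sketch of the described modified Transformer.
import torch
import torch.nn as nn

D_MODEL, N_HEAD, N_CLASSES = 128, 4, 10    # assumed sizes

class SurgicalActivityTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # One encoder layer, three decoder layers, no positional encoding added.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True), num_layers=1)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, N_HEAD, batch_first=True), num_layers=3)
        # Five fully connected layers mapping decoder output to activity classes.
        self.fc = nn.Sequential(
            nn.Linear(D_MODEL, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, N_CLASSES),
        )

    def forward(self, src, tgt):
        memory = self.encoder(src)
        return self.fc(self.decoder(tgt, memory))

model = SurgicalActivityTransformer()
src = torch.randn(2, 50, D_MODEL)            # e.g. 50 frames of kinematic features
tgt = torch.randn(2, 10, D_MODEL)            # decoding horizon for prediction
print(model(src, tgt).shape)                 # torch.Size([2, 10, N_CLASSES])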

