Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be an important part of modern human-computer interaction systems. Audio and video modalities are now easily accessible through the sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora, LRW and AUTSL, and outperformed existing methods on both audio-visual speech recognition and gesture recognition tasks. We achieved an AVSR accuracy of 98.76% on the LRW dataset and a gesture recognition rate of 98.56% on the AUTSL dataset. The results demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures with the sensors of mobile devices.
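As a rough illustration of two of the fusion approaches named above, the following minimal sketch contrasts prediction-level fusion (combining per-modality class probabilities) with feature-level fusion (concatenating per-modality features before a single classifier). It is not the authors' model: the function names, the weighting scheme, and the three-class scores are all hypothetical, and model-level fusion, which combines intermediate representations inside the network, is not shown.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_level_fusion(audio_scores, video_scores, audio_weight=0.5):
    """Late fusion: weighted average of the per-modality class posteriors.

    audio_weight is a hypothetical tuning parameter, not a value from the paper.
    """
    p_audio = softmax(audio_scores)
    p_video = softmax(video_scores)
    return [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(p_audio, p_video)
    ]


def feature_level_fusion(audio_features, video_features):
    """Early fusion: concatenate features so one classifier sees both modalities."""
    return list(audio_features) + list(video_features)


# Hypothetical three-class scores from separate audio and video models.
audio_scores = [2.0, 1.0, 0.1]
video_scores = [0.5, 2.5, 0.3]

# Trust the video stream more, as one might when the audio is noisy.
fused = prediction_level_fusion(audio_scores, video_scores, audio_weight=0.3)
predicted_class = max(range(len(fused)), key=fused.__getitem__)

# Hypothetical feature vectors of different lengths.
joint_features = feature_level_fusion([0.1, 0.2], [0.3, 0.4, 0.5])
```

Lowering `audio_weight` shifts the fused decision toward the visual stream, which is the usual motivation for prediction-level fusion under acoustic noise.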
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9967234 | PMC |
http://dx.doi.org/10.3390/s23042284 | DOI Listing |
Sci Rep
January 2025
University Institute of Computing, Chandigarh University, Punjab, India.
Automatic Sign Language Recognition (ASLR) systems offer smooth communication between hearing-impaired and normal-hearing individuals, enhancing educational opportunities for the hearing impaired. However, they struggle with the "curse of dimensionality": excessive features result in prolonged training time and exhaustive computational demand. This paper proposes a technique that integrates machine learning and swarm intelligence to effectively address this issue.
Sci Data
January 2025
School of Informatics, The University of Edinburgh, Edinburgh, EH8 9AB, United Kingdom.
Myoelectric control has emerged as a promising approach for a wide range of applications, including controlling limb prosthetics, teleoperating robots and enabling immersive interactions in the Metaverse. However, the accuracy and robustness of myoelectric control systems are often affected by various factors, including muscle fatigue, perspiration, drifts in electrode positions and changes in arm position. The latter has received less attention despite its significant impact on signal quality and decoding accuracy.
Data Brief
February 2025
ADA University, Baku, Azerbaijan.
Advancements in sign language processing technology hinge on the availability of extensive, reliable datasets, comprehensive instructions, and adherence to ethical guidelines. To facilitate progress in gesture recognition and translation systems and to support the Azerbaijani sign language community, we present the Azerbaijani Sign Language Dataset (AzSLD). This comprehensive dataset was collected from a diverse group of sign language users, encompassing a range of linguistic parameters.
Cogn Neurodyn
December 2025
Shanghai University, Shanghai, China.
Neurodynamic observations indicate that the cerebral cortex evolved by self-organizing into functional networks. These networks, or distributed clusters of regions, display various degrees of attention maps based on input. Traditionally, the study of network self-organization relies predominantly on static data, overlooking temporal information in dynamic neuromorphic data. This paper proposes a Temporal Self-Organizing (TSO) method for neuromorphic data processing using a spiking neural network.
Int J Comput Assist Radiol Surg
January 2025
Advanced Medical Devices Laboratory, Kyushu University, Nishi-ku, Fukuoka, 819-0382, Japan.
Purpose: This paper presents a deep learning approach to recognize and predict surgical activity in robot-assisted minimally invasive surgery (RAMIS). Our primary objective is to deploy the developed model in a real-time surgical risk monitoring system for RAMIS.
Methods: We propose a modified Transformer model whose architecture comprises no positional encoding, 5 fully connected layers, 1 encoder, and 3 decoders.