Speech emotion recognition (SER) is an important application in Affective Computing and Artificial Intelligence. Recently, there has been significant interest in Deep Neural Networks that operate on speech spectrograms. Because the two-dimensional spectrogram representation includes more speech characteristics, convolutional neural networks (CNNs) and other advanced image recognition models are increasingly used to learn deep patterns in a spectrogram for SER. Accordingly, in this study, we propose a novel SER model that learns from the utterance-level spectrogram. First, we use the Spatial Pyramid Pooling (SPP) strategy to remove the input-size constraint associated with CNN-based image recognition. The SPP layer then extracts both a global-level prominent feature vector and multi-local-level feature vectors, and an attention model weighs these feature vectors. Finally, we apply the ArcFace layer, typically used for face recognition, to the SER task, thereby improving SER performance. Our model achieved an unweighted accuracy of 67.9% on the IEMOCAP dataset and 77.6% on the EMODB dataset.
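As an illustration of two ideas named in the abstract, the sketch below shows how spatial pyramid pooling can turn a feature map of any size into a fixed-length vector, and how an ArcFace-style additive angular margin changes the logit of the target class. It is a minimal pure-Python sketch and not the authors' implementation: the function names, the pyramid levels (1, 2, 4), and the scale and margin values are assumptions chosen for illustration, and the attention step is omitted.

```python
import math


def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map into a fixed-length vector.

    For each pyramid level n the map is split into an n x n grid and the
    maximum of every cell is kept, so the output has sum(n * n) entries
    regardless of the input height and width. Level 1 is the global max.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceiling for the end of each bin,
            # so neighbouring bins may overlap but are never empty.
            r0, r1 = (i * height) // n, ((i + 1) * height + n - 1) // n
            for j in range(n):
                c0, c1 = (j * width) // n, ((j + 1) * width + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def arcface_logits(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Return scaled cosine logits with an angular margin on the target class.

    scale and margin are illustrative values, not taken from the paper.
    Simplification: no special handling when theta + margin exceeds pi.
    """
    def unit(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    e = unit(embedding)
    logits = []
    for k, weights in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, unit(weights)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if k == target:
            # Add the margin to the angle of the true class only.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


# Feature maps of different sizes yield vectors of the same length.
short_map = [[1, 2, 3], [4, 5, 6]]
long_map = [[r * 7 + c for c in range(7)] for r in range(5)]
print(len(spp_max_pool(short_map)), len(spp_max_pool(long_map)))
```

With levels (1, 2, 4) both calls return 1 + 4 + 16 = 21 values, which is why a pooling layer of this kind lets a network accept spectrograms of varying duration. In the ArcFace-style function, the margin lowers the target-class logit during training, so the embedding has to move closer in angle to its class weight to reach the same score.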


Source: http://dx.doi.org/10.1038/s41598-025-92640-2

Publication Analysis

Top Keywords (with counts): speech emotion (8), emotion recognition (8), recognition ser (8), neural networks (8), image recognition (8), feature vector (8), recognition (5), ser (5), multi-dilated convolution (4), convolution network (4)

