Prosody plays a fundamental role in human speech and communication, facilitating intelligibility and conveying emotional and cognitive states. Extracting accurate prosodic information from speech is vital for building assistive technologies such as controllable speech synthesis, speaking style transfer, and speech emotion recognition (SER). However, disentangling speaker-independent prosody representations is challenging because prosodic attributes, such as intonation, are deeply entangled with speaker-specific attributes, e.g., pitch. In this article, we propose a novel model, called Diffsody, to disentangle and refine prosody representations: 1) to disentangle prosody representations, we leverage the expressive generative ability of a diffusion model by conditioning it on quantized semantic information and pretrained speaker embeddings, while a prosody encoder learns, in an unsupervised fashion, the prosody representations used for spectrogram reconstruction; and 2) to refine these representations and keep them speaker-invariant, we propose a scheduled gradient reversal layer (sGRL) and integrate it into the prosody encoder. We evaluate Diffsody extensively, both qualitatively and quantitatively. t-SNE visualizations and speaker verification experiments demonstrate that sGRL prevents leakage of speaker-specific information. Experimental results on speaker-independent SER and automatic depression detection (ADD) demonstrate that Diffsody efficiently factorizes speaker-independent prosody representations, yielding significant gains on both tasks. In addition, Diffsody combines synergistically with the semantic representation model WavLM, further improving performance and outperforming contemporary methods on both SER and ADD. Furthermore, Diffsody exhibits promising potential for practical applications such as voice or style conversion. Audio samples are available on our demo website: https://leyuanqu.github.io/Diffsody/demo.
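The abstract does not spell out how the sGRL operates, so the following is a minimal PyTorch sketch of a scheduled gradient reversal layer for readers unfamiliar with the technique. Everything below is an assumption for illustration: the DANN-style ramp lambda(p) = 2 / (1 + exp(-gamma * p)) - 1, the names _GradReverse and ScheduledGRL, and the default gamma are not taken from the paper, whose exact schedule is not given in the abstract.

    import math
    import torch
    from torch import nn

    class _GradReverse(torch.autograd.Function):
        """Identity in the forward pass; scales gradients by -lambda on the way back."""

        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) the gradient w.r.t. x; lam itself gets no gradient.
            return -ctx.lam * grad_output, None

    class ScheduledGRL(nn.Module):
        """Gradient reversal whose strength is ramped up over training (hypothetical schedule)."""

        def __init__(self, gamma: float = 10.0):
            super().__init__()
            self.gamma = gamma
            self.progress = 0.0  # fraction of training completed; updated by the training loop

        def set_progress(self, p: float) -> None:
            self.progress = min(max(p, 0.0), 1.0)

        def forward(self, x):
            # DANN-style ramp: lambda grows smoothly from 0 toward 1 as training progresses,
            # so adversarial pressure is weak early on and near full strength at the end.
            lam = 2.0 / (1.0 + math.exp(-self.gamma * self.progress)) - 1.0
            return _GradReverse.apply(x, lam)

In a setup like the one the abstract describes, prosody embeddings would pass through this layer into an auxiliary speaker classifier; because the classifier's gradients reach the prosody encoder with their sign flipped, the encoder is pushed to discard speaker identity, and the ramped schedule keeps that adversarial signal from destabilizing early training.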

Source
DOI: http://dx.doi.org/10.1109/TNNLS.2025.3534822

Publication Analysis

Top Keywords

prosody representations (28); prosody (9); scheduled gradient (8); gradient reversal (8); speaker-independent prosody (8); prosody encoder (8); add tasks (8); ser add (8); representations (7); diffsody (6)

Similar Publications

Prosody has a vital function in speech, structuring a speaker's intended message for the listener. The superior temporal gyrus (STG) is considered a critical hub for prosody, but the role of earlier auditory regions like Heschl's gyrus (HG), associated with pitch processing, remains unclear. Using intracerebral recordings in humans and non-human primate models, we investigated prosody processing in narrative speech, focusing on pitch accents: abstract phonological units that signal word prominence and communicative intent.

Article Synopsis
  • The Implicit Prosody Hypothesis (IPH) suggests that people create internal vocal patterns while reading silently, similar to those used in spoken language.
  • The study used EEG to analyze brain responses as participants read sequences of words with different stress patterns, revealing that unexpected stress in words triggered stronger brain reactions.
  • Results indicated that various brain wave activities correlate with rhythmic expectations in language, supporting the idea that the same neural networks are involved in processing both spoken and silently read language.

Perception of voice cues and speech-in-speech by children with prelingual single-sided deafness and a cochlear implant.

Hear Res

December 2024

Dept. of Otorhinolaryngology/Head and Neck Surgery, University Medical Center Groningen, University of Groningen, The Netherlands; Research School of Behavioral and Cognitive Neuroscience, Graduate School of Medical Sciences, University of Groningen, The Netherlands; W.J. Kolff Institute for Biomedical Engineering and Materials Science, Graduate School of Medical Sciences, University of Groningen, The Netherlands.

Voice cues, such as fundamental frequency (F0) and vocal tract length (VTL), help listeners identify the speaker's gender, perceive linguistic and emotional prosody, and segregate competing talkers. Postlingually implanted adult cochlear implant (CI) users seem to have difficulty perceiving and making use of voice cues, especially VTL. Early implanted child CI users, in contrast, perceive and make use of both voice cues better than adult CI users, in patterns similar to those of their peers with normal hearing (NH).


Functional alterations of lateral temporal cortex for processing voice prosody in adults with autism spectrum disorder.

Cereb Cortex

September 2024

Medical Institute of Developmental Disabilities Research, Showa University, 6-11-11 Kita-Karasuyama, Setagaya-ku, Tokyo 157-8577, Japan.

The human auditory system includes discrete cortical patches and selective regions for processing voice information, including emotional prosody. Although behavioral evidence indicates individuals with autism spectrum disorder (ASD) have difficulties in recognizing emotional prosody, it remains understudied whether and how localized voice patches (VPs) and other voice-sensitive regions are functionally altered in processing prosody. This fMRI study investigated neural responses to prosodic voices in 25 adult males with ASD and 33 controls using voices of anger, sadness, and happiness with varying degrees of emotion.

