Prosody plays a fundamental role in human speech and communication, facilitating intelligibility and conveying emotional and cognitive states. Extracting accurate prosodic information from speech is vital for building assistive technologies such as controllable speech synthesis, speaking style transfer, and speech emotion recognition (SER). However, disentangling speaker-independent prosody representations is challenging because prosodic attributes, such as intonation, are deeply entangled with speaker-specific attributes, e.g., pitch. In this article, we propose a novel model, called Diffsody, to disentangle and refine prosody representations: 1) to disentangle prosody representations, we leverage the expressive generative ability of a diffusion model by conditioning it on quantized semantic information and pretrained speaker embeddings, while a prosody encoder learns, in an unsupervised fashion, the prosody representations used for spectrogram reconstruction; and 2) to refine these representations and keep them speaker-invariant, we propose a scheduled gradient reversal layer (sGRL) and integrate it into the prosody encoder. We evaluate Diffsody extensively, both qualitatively and quantitatively. t-SNE visualizations and speaker verification experiments demonstrate that sGRL prevents leakage of speaker-specific information. Experimental results on speaker-independent SER and automatic depression detection (ADD) demonstrate that Diffsody efficiently factorizes speaker-independent prosody representations, yielding significant gains on both tasks. In addition, Diffsody combines synergistically with the semantic representation model WavLM, further improving performance and outperforming contemporary methods on both SER and ADD. Furthermore, Diffsody exhibits promising potential for practical applications such as voice or style conversion. Audio samples are available on our demo website: https://leyuanqu.github.io/Diffsody/demo.
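The abstract does not spell out how the sGRL operates, so the following is a minimal PyTorch sketch of a scheduled gradient reversal layer for readers unfamiliar with the technique. Everything below is an assumption for illustration: the DANN-style ramp lambda(p) = 2 / (1 + exp(-gamma * p)) - 1, the names _GradReverse and ScheduledGRL, and the default gamma are not taken from the paper, whose exact schedule is not given in the abstract.

    import math
    import torch
    from torch import nn

    class _GradReverse(torch.autograd.Function):
        """Identity in the forward pass; scales gradients by -lambda on the way back."""

        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) the gradient w.r.t. x; lam itself gets no gradient.
            return -ctx.lam * grad_output, None

    class ScheduledGRL(nn.Module):
        """Gradient reversal whose strength is ramped up over training (hypothetical schedule)."""

        def __init__(self, gamma: float = 10.0):
            super().__init__()
            self.gamma = gamma
            self.progress = 0.0  # fraction of training completed; updated by the training loop

        def set_progress(self, p: float) -> None:
            self.progress = min(max(p, 0.0), 1.0)

        def forward(self, x):
            # DANN-style ramp: lambda grows smoothly from 0 toward 1 as training progresses,
            # so adversarial pressure is weak early on and near full strength at the end.
            lam = 2.0 / (1.0 + math.exp(-self.gamma * self.progress)) - 1.0
            return _GradReverse.apply(x, lam)

In a setup like the one the abstract describes, prosody embeddings would pass through this layer into an auxiliary speaker classifier; because the classifier's gradients reach the prosody encoder with their sign flipped, the encoder is pushed to discard speaker identity, and the ramped schedule keeps that adversarial signal from destabilizing early training.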

Source
DOI: http://dx.doi.org/10.1109/TNNLS.2025.3534822

Publication Analysis

Top Keywords

prosody representations (28); prosody (9); scheduled gradient (8); gradient reversal (8); speaker-independent prosody (8); prosody encoder (8); add tasks (8); ser add (8); representations (7); diffsody (6)

Similar Publications

Prosody has a vital function in speech, structuring a speaker's intended message for the listener. The superior temporal gyrus (STG) is considered a critical hub for prosody, but the role of earlier auditory regions like Heschl's gyrus (HG), associated with pitch processing, remains unclear. Using intracerebral recordings in humans and non-human primate models, we investigated prosody processing in narrative speech, focusing on pitch accents: abstract phonological units that signal word prominence and communicative intent.

Article Synopsis
  • The Implicit Prosody Hypothesis (IPH) suggests that people create internal vocal patterns while reading silently, similar to those used in spoken language.
  • The study used EEG to analyze brain responses as participants read sequences of words with different stress patterns, revealing that unexpected stress in words triggered stronger brain reactions.
  • Results indicated that various brain wave activities correlate with rhythmic expectations in language, supporting the idea that the same neural networks are involved in processing both spoken and silently read language.

Perception of voice cues and speech-in-speech by children with prelingual single-sided deafness and a cochlear implant.

Hear Res

December 2024

Dept. of Otorhinolaryngology/Head and Neck Surgery, University Medical Center Groningen, University of Groningen, The Netherlands; Research School of Behavioral and Cognitive Neuroscience, Graduate School of Medical Sciences, University of Groningen, The Netherlands; W.J. Kolff Institute for Biomedical Engineering and Materials Science, Graduate School of Medical Sciences, University of Groningen, The Netherlands.

Voice cues, such as fundamental frequency (F0) and vocal tract length (VTL), help listeners identify the speaker's gender, perceive linguistic and emotional prosody, and segregate competing talkers. Postlingually implanted adult cochlear implant (CI) users seem to have difficulty perceiving and making use of voice cues, especially VTL. Early implanted child CI users, in contrast, perceive and make use of both voice cues better than adult CI users, in patterns similar to those of their peers with normal hearing (NH).


Functional alterations of lateral temporal cortex for processing voice prosody in adults with autism spectrum disorder.

Cereb Cortex

September 2024

Medical Institute of Developmental Disabilities Research, Showa University, 6-11-11 Kita-Karasuyama, Setagaya-ku, Tokyo 157-8577, Japan.

The human auditory system includes discrete cortical patches and selective regions for processing voice information, including emotional prosody. Although behavioral evidence indicates individuals with autism spectrum disorder (ASD) have difficulties in recognizing emotional prosody, it remains understudied whether and how localized voice patches (VPs) and other voice-sensitive regions are functionally altered in processing prosody. This fMRI study investigated neural responses to prosodic voices in 25 adult males with ASD and 33 controls using voices of anger, sadness, and happiness with varying degrees of emotion.

