Prosody plays a fundamental role in human speech and communication, facilitating intelligibility and conveying emotional and cognitive states. Extracting accurate prosodic information from speech is vital for building assistive technology such as controllable speech synthesis, speaking style transfer, and speech emotion recognition (SER). However, disentangling speaker-independent prosody representations is challenging because prosodic attributes, such as intonation, are heavily entangled with speaker-specific attributes, e.g., pitch. In this article, we propose a novel model, called Diffsody, to disentangle and refine prosody representations: 1) to disentangle prosody representations, we leverage the expressive generative ability of a diffusion model by conditioning it on quantized semantic information and pretrained speaker embeddings, while a prosody encoder automatically learns prosody representations for spectrogram reconstruction in an unsupervised fashion; and 2) to refine and learn speaker-invariant prosody representations, a scheduled gradient reversal layer (sGRL) is proposed and integrated into the prosody encoder. We evaluate Diffsody extensively through qualitative and quantitative means. t-SNE visualization and speaker verification experiments demonstrate the efficacy of the sGRL method in preventing speaker-specific information leakage. Experimental results on speaker-independent SER and automatic depression detection (ADD) tasks demonstrate that Diffsody efficiently factorizes speaker-independent prosody representations, yielding significant gains on both tasks. In addition, Diffsody integrates synergistically with the semantic representation model WavLM, further improving performance and outperforming contemporary methods on both SER and ADD. Finally, Diffsody shows promising potential for practical applications such as voice or style conversion.
Audio samples are available at https://leyuanqu.github.io/Diffsody/demo.
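The scheduled gradient reversal layer (sGRL) described above can be sketched minimally. The abstract does not specify Diffsody's exact schedule, so the sketch below assumes the common warm-up schedule from domain-adversarial training (Ganin & Lempitsky); the names `grl_lambda`, `ScheduledGRL`, and the `gamma` parameter are illustrative, not taken from the paper:

```python
import numpy as np

def grl_lambda(progress, gamma=10.0):
    # Assumed warm-up schedule: lambda ramps smoothly from 0 to ~1
    # as training progress goes from 0 to 1, so adversarial pressure
    # on the prosody encoder grows gradually.
    return 2.0 / (1.0 + np.exp(-gamma * progress)) - 1.0

class ScheduledGRL:
    """Identity in the forward pass; negates and rescales gradients
    in the backward pass so the encoder is trained *against* a
    speaker classifier, encouraging speaker-invariant features."""

    def __init__(self, gamma=10.0):
        self.gamma = gamma
        self.progress = 0.0  # fraction of training completed, in [0, 1]

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        # Reverse the incoming gradient, scaled by the schedule.
        return -grl_lambda(self.progress, self.gamma) * grad_output
```

At the start of training the reversal strength is zero, so the encoder first learns useful features before the adversarial signal is phased in; by the end, gradients from the speaker branch are fully reversed.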
DOI: http://dx.doi.org/10.1109/TNNLS.2025.3534822
Nat Commun
March 2025
Center for Neuroscience, University of Pittsburgh, Pittsburgh, PA, USA.
Prosody has a vital function in speech, structuring a speaker's intended message for the listener. The superior temporal gyrus (STG) is considered a critical hub for prosody, but the role of earlier auditory regions like Heschl's gyrus (HG), associated with pitch processing, remains unclear. Using intracerebral recordings in humans and non-human primate models, we investigated prosody processing in narrative speech, focusing on pitch accents: abstract phonological units that signal word prominence and communicative intent.
IEEE Trans Neural Netw Learn Syst
February 2025
Brain Sci
November 2024
Interdisciplinary Ph.D. Program in Literacy Studies, Middle Tennessee State University, Murfreesboro, TN 37132, USA.
Hear Res
December 2024
Dept. of Otorhinolaryngology/Head and Neck Surgery, University Medical Center Groningen, University of Groningen, The Netherlands; Research School of Behavioral and Cognitive Neuroscience, Graduate School of Medical Sciences, University of Groningen, The Netherlands; W.J. Kolff Institute for Biomedical Engineering and Materials Science, Graduate School of Medical Sciences, University of Groningen, The Netherlands.
Voice cues, such as fundamental frequency (F0) and vocal tract length (VTL), help listeners identify the speaker's gender, perceive the linguistic and emotional prosody, and segregate competing talkers. Postlingually implanted adult cochlear implant (CI) users seem to have difficulty in perceiving and making use of voice cues, especially of VTL. Early implanted child CI users, in contrast, perceive and make use of both voice cues better than CI adults, and in patterns similar to their peers with normal hearing (NH).
Cereb Cortex
September 2024
Medical Institute of Developmental Disabilities Research, Showa University, 6-11-11 Kita-Karasuyama, Setagaya-ku, Tokyo 157-8577, Japan.
The human auditory system includes discrete cortical patches and selective regions for processing voice information, including emotional prosody. Although behavioral evidence indicates that individuals with autism spectrum disorder (ASD) have difficulty recognizing emotional prosody, it remains understudied whether and how localized voice patches (VPs) and other voice-sensitive regions are functionally altered in processing prosody. This fMRI study investigated neural responses to prosodic voices in 25 adult males with ASD and 33 controls, using voices of anger, sadness, and happiness with varying degrees of emotion.