One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity and speech content, a task that still remains challenging. Here, we propose a novel approach to learning disentangled speech representation by transfer learning from style-based text-to-speech (TTS) models. With cycle consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through a teacher-student knowledge transfer and novel data augmentation scheme, our approach results in disentangled speech representation without needing the input text. The subjective evaluation shows that our approach can significantly outperform the previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10417535 | PMC |
http://dx.doi.org/10.1109/slt54892.2023.10022498 | DOI Listing |
SLT Workshop Spok Lang Technol
January 2023
Department of Electrical Engineering, Columbia University, USA.
Br J Psychol
November 2020
Toulouse School of Economics, Université Toulouse 1 Capitole, France.
The sound of the voice has several acoustic features that influence the perception of how cooperative the speaker is. It remains unknown, however, whether these acoustic features are associated with actual cooperative behaviour. This issue is crucial to disentangle whether inferences of traits from voices are based on stereotypes, or facilitate the detection of cooperative partners.
View Article and Find Full Text PDFOne purpose of integrating voice interfaces into embedded vehicle systems is to reduce drivers' visual and manual distractions with 'infotainment' technologies. However, there is scant research on actual benefits in production vehicles or how different interface designs affect attentional demands. Driving performance, visual engagement, and indices of workload (heart rate, skin conductance, subjective ratings) were assessed in 80 drivers randomly assigned to drive a 2013 Chevrolet Equinox or Volvo XC60.
View Article and Find Full Text PDFEnter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!