StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

IEEE Spoken Language Technology Workshop (SLT)

Department of Electrical Engineering, Columbia University, USA.

Published: January 2023

AI Article Synopsis

  • One-shot voice conversion allows the transformation of speech from one speaker to another using just a brief audio sample of the target speaker, but accurately separating the speaker's identity from the content is challenging.
  • The proposed method employs transfer learning from style-based text-to-speech models and incorporates cycle-consistent and adversarial training to improve one-shot voice conversion outcomes (a minimal code sketch of the conversion pipeline follows this synopsis).
  • Evaluations reveal that this new approach achieves greater naturalness and similarity in voice conversion compared to existing state-of-the-art models.
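
The conversion pipeline summarized above can be illustrated with a brief, hedged sketch. The class and module names (OneShotVC, content_encoder, style_encoder, decoder) and the tensor shapes are illustrative assumptions, not the authors' released code; the sketch only shows how a fixed-size style vector taken from a short reference clip is combined with speaker-independent content extracted from the source utterance.

```python
# Hypothetical sketch of the one-shot conversion pipeline summarized above.
# Module names and tensor shapes are illustrative assumptions, not the
# authors' released StyleTTS-VC code.
import torch
import torch.nn as nn


class OneShotVC(nn.Module):
    def __init__(self, content_encoder: nn.Module, style_encoder: nn.Module,
                 decoder: nn.Module):
        super().__init__()
        self.content_encoder = content_encoder  # removes speaker identity, keeps linguistic content
        self.style_encoder = style_encoder      # maps a short reference clip to a fixed-size style vector
        self.decoder = decoder                  # renders a mel-spectrogram from (content, style)

    @torch.no_grad()
    def convert(self, source_mel: torch.Tensor,
                reference_mel: torch.Tensor) -> torch.Tensor:
        """Convert the source utterance to the reference speaker's voice.

        source_mel:    (batch, n_mels, T_src) speech whose content is preserved
        reference_mel: (batch, n_mels, T_ref) a few seconds of the target speaker
        """
        content = self.content_encoder(source_mel)  # speaker-independent content representation
        style = self.style_encoder(reference_mel)   # target speaker's style embedding
        return self.decoder(content, style)         # converted mel-spectrogram (vocoded separately)
```

At inference time only a few seconds of reference audio are needed; the hard part, making the content encoder genuinely speaker-independent, is what the paper addresses through transfer learning from the style-based TTS model.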

Article Abstract

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity from the speech content, a task that remains challenging. Here, we propose a novel approach to learning disentangled speech representations by transfer learning from style-based text-to-speech (TTS) models. With cycle-consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through teacher-student knowledge transfer and a novel data augmentation scheme, our approach yields a disentangled speech representation without needing the input text. Subjective evaluation shows that our approach significantly outperforms previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.
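
The teacher-student transfer described in the abstract can be pictured as a latent-matching step: a frozen text-based encoder (teacher) supplies a target representation, and a mel-spectrogram encoder (student) learns to reproduce it from audio alone, which is why no transcription is needed at conversion time. The function name, the L1 matching loss, and the assumption that the teacher latent is already aligned to the mel frames are illustrative choices, not the paper's exact training objective.

```python
# Hedged sketch of teacher-student knowledge transfer for the mel encoder.
# The loss choice (L1), encoder interfaces, and alignment assumption are
# illustrative; the paper's data augmentation scheme is not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_step(
    text_encoder: nn.Module,        # frozen teacher: text tokens -> latent representation
    mel_encoder: nn.Module,         # student: mel-spectrogram -> latent representation
    optimizer: torch.optim.Optimizer,
    text_tokens: torch.Tensor,      # (batch, T_text) phoneme / character ids
    mel: torch.Tensor,              # (batch, n_mels, T_mel) paired speech, possibly augmented
) -> float:
    """One training step that matches the student's latent to the teacher's."""
    with torch.no_grad():
        # Teacher latent, assumed already aligned to the mel frames; gradients blocked.
        target_latent = text_encoder(text_tokens)

    student_latent = mel_encoder(mel)                 # student prediction from audio alone
    loss = F.l1_loss(student_latent, target_latent)   # match the teacher's representation

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once the student mimics the teacher's text-derived latent well enough, the text encoder can be dropped entirely at inference, turning the transcription-guided TTS pipeline into a text-free voice conversion system.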

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10417535
DOI: http://dx.doi.org/10.1109/slt54892.2023.10022498

Publication Analysis

Top Keywords

one-shot voice (12)
voice conversion (12)
tts models (12)
knowledge transfer (8)
style-based tts (8)
target speaker (8)
disentangled speech (8)
speech representation (8)
speech (5)
styletts-vc one-shot (4)

Similar Publications

The sound of the voice has several acoustic features that influence the perception of how cooperative the speaker is. It remains unknown, however, whether these acoustic features are associated with actual cooperative behaviour. This issue is crucial to disentangle whether inferences of traits from voices are based on stereotypes, or facilitate the detection of cooperative partners.

One purpose of integrating voice interfaces into embedded vehicle systems is to reduce drivers' visual and manual distractions with 'infotainment' technologies. However, there is scant research on actual benefits in production vehicles or how different interface designs affect attentional demands. Driving performance, visual engagement, and indices of workload (heart rate, skin conductance, subjective ratings) were assessed in 80 drivers randomly assigned to drive a 2013 Chevrolet Equinox or Volvo XC60.
