Background: Speech is a predominant mode of human communication. Digital speech recordings are inexpensive to capture and contain rich health-related information. Deep learning excels at detecting intricate patterns in such data but requires substantial training data. Laboratories have invested heavily in gathering large digital voice datasets for health insights; the challenge lies in securely sharing these data while protecting speakers' privacy.
Method: We applied a Generative Adversarial Network (GAN) approach. GANs can generate a voice that closely resembles a real one. Our model comprises four key components: (i) a Generator, (ii) a Formant Extractor, (iii) a Speaker Embedding Extractor, and (iv) a Discriminator. Training leverages an adversarial loss: the generator strives to produce a voice realistic enough to deceive the discriminator, while the discriminator tries to discern whether a given voice is authentic or synthetic. At convergence, the generator produces a voice that closely resembles the real one.
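The adversarial training described above can be sketched with the standard non-saturating GAN losses. This is a minimal illustration, not the authors' implementation; the function names and the probability values are assumptions made for the example.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """Binary cross-entropy loss for the discriminator: reward high
    scores on real voices and low scores on generated ones."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-8):
    """Non-saturating generator loss: the generator is rewarded when
    the discriminator scores its output as real."""
    d_fake = np.asarray(d_fake)
    return -np.mean(np.log(d_fake + eps))

# Illustrative discriminator outputs (probability that a clip is real)
d_real = [0.9, 0.8]   # scores on authentic voices
d_fake = [0.2, 0.1]   # scores on generated voices
d_l = discriminator_loss(d_real, d_fake)
g_l = generator_loss(d_fake)
```

As the generator improves and the discriminator's scores on synthetic voices rise, the generator loss falls, which is the dynamic that drives the synthetic voice toward realism.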
Result: We performed zero-shot voice conversion using an emotion-preserving GAN. The model preserves fundamental frequency (F0) trajectories, one of the most important features for dementia classification.
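Preservation of F0 trajectories can be verified by comparing pitch estimates on the original and converted signals. Below is a minimal autocorrelation-based F0 estimator, offered only as an illustrative sketch; production pipelines typically use more robust trackers (e.g. pYIN), and the frame length and pitch range here are assumed values.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency of a voiced frame from the
    autocorrelation peak within the plausible pitch range [fmin, fmax]."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Full autocorrelation, keeping only non-negative lags
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)  # shortest period accepted
    lag_max = int(sr / fmin)  # longest period accepted
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sr / lag

# Synthetic 200 Hz voiced frame at a 16 kHz sampling rate
sr = 16000
t = np.arange(1024) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 200.0 * t), sr)
```

Running the estimator frame by frame over both signals yields two F0 trajectories whose agreement quantifies how well intonation was preserved.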
Conclusion: By successfully altering voice prints while preserving sound quality, our initial findings suggest a path toward sharing raw digital voice recordings. Future work will extend beyond preserving intonation patterns to preserving additional dementia-related markers in the original signal.
DOI: http://dx.doi.org/10.1002/alz.091280