Text-to-speech (TTS) synthesizers are widely used as a vital assistive tool in many fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 rely on a single soft attention mechanism to align the encoder and decoder, and this is their biggest shortcoming: on long sentences they may generate words incorrectly or repeatedly. They may also produce run-on speech with misplaced breaks that ignore punctuation, making the synthesized waveform sound flat and unnatural. In this paper, we propose an end-to-end neural generative TTS model based on a deep-inherited attention (DIA) mechanism with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA to share the same training parameters, which tightens the token-frame correlation and speeds up the alignment process. The LSF is adopted to strengthen contextual connections by expanding the DIA concentration region. In addition, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation, and hidden-state information from the multi-RNN layers is used for attention alignment. Working together, the DIA and multi-RNN layers yield high-quality prediction of the phrase breaks of the synthesized speech. We used WaveGlow as the vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieved a mean opinion score (MOS) of 4.48 for naturalness. Ablation studies further confirm the superiority of the DIA mechanism in enhancing phrase breaks and attention robustness.
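The abstract does not specify the internals of the DIA or the LSF. The sketch below is only an illustration of the two ideas it describes, assuming a Tacotron2-style additive, location-sensitive attention in which one parameter set is reused across several refinement iterations (the "inheritance" notion) and an adjustable local-sensitive factor restricts attention scores to a window around the previous alignment peak. All class names, shapes, kernel sizes, and the exact masking rule are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch (not the authors' code) of iterated, parameter-shared,
# location-sensitive attention with an adjustable local-sensitive window.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSensitiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim, lsf=5, n_iters=2):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, attn_dim, bias=False)
        self.memory_proj = nn.Linear(enc_dim, attn_dim, bias=False)
        self.loc_conv = nn.Conv1d(1, attn_dim, kernel_size=7, padding=3, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)
        self.lsf = lsf          # half-width of the local concentration region (assumed)
        self.n_iters = n_iters  # refinement iterations that reuse the same weights

    def forward(self, query, memory, prev_align):
        # query: (B, dec_dim); memory: (B, T, enc_dim); prev_align: (B, T)
        keys = self.memory_proj(memory)                                 # (B, T, attn_dim)
        align = prev_align
        for _ in range(self.n_iters):                                   # shared-parameter iterations
            loc = self.loc_conv(align.unsqueeze(1)).transpose(1, 2)     # (B, T, attn_dim)
            e = self.score(
                torch.tanh(self.query_proj(query).unsqueeze(1) + keys + loc)
            ).squeeze(-1)                                               # (B, T)
            # Local-sensitive masking: keep scores within +/- lsf of the previous peak.
            peak = align.argmax(dim=1, keepdim=True)                    # (B, 1)
            idx = torch.arange(memory.size(1), device=memory.device).unsqueeze(0)
            e = e.masked_fill((idx - peak).abs() > self.lsf, float("-inf"))
            align = F.softmax(e, dim=1)
        context = torch.bmm(align.unsqueeze(1), memory).squeeze(1)      # (B, enc_dim)
        return context, align


# Toy usage: one decoder step over an 8-token encoded sequence.
attn = LocalSensitiveAttention(enc_dim=256, dec_dim=512, attn_dim=128, lsf=3)
memory = torch.randn(2, 8, 256)
query = torch.randn(2, 512)
prev = F.one_hot(torch.tensor([0, 2]), num_classes=8).float()
ctx, align = attn(query, memory, prev)
print(ctx.shape, align.shape)  # torch.Size([2, 256]) torch.Size([2, 8])
```

In this reading, enlarging `lsf` widens the concentration region (more context), while reusing the same projection weights across iterations plays the role of the inheritance step; the paper's actual formulation of both may differ.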
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9857677 | PMC |
| http://dx.doi.org/10.3390/e25010041 | DOI Listing |