Recent studies have shown that synthesized versions of American English vowels are less accurately identified when the natural time-varying spectral changes are eliminated by holding the formant frequencies constant over the duration of the vowel. A limitation of these experiments has been that vowels produced by formant synthesis are generally less accurately identified than the natural vowels after which they are modeled. To overcome this limitation, a high-quality speech analysis-synthesis system (STRAIGHT) was used to synthesize versions of 12 American English vowels spoken by adults and children. Vowels synthesized with STRAIGHT were identified as accurately as the natural versions, in contrast with previous results from our laboratory showing identification rates 9%-12% lower for the same vowels synthesized using the cascade formant model. Consistent with earlier studies, identification accuracy was not reduced when the fundamental frequency was held constant across the vowel. However, elimination of time-varying changes in the spectral envelope using STRAIGHT led to a greater reduction in accuracy (23%) than was previously found with cascade formant synthesis (11%). A statistical pattern recognition model, applied to acoustic measurements of the natural and synthesized vowels, predicted both the higher identification accuracy for vowels synthesized using STRAIGHT compared to formant synthesis, and the greater effects of holding the formant frequencies constant over time with STRAIGHT synthesis. Taken together, the experiment and modeling results suggest that formant estimation errors and incorrect rendering of spectral and temporal cues by cascade formant synthesis contribute to lower identification accuracy and underestimation of the role of time-varying spectral change in vowels.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1121/1.1852549 | DOI Listing |
This study sets out to investigate the potential effect of males' testosterone level on speech production and speech perception. Regarding speech production, we investigate intra- and inter-individual variation in mean fundamental frequency (f) and formant frequencies and highlight the potential interacting effect of another hormone, i.e.
View Article and Find Full Text PDFLogoped Phoniatr Vocol
April 2024
Speech and Voice Research Laboratory, Faculty of Social Sciences, Tampere University, Tampere, Finland.
Background: To the best of our knowledge, studies on the relationship between spectral energy distribution and the degree of perceived voices are still sparse. Through an auditory-perceptual test we aimed to explore the spectral features that may relate with the auditory-perception of voices.
Methods: Ten judges who were blind to the test's tasks and stimuli rated the amount of twang perceived on seventy-six audio samples.
Front Artif Intell
February 2024
Department of Industrial Engineering, Seoul National University of Science and Technology, Seoul, Republic of Korea.
In this study, we propose a simple yet effective method for incorporating the source speaker's characteristics in the target speaker's speech. This allows our model to generate the speech of the target speaker with the style of the source speaker. To achieve this, we focus on the attention model within the speech synthesis model, which learns various speaker features such as spectrogram, pitch, intensity, formant, pulse, and voice breaks.
View Article and Find Full Text PDFPrimates
May 2024
Primate Behavioral Ecology Lab, Instituto de Neuro-etología, Universidad Veracruzana, Av. Dr. Luis Castelazo Ayala S/N, CP 91190, Xalapa, México.
Formant frequency spacing of long-distance vocalizations is allometrically related to body size and could represent an honest signal of fighting potential. There is, however, only limited evidence that primates use formant spacing to assess the competitive potential of rivals during interactions with extragroup males, a risky context. We hypothesized that if formant spacing of long-distance calls is inversely related to the fighting potential of male mantled howler monkeys (Alouatta palliata), then males should: (1) be more likely and (2) faster to display vocal responses to calling rivals; (3) be more likely and (4) faster to approach calling rivals; and have higher fecal (5) glucocorticoid and (6) testosterone metabolite concentrations in response to rivals calling at intermediate and high formant spacing than to those with low formant spacing.
View Article and Find Full Text PDFJ Acoust Soc Am
February 2024
Yale Child Study Center, School of Medicine, Yale University, New Haven, Connecticut 06511, USA.
The reassigned spectrogram (RS) has emerged as the most accurate way to infer vocal tract resonances from the acoustic signal [Shadle, Nam, and Whalen (2016). "Comparing measurement errors for formants in synthetic and natural vowels," J. Acoust.
View Article and Find Full Text PDFEnter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!