Objectives: Accurate segmentation of the vocal tract from MRI data is essential for various voice, speech, and singing applications. Manual segmentation is time-intensive and susceptible to errors. This study aimed to evaluate the efficacy of deep learning algorithms for automatic vocal tract segmentation from 3D MRI.

Study Design: This study employed a comparative design, evaluating four deep learning architectures for vocal tract segmentation using an open-source dataset of 3D-MRI scans of French speakers.

Methods: Fifty-three vocal tract volumes from 10 French speakers were manually annotated by an expert vocologist, assisted by two graduate students in voice science. The tasks comprised 21 unique French phonemes and three unique voiceless tasks. Four state-of-the-art deep learning segmentation algorithms were evaluated: a 2D slice-by-slice U-Net, a 3D U-Net, a 3D U-Net with transfer learning (pre-trained on lung CT), and a 3D transformer U-Net (3D U-NetR). The STAPLE algorithm, which combines segmentations from multiple annotators into a probabilistic estimate of the true segmentation, was used to create reference segmentations for evaluation. Model performance was assessed using the Dice coefficient, Hausdorff distance (HD), and structural similarity index measure (SSIM).
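To make the evaluation concrete, the sketch below computes the three reported metrics for a pair of binary 3D masks. This is an illustrative implementation assuming NumPy, SciPy, and scikit-image, not the authors' code; the helper names (dice_coefficient, hausdorff_distance, ssim_3d) are hypothetical.

    # Minimal sketch of the three metrics, assuming binary 3D masks as NumPy arrays.
    import numpy as np
    from scipy.spatial.distance import directed_hausdorff
    from skimage.metrics import structural_similarity

    def dice_coefficient(pred, ref):
        # Dice = 2|A and B| / (|A| + |B|) over boolean voxel masks.
        pred, ref = pred.astype(bool), ref.astype(bool)
        denom = pred.sum() + ref.sum()
        return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, ref).sum() / denom

    def hausdorff_distance(pred, ref):
        # Symmetric Hausdorff distance over the (z, y, x) coordinates of the
        # foreground voxels; both masks must be non-empty.
        p, r = np.argwhere(pred), np.argwhere(ref)
        return max(directed_hausdorff(p, r)[0], directed_hausdorff(r, p)[0])

    def ssim_3d(pred, ref):
        # scikit-image's SSIM accepts n-D arrays; data_range=1.0 suits 0/1 masks.
        return structural_similarity(pred.astype(float), ref.astype(float), data_range=1.0)

As written, the Hausdorff distance is in voxel units; scaling the coordinates by the scan's voxel spacing would yield millimeters. SimpleITK provides a STAPLE filter that could be used to build the reference masks from the multiple annotations.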

Results: The 3D U-Net and 3D U-Net with transfer learning models achieved the highest Dice coefficients (0.896±0.05 and 0.896±0.04, respectively). The 3D U-Net with transfer learning performed comparably to the 3D U-Net while using less than half the training data. Together with the 2D slice-by-slice U-Net, it also showed lower variability in HD than the 3D U-Net and 3D U-NetR models. All models had difficulty segmenting certain sounds, particularly /kõn/. Qualitative assessment by a voice expert found anatomically correct segmentations in the oropharyngeal and laryngopharyngeal spaces for all models except the 2D slice-by-slice U-Net, and frequent errors near bony regions (eg, teeth) for all models.

Conclusions: This study demonstrates the effectiveness of 3D convolutional networks, especially with transfer learning, for automatic vocal tract segmentation from 3D MRI. Future research should focus on improving the segmentation of challenging vocal tract configurations and refining boundary delineations.

Source: http://dx.doi.org/10.1016/j.jvoice.2025.02.026
