Objectives: Accurate segmentation of the vocal tract from MRI data is essential for various voice, speech, and singing applications. Manual segmentation is time-intensive and susceptible to errors. This study aimed to evaluate the efficacy of deep learning algorithms for automatic vocal tract segmentation from 3D MRI.

Study Design: This study employed a comparative design, evaluating four deep learning architectures for vocal tract segmentation using an open-source dataset of 3D-MRI scans of French speakers.

Methods: Fifty-three vocal tract volumes from 10 French speakers were manually annotated by an expert vocologist, assisted by two graduate students in voice science. The tasks comprised 21 unique French phonemes and three unique voiceless tasks. Four state-of-the-art deep learning segmentation algorithms were evaluated: a 2D slice-by-slice U-Net, a 3D U-Net, a 3D U-Net with transfer learning (pre-trained on lung CT), and a 3D transformer U-Net (3D U-NetR). The STAPLE algorithm, which combines segmentations from multiple annotators into a probabilistic estimate of the true segmentation, was used to create reference segmentations for evaluation. Model performance was assessed using the Dice coefficient, Hausdorff distance (HD), and structural similarity index measure (SSIM).
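To make the evaluation concrete, the sketch below computes the three reported metrics for a pair of binary 3D masks. This is an illustrative implementation assuming NumPy, SciPy, and scikit-image, not the authors' code; the helper names (dice_coefficient, hausdorff_distance, ssim_3d) are hypothetical.

    # Minimal sketch of the three metrics, assuming binary 3D masks as NumPy arrays.
    import numpy as np
    from scipy.spatial.distance import directed_hausdorff
    from skimage.metrics import structural_similarity

    def dice_coefficient(pred, ref):
        # Dice = 2|A and B| / (|A| + |B|) over boolean voxel masks.
        pred, ref = pred.astype(bool), ref.astype(bool)
        denom = pred.sum() + ref.sum()
        return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, ref).sum() / denom

    def hausdorff_distance(pred, ref):
        # Symmetric Hausdorff distance over the (z, y, x) coordinates of the
        # foreground voxels; both masks must be non-empty.
        p, r = np.argwhere(pred), np.argwhere(ref)
        return max(directed_hausdorff(p, r)[0], directed_hausdorff(r, p)[0])

    def ssim_3d(pred, ref):
        # scikit-image's SSIM accepts n-D arrays; data_range=1.0 suits 0/1 masks.
        return structural_similarity(pred.astype(float), ref.astype(float), data_range=1.0)

As written, the Hausdorff distance is in voxel units; scaling the coordinates by the scan's voxel spacing would yield millimeters. SimpleITK provides a STAPLE filter that could be used to build the reference masks from the multiple annotations.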

Results: The 3D U-Net and 3D U-Net with transfer learning models achieved the highest Dice coefficients (0.896±0.05 and 0.896±0.04, respectively). The 3D U-Net with transfer learning performed comparably to the 3D U-Net while using less than half the training data. Together with the 2D slice-by-slice U-Net, it also showed lower variability in HD than the 3D U-Net and 3D U-NetR models. All models had difficulty segmenting certain sounds, particularly /kõn/. Qualitative assessment by a voice expert found anatomically correct segmentations in the oropharyngeal and laryngopharyngeal spaces for all models except the 2D slice-by-slice U-Net, and frequent errors near bony regions (eg, teeth) for all models.

Conclusions: This study demonstrates the effectiveness of 3D convolutional networks, especially with transfer learning, for automatic vocal tract segmentation from 3D MRI. Future research should focus on improving the segmentation of challenging vocal tract configurations and refining boundary delineations.

Source: http://dx.doi.org/10.1016/j.jvoice.2025.02.026
