Objectives: Accurate segmentation of the vocal tract from MRI data is essential for various voice, speech, and singing applications. Manual segmentation is time-intensive and susceptible to errors. This study aimed to evaluate the efficacy of deep learning algorithms for automatic vocal tract segmentation from 3D MRI.
Study Design: This study employed a comparative design, evaluating four deep learning architectures for vocal tract segmentation using an open-source dataset of 3D MRI scans of French speakers.
Methods: Fifty-three vocal tract volumes from 10 French speakers were manually annotated by an expert vocologist, assisted by two graduate students in voice science. These included 21 unique French phonemes and three unique voiceless tasks. Four state-of-the-art deep learning segmentation algorithms were evaluated: 2D slice-by-slice U-Net, 3D U-Net, 3D U-Net with transfer learning (pre-trained on lung CT), and 3D transformer U-Net (3D U-NetR). The STAPLE algorithm, which combines segmentations from multiple annotators to generate a probabilistic estimate of the true segmentation, was used to create reference segmentations for evaluation. Model performance was assessed using the Dice coefficient, Hausdorff distance, and structural similarity index measure.
Results: The 3D U-Net and the 3D U-Net with transfer learning achieved the highest Dice coefficients (0.896±0.05 and 0.896±0.04, respectively). The 3D U-Net with transfer learning performed comparably to the 3D U-Net while using less than half the training data. It, along with the 2D slice-by-slice U-Net, also showed lower variability in Hausdorff distance than the 3D U-Net and 3D U-NetR models. All models had difficulty segmenting certain sounds, particularly /kõn/. Qualitative assessment by a voice expert found anatomically correct segmentations in the oropharyngeal and laryngopharyngeal spaces for all models except the 2D slice-by-slice U-Net, and frequent errors near bony regions (e.g., teeth) for all models.
Conclusions: This study emphasizes the effectiveness of 3D convolutional networks, especially with transfer learning, for automatic vocal tract segmentation from 3D MRI. Future research should focus on improving the segmentation of challenging vocal tract configurations and refining boundary delineations.
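The three evaluation metrics named in the Methods are standard; below is a minimal sketch (not the authors' implementation) of how they can be computed for a pair of binary 3D masks, assuming NumPy, SciPy, and scikit-image. The function names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from skimage.metrics import structural_similarity

def dice_coefficient(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for boolean masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2.0 * intersection / denom if denom else 1.0

def hausdorff_distance(pred: np.ndarray, ref: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the voxel coordinates
    of the two masks (reported here in voxel units)."""
    p, r = np.argwhere(pred), np.argwhere(ref)
    return max(directed_hausdorff(p, r)[0], directed_hausdorff(r, p)[0])

def ssim_3d(pred: np.ndarray, ref: np.ndarray) -> float:
    """SSIM over the volumes, treating the 0/1 masks as grayscale images."""
    return structural_similarity(pred.astype(float), ref.astype(float),
                                 data_range=1.0)
```

In practice, Hausdorff distance is often reported in millimeters by scaling the voxel coordinates by the scan's voxel spacing before computing distances.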
DOI: http://dx.doi.org/10.1016/j.jvoice.2025.02.026
Indian J Otolaryngol Head Neck Surg
February 2025
Department of ENT, Mahatma Gandhi Medical College and Research Institute, Sri Balaji Vidyapeeth University, Pillaiyarkuppam, Pondicherry, 607402 India.
Laryngopharyngeal reflux disease (LPRD) is characterized by the backflow of gastric contents into the laryngopharynx, distinct from gastroesophageal reflux disease (GERD). Prevalence among otolaryngology patients ranges from 4% to 30%, and LPRD is a major cause of hoarseness of voice. Common symptoms include hoarseness, chronic coughing, globus sensation, and throat clearing; endoscopic evaluation reveals signs such as posterior commissure hypertrophy and vocal fold edema.
J Voice
March 2025
Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, Iowa; Department of Radiology, University of Iowa, Iowa City, Iowa. Electronic address:
Annu Int Conf IEEE Eng Med Biol Soc
July 2024
Speech impairment resulting from laryngectomy causes severe physiological and psychological distress to laryngectomees. In clinical practice, the upper vocal tract articulatory organs function normally in most laryngectomees. Reconstructing speech by leveraging this articulatory information therefore holds significant promise for the effective rehabilitation of speech in these patients.
Annu Int Conf IEEE Eng Med Biol Soc
July 2024
This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio, respectively, and were then used to compute high-level coordination features that served as the inputs for the audio and video modalities.
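As an aside, pipelines of this kind commonly quantify "coordination" via time-lagged cross-correlations between pairs of low-level feature channels; the snippet below is an illustrative sketch under that assumption, not this paper's actual method.

```python
import numpy as np

def coordination_features(series: np.ndarray, max_lag: int = 5) -> np.ndarray:
    """series: (channels, time) array of low-level features
    (e.g., facial action units or vocal tract variables).
    Returns normalized cross-correlations at lags 0..max_lag
    for every pair of distinct channels."""
    n_channels, n_frames = series.shape
    series = series - series.mean(axis=1, keepdims=True)  # zero-mean each channel
    feats = []
    for i in range(n_channels):
        for j in range(i + 1, n_channels):
            for lag in range(max_lag + 1):
                a = series[i, : n_frames - lag]  # channel i, shifted window
                b = series[j, lag:]              # channel j, lagged by `lag`
                denom = np.linalg.norm(a) * np.linalg.norm(b)
                feats.append(float(a @ b / denom) if denom else 0.0)
    return np.asarray(feats)
```

A downstream classifier would then consume such features alongside the text-modality inputs.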
Nat Hum Behav
March 2025
Department of Neurological Surgery, University of California, San Francisco, CA, USA.
Voluntary, flexible stopping of speech output is an essential aspect of speech motor control, especially during natural conversations. The cognitive and neural mechanisms of speech inhibition are not well understood. Here we have recorded direct high-density cortical activity while participants engaged in continuous speech production and were visually cued to stop speaking.