Many acoustic features and machine learning models have been studied to build automatic detection systems to distinguish dysarthric speech from healthy speech. These systems can help to improve the reliability of diagnosis. However, speech recorded for diagnosis in real-life clinical conditions can differ from the training data of the detection system in terms of, for example, recording conditions, speaker identity, and language. These mismatches may lead to a reduction in detection performance in practical applications. In this study, we investigate the use of the wav2vec2 model as a feature extractor together with a support vector machine (SVM) classifier to build automatic detection systems for dysarthric speech. The performance of the wav2vec2 features is evaluated in two cross-database scenarios, language-dependent and language-independent, to study their generalizability to unseen speakers, recording conditions, and languages before and after fine-tuning the wav2vec2 model. The results revealed that the fine-tuned wav2vec2 features showed better generalization in both scenarios and gave an absolute accuracy improvement of 1.46%-8.65% compared to the non-fine-tuned wav2vec2 features.
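To make the method concrete, here is a minimal sketch of the feature-extraction-plus-SVM pipeline described above, using the Hugging Face transformers library and scikit-learn. The checkpoint name, mean-pooling strategy, and data placeholders are assumptions for illustration; the paper's exact layer choice and fine-tuning recipe are not reproduced here.

    # Sketch: utterance-level wav2vec2 embeddings + SVM classifier.
    import numpy as np
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
    from sklearn.svm import SVC

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    model.eval()

    def embed(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
        """Mean-pool the final hidden states into one utterance-level vector."""
        inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # shape (1, frames, 768)
        return hidden.mean(dim=1).squeeze(0).numpy()

    # X_train: list of 16 kHz mono waveforms; y_train: 0 = healthy, 1 = dysarthric.
    # (Hypothetical placeholders -- supply your own data loading.)
    # features = np.stack([embed(w) for w in X_train])
    # clf = SVC(kernel="rbf").fit(features, y_train)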

Source: http://dx.doi.org/10.1109/JBHI.2024.3392829


Similar Publications

Article Synopsis
  • The paper discusses using a deep learning model to objectively assess speech functions during awake craniotomies, aiming to improve surgical outcomes by minimizing reliance on clinician observations.
  • It involved analyzing 1883 audio clips from surgeries in Japan and France to train a Wav2Vec2-based model, which achieved an F1-score of 84.12% on Japanese data and 74.68% when tested across languages (an evaluation sketch follows this list).
  • While the initial results are promising, further evaluation and integration of noise reduction techniques are necessary to enhance the model's performance and accuracy.
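As a hedged illustration of the within- vs. cross-language evaluation summarized above, the sketch below scores a generic wav2vec2 sequence classifier with F1; the checkpoint, two-class label set, and data variables are placeholders, not the paper's actual setup.

    # Sketch: F1 evaluation of a wav2vec2 classifier on two language sets.
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification
    from sklearn.metrics import f1_score

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2ForSequenceClassification.from_pretrained(
        "facebook/wav2vec2-base", num_labels=2  # e.g. intact vs. impaired speech
    )
    model.eval()

    def predict(waveform, sr=16000) -> int:
        inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return int(logits.argmax(dim=-1))

    # After fine-tuning on the Japanese clips (not shown), evaluate both ways:
    # f1_ja = f1_score(y_ja, [predict(w) for w in clips_ja])  # within-language
    # f1_fr = f1_score(y_fr, [predict(w) for w in clips_fr])  # cross-language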
Article Synopsis
  • The study focuses on a new transformer-based architecture, TNet-Full, designed for classifying Mandarin tones using speech characteristics like fundamental frequency (F0) values and syllable/word boundaries.
  • Key components of TNet-Full include a contour encoder, a rhythm encoder, and cross-attention mechanisms that let tone contours attend to rhythmic information (sketched after this list).
  • The model shows significant accuracy improvements over a simpler baseline (24.4% for read speech and 6.3% for conversational speech), indicating better tone recognition through stable temporal organization of syllables.
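The cross-attention step named above can be pictured with a short PyTorch sketch in which contour-encoder frames query rhythm-encoder outputs; the dimensions, head count, and random inputs are assumptions, not TNet-Full's published configuration.

    # Sketch: tone-contour features attending to rhythm features.
    import torch
    import torch.nn as nn

    d_model = 128
    cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

    contour = torch.randn(2, 50, d_model)  # (batch, F0 frames, dim) from a contour encoder
    rhythm = torch.randn(2, 12, d_model)   # (batch, boundary tokens, dim) from a rhythm encoder

    # Contour frames query rhythmic context; output keeps the contour length.
    fused, _ = cross_attn(query=contour, key=rhythm, value=rhythm)
    print(fused.shape)  # torch.Size([2, 50, 128])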

Alzheimer's Disease is a neurodegenerative disorder, and one of its most common and prominent early symptoms is language impairment. Early diagnosis of Alzheimer's Disease through speech and text information is therefore of significant importance. However, multimodal data are often complex and inconsistent, which leads to inadequate feature extraction.


Automatic speech recognition (ASR) for the diagnosis of pronunciation of speech sound disorders in Korean children.

Clin Linguist Phon

August 2024

Department of Rehabilitation Medicine, Incheon St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.

Article Synopsis
  • The study develops an automatic speech recognition (ASR) model specifically to diagnose pronunciation problems in children with speech sound disorders (SSDs), aiming to replace manual transcription methods.
  • The researchers fine-tuned the wav2vec2.0 XLS-R model to better recognize how children with SSDs pronounce words, achieving a Phoneme Error Rate (PER, defined in the sketch after this list) of only 10%.
  • In comparison, a leading ASR model called Whisper struggled with this task, showing a much higher PER of about 50%, highlighting the need for more specialized ASR approaches in clinical settings.
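For reference, PER is an edit-distance metric: the minimum number of phoneme insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length. A self-contained sketch, with illustrative phoneme symbols, follows.

    # Sketch: Phoneme Error Rate via Levenshtein distance.
    def per(ref: list, hyp: list) -> float:
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # cost of deleting all reference phonemes so far
        for j in range(len(hyp) + 1):
            d[0][j] = j  # cost of inserting all hypothesis phonemes so far
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(per(["k", "a", "m", "s", "a"], ["k", "a", "m", "t", "a"]))  # 0.2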

