Publications by authors named "Mark Hasegawa-Johnson"

Systems inspired by progressive neural networks, transferring information from end-to-end articulatory feature detectors to similarly structured phone recognizers, are described. These networks, which connect the corresponding recurrent layers of pre-trained feature detector stacks and newly introduced phone recognizer stacks, were trained on data from four Asian languages, with experiments testing the system on those languages and four African languages. Later refinements add contrastive predictive coding layers at the inputs to the networks' recurrent portions.
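
A minimal PyTorch sketch of the lateral-connection idea behind such progressive networks follows; the GRU stacks, layer sizes, and the single top-layer connection are illustrative assumptions, not the authors' configuration (the paper connects each pair of corresponding recurrent layers).

```python
# Sketch of a progressive-network lateral connection, assuming GRU stacks;
# all sizes and names here are hypothetical.
import torch
import torch.nn as nn

class ProgressivePhoneRecognizer(nn.Module):
    def __init__(self, n_feats=40, hidden=256, n_phones=50):
        super().__init__()
        # Column 1: pre-trained articulatory feature detector, frozen.
        self.feat_rnn = nn.GRU(n_feats, hidden, num_layers=2, batch_first=True)
        for p in self.feat_rnn.parameters():
            p.requires_grad = False
        # Column 2: newly introduced phone recognizer, trained from scratch.
        self.phone_rnn = nn.GRU(n_feats, hidden, num_layers=2, batch_first=True)
        # Lateral adapter from the detector's recurrent output (top layer only,
        # a simplification of the per-layer connections described above).
        self.lateral = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, n_phones)

    def forward(self, x):                    # x: (batch, time, n_feats)
        h_feat, _ = self.feat_rnn(x)         # frozen-column activations
        h_phone, _ = self.phone_rnn(x)
        fused = h_phone + torch.tanh(self.lateral(h_feat))
        return self.out(fused)               # per-frame phone logits
```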

Purpose: The Speech Accessibility Project (SAP) intends to facilitate research and development in automatic speech recognition (ASR) and other machine learning tasks for people with speech disabilities. The purpose of this article is to introduce this project as a resource for researchers, including baseline analysis of the first released data package.

Method: The project aims to facilitate ASR research by collecting, curating, and distributing transcribed U.S. English speech from people with speech disabilities.

Across five studies, we present the preliminary technical validation of an infant-wearable platform, LittleBeats™, that integrates electrocardiogram (ECG), inertial measurement unit (IMU), and audio sensors. Each sensor modality is validated against data from gold-standard equipment using established algorithms and laboratory tasks. Interbeat interval (IBI) data obtained from the LittleBeats™ ECG sensor indicate acceptable mean absolute percent error rates for both adults (Study 1, N = 16) and infants (Study 2, N = 5) across low- and high-challenge sessions, as well as expected patterns of change in respiratory sinus arrhythmia (RSA).
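
For reference, mean absolute percent error between an IBI series and a gold-standard reference can be computed as in this small sketch; the numbers are invented, not study data.

```python
# Mean absolute percent error (MAPE) between wearable and reference
# interbeat intervals; arrays below are made-up examples.
import numpy as np

def mape(estimate, reference):
    estimate, reference = np.asarray(estimate), np.asarray(reference)
    return 100.0 * np.mean(np.abs(estimate - reference) / reference)

wearable_ibi  = [0.81, 0.79, 0.84, 0.80]   # seconds per beat
reference_ibi = [0.80, 0.80, 0.83, 0.81]
print(f"MAPE: {mape(wearable_ibi, reference_ibi):.2f}%")
```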

Background: Wearable devices permit the continuous, unobtrusive collection of data from children in their natural environments and can transform our understanding of child development. Although the use of wearable devices has begun to emerge in research involving children, few studies have considered families' experiences and perspectives of participating in research of this kind.

Objective: Through a mixed methods approach, we assessed parents' and children's experiences of using a new wearable device in the home environment.

The methods of geometric morphometrics are commonly used to quantify morphology in a broad range of biological sciences. The application of these methods to large datasets is constrained by manual landmark placement, which limits the number of landmarks and introduces observer bias. To move the field forward, we need to automate morphological phenotyping in ways that capture comprehensive representations of morphological variation with minimal observer bias.

Classification of infant and parent vocalizations, particularly emotional vocalizations, is critical to understanding how infants learn to regulate emotions in social dyadic processes. This work is an experimental study of classifiers, features, and data augmentation strategies applied to the task of classifying infant and parent vocalization types. Our data were recorded both in the home and in the laboratory.
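
As one hypothetical example of the kind of waveform augmentation such a study might compare (the abstract does not specify the strategies evaluated), here is additive noise at a target SNR plus a random gain:

```python
# Generic waveform augmentation sketch: Gaussian noise at a target SNR
# and random gain. Illustrative only, not the paper's strategies.
import numpy as np

rng = np.random.default_rng(0)

def augment(wave, snr_db=20.0):
    """Add Gaussian noise at snr_db and apply a random gain in [0.8, 1.2]."""
    signal_power = np.mean(wave ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    noisy = wave + rng.normal(scale=np.sqrt(noise_power), size=wave.shape)
    return noisy * rng.uniform(0.8, 1.2)

wave = rng.standard_normal(16000)   # stand-in for 1 s of 16 kHz audio
print(augment(wave).shape)
```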

A language-independent automatic speech recognizer (ASR) is one that can be used for phonetic transcription in languages other than the languages in which it was trained. Language-independent ASR is difficult to train, because different languages implement phones differently: even when phonemes in two different languages are written using the same symbols in the international phonetic alphabet, they are differentiated by different distributions of language-dependent redundant articulatory features. This article demonstrates that the goal of language-independence may be approximated in different ways, depending on the size of the training set, the presence vs.

We design a framework for studying prelinguistic child voice from 3 to 24 months of age, based on state-of-the-art algorithms in diarization. Our system consists of a time-invariant feature extractor, a context-dependent embedding generator, and a classifier. We study the effect of swapping out different components of the system, as well as changing the loss function, to find the best-performing configuration.
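
A skeleton of the three-stage pipeline the abstract names, sketched in PyTorch; the specific layer choices and sizes are hypothetical stand-ins for the components the paper compares.

```python
# Three-stage pipeline skeleton: feature extractor -> context-dependent
# embedding -> classifier. Layer types and sizes are assumptions.
import torch
import torch.nn as nn

class ChildVoicePipeline(nn.Module):
    def __init__(self, n_mels=40, emb=128, n_classes=4):
        super().__init__()
        # Time-invariant feature extractor (1-D convolution over frames).
        self.extractor = nn.Conv1d(n_mels, emb, kernel_size=5, padding=2)
        # Context-dependent embedding generator (bidirectional LSTM).
        self.context = nn.LSTM(emb, emb, batch_first=True, bidirectional=True)
        # Per-frame classifier.
        self.classifier = nn.Linear(2 * emb, n_classes)

    def forward(self, mels):                 # mels: (batch, time, n_mels)
        z = self.extractor(mels.transpose(1, 2)).transpose(1, 2)
        h, _ = self.context(torch.relu(z))
        return self.classifier(h)            # (batch, time, n_classes)
```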

In an effort to guide the development of a computer agent (CA)-based adviser system that presents patient-centered language to older adults (e.g., medication instructions in portal environments or smartphone apps), we evaluated 360 older and younger adults' responses to medication information delivered by a set of CAs.

Patient portals to Electronic Health Record (EHR) systems are underused by older adults because of limited system usability and usefulness, including difficulty understanding numeric information. We investigated whether enhanced context for portal messages about test results improved responses to these messages, comparing verbally-, graphically-, and video-enhanced formats. Older adults viewed scenarios with fictitious patient profiles and messages describing results for these patients from cholesterol or diabetes screening tests, indicating lower, borderline, or higher risk levels.

Most mainstream automatic speech recognition (ASR) systems consider all feature frames equally important. Acoustic landmark theory, however, is based on the contrary idea that some frames are more important than others. It exploits quantal nonlinearities in the articulatory-acoustic and acoustic-perceptual relations to define landmark times at which the speech spectrum abruptly changes or reaches an extremum; frames overlapping landmarks have been shown to be sufficient for speech perception.
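
One simple way to illustrate the landmark idea is to take candidate landmark times at peaks of spectral flux, i.e., frames where the spectrum changes most abruptly; this is a sketch of the concept, not the landmark detector used in these systems.

```python
# Candidate landmark times from peaks of positive spectral flux.
import numpy as np
from scipy.signal import stft, find_peaks

def landmark_candidates(wave, fs=16000):
    _, times, Z = stft(wave, fs=fs, nperseg=400, noverlap=240)
    logmag = np.log(np.abs(Z) + 1e-10)
    # Sum of positive frame-to-frame change across frequency bins.
    flux = np.sum(np.maximum(np.diff(logmag, axis=1), 0.0), axis=0)
    peaks, _ = find_peaks(flux, prominence=np.std(flux))
    return times[1:][peaks]                 # candidate times in seconds

wave = np.random.randn(16000)               # stand-in for 1 s of audio
print(landmark_candidates(wave)[:5])
```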

The best actors, particularly classic Shakespearian actors, are experts at vocal expression. With prosodic inflection, change of voice quality, and non-textual utterances, they communicate emotion, emphasize ideas, create drama, and form a complementary language which works with the text to tell the story in the script. To begin to study selected elements of vocal expression in acted speech, corpora were curated from actors' Hamlet and actresses' Lady Macbeth soliloquy performances.

We describe a project intended to improve the use of Electronic Medical Record (EMR) patient portal information by older adults with diverse numeracy and literacy abilities, so that portals can better support patient-centered care. Patient portals are intended to bridge patients and providers by ensuring patients have continuous access to their health information and services. However, they are underutilized, especially by older adults with low health literacy, because they often function more as information repositories than as tools to engage patients.

Speech can be represented as a constellation of constricting vocal tract actions called gestures, whose temporal patterning with respect to one another is expressed in a gestural score. Current speech datasets do not come with gestural annotation and no formal gestural annotation procedure exists at present. This paper describes an iterative analysis-by-synthesis landmark-based time-warping architecture to perform gestural annotation of natural speech.
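
Time warping of this kind is typically driven by a dynamic programming alignment. Below is a bare-bones dynamic time warping routine of the sort an analysis-by-synthesis loop might use to align synthesized and natural feature tracks; it is illustrative only, not the paper's architecture.

```python
# Minimal dynamic time warping (DTW) between two feature sequences.
import numpy as np

def dtw_path(a, b):
    """Align a (n, d) with b (m, d); return total cost and index pairs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrace through the accumulated-cost matrix.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return D[n, m], path[::-1]

cost, path = dtw_path(np.random.randn(20, 3), np.random.randn(24, 3))
print(cost, path[:3])
```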

A multimodal approach combining acoustics, intelligibility ratings, articulography, and surface electromyography was used to examine the characteristics of dysarthria due to cerebral palsy (CP). CV syllables were studied by obtaining the slope of the F2 transition during the diphthong, tongue-jaw kinematics during the release of the onset consonant, and the associated submental muscle activity, and by relating these measures to speech intelligibility. The results show that larger reductions in F2 slope are correlated with lower intelligibility in CP-related dysarthria.
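
The F2-slope measure can be pictured as a least-squares line fit to a formant track across the transition; the track values in this sketch are invented for illustration.

```python
# F2 transition slope by linear regression on a (made-up) formant track.
import numpy as np

t  = np.linspace(0.0, 0.12, 13)             # seconds, 13 analysis frames
f2 = np.array([1200, 1260, 1330, 1400, 1480, 1550, 1620, 1680,
               1730, 1770, 1800, 1820, 1830], dtype=float)   # Hz

slope, intercept = np.polyfit(t, f2, 1)
print(f"F2 slope: {slope:.0f} Hz/s")  # shallower slopes pattern with lower intelligibility
```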

Content-based multimedia indexing, retrieval, and processing, as well as multimedia databases, demand the structuring of media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content with the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering: the task of assigning every speech utterance in an audio stream to its speaker.
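
A common baseline for this task clusters utterance embeddings agglomeratively under a cosine distance, as in the following sketch; random embeddings stand in for real ones, and the paper's actual method may differ.

```python
# Agglomerative speaker clustering over utterance embeddings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
embeddings = rng.standard_normal((20, 64))    # 20 utterances, 64-dim each

dists = pdist(embeddings, metric="cosine")
tree = linkage(dists, method="average")
labels = fcluster(tree, t=0.7, criterion="distance")  # threshold is arbitrary
print(labels)                                 # speaker index per utterance
```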

Background/aims: This study examined the spectral characteristics of American English vowels in dysarthria associated with cerebral palsy (CP), and investigated the relationship between a speaker's overall speech intelligibility and vowel contrast.

Methods: The data were collected from 12 American English native speakers (9 speakers with a diagnosis of CP and 3 controls). Primary measures were F1 and F2 frequencies of 3 corner vowels /i, a, u/ and 3 noncorner vowels /I, 3, */.
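
Vowel contrast is often summarized by the area of the polygon the corner vowels span in the F1-F2 plane; here is a sketch with textbook-like formant values, not data from this study.

```python
# Vowel space area from corner-vowel formant means (shoelace formula).
import numpy as np

# (F1, F2) means in Hz for /i/, /a/, /u/ -- invented example values
corners = np.array([[300, 2300], [750, 1200], [320, 900]], dtype=float)

def polygon_area(pts):
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

print(f"Vowel triangle area: {polygon_area(corners):.0f} Hz^2")
```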

This paper analyses consonant articulation errors in dysarthric speech produced by seven American-English native speakers with cerebral palsy. Twenty-three consonant phonemes were transcribed with diacritics as necessary in order to represent non-phoneme misarticulations. Error frequencies were examined with respect to six variables: articulatory complexity, place of articulation, and manner of articulation of the target phoneme; and change in articulatory complexity, place, and manner resulting from the misarticulation.

In this paper, an acoustic model for the robustness analysis of optimal multipoint room equalization is proposed. Optimal multipoint equalization aims to achieve optimal performance, in a least-squares sense, at all measured points. The model can be used to estimate robustness theoretically as a function of critical design parameters, such as the number of measurement points, the distance between measurements, or the frequency, before a real equalization system is applied.
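
In the least-squares formulation, the multipoint problem is to pick one equalizer w that minimizes the summed squared error between each equalized response and a target, over all measurement points. A small numpy sketch under assumed filter lengths, with random stand-ins for measured room responses:

```python
# Least-squares multipoint equalizer: one shared FIR filter w that best
# flattens all measured responses simultaneously. Illustrative only.
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
n_points, L_room, L_eq = 4, 64, 32
rooms = rng.standard_normal((n_points, L_room)) * np.exp(-np.arange(L_room) / 16)

def conv_matrix(h, n_cols):
    """Toeplitz matrix C such that C @ w == np.convolve(h, w)."""
    col = np.concatenate([h, np.zeros(n_cols - 1)])
    row = np.zeros(n_cols); row[0] = h[0]
    return toeplitz(col, row)

# Ideal target: a delta (in practice a modeling delay would be included).
target = np.zeros(L_room + L_eq - 1); target[0] = 1.0
A = np.vstack([conv_matrix(h, L_eq) for h in rooms])
d = np.tile(target, n_points)
w, *_ = np.linalg.lstsq(A, d, rcond=None)   # shared equalizer coefficients
print(w.shape)
```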

Stuttering is a developmental speech disorder that occurs in 5% of children and remits spontaneously in approximately 70% of cases. Previous imaging studies in adults with persistent stuttering found left white matter deficiencies and reversed right-left asymmetries compared to fluent controls. We hypothesized that similar differences might be present in children at risk of stuttering, indicating differences in brain development.

Acoustic cues related to the voice source, including harmonic structure and spectral tilt, were examined for relevance to prosodic boundary detection. The measurements considered here comprise five categories: duration, pitch, harmonic structure, spectral tilt, and amplitude. Distributions of the measurements and statistical analysis show that they may be used to differentiate between prosodic categories.
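
Spectral tilt, for instance, is often quantified as the slope of a straight-line fit to the log-magnitude spectrum; the following sketch shows one such measure and is not the paper's exact procedure.

```python
# Spectral tilt as the slope of a line fit to the log-magnitude spectrum.
import numpy as np

def spectral_tilt(frame, fs=16000):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    keep = (freqs > 50) & (freqs < 5000)          # speech band
    slope, _ = np.polyfit(freqs[keep], 20 * np.log10(spec[keep] + 1e-10), 1)
    return slope * 1000                           # dB per kHz

frame = np.random.randn(400)                      # stand-in 25 ms frame
print(f"tilt: {spectral_tilt(frame):.2f} dB/kHz")
```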

Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification.
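
A toy version of the first stage is sketched below: an SVM mapping multiframe acoustic observations to a distinctive-feature probability. Random features stand in for real acoustic measurements, and the binary feature is hypothetical.

```python
# SVM classifying multiframe observations into a binary distinctive
# feature (e.g., a hypothetical [+/-sonorant] detector); toy data only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 7 * 40))   # 7-frame windows of 40-dim features
y = rng.integers(0, 2, size=200)         # feature present / absent

svm = SVC(kernel="rbf", probability=True).fit(X, y)
print(svm.predict_proba(X[:3]))          # per-window feature probabilities
```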

This article evaluates intertalker variance of oral area, logarithm of the oral area, tongue height, and formant frequencies as a function of vowel category. The data consist of coronal magnetic resonance imaging (MRI) sequences and acoustic recordings of 5 talkers, each producing 11 different vowels. Tongue height (left, right, and midsagittal), palate height, and oral area were measured in 3 coronal sections anterior to the oropharyngeal bend and were subjected to multivariate analysis of variance, variance ratio analysis, and regression analysis.
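
The variance-ratio question (does a measurement separate vowel categories more than it varies within them?) reduces to a between-/within-groups sum-of-squares ratio, as in this toy example with invented numbers.

```python
# Toy between-/within-category variance ratio for a vowel measurement.
import numpy as np

# tongue-height-like measurements grouped by vowel category (invented)
groups = [np.array([10.1, 10.4, 9.8]),    # /i/
          np.array([6.2, 6.5, 6.0]),      # /a/
          np.array([9.0, 9.3, 8.8])]      # /u/

grand = np.mean(np.concatenate(groups))
between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
within = sum(((g - g.mean()) ** 2).sum() for g in groups)
print(f"variance ratio: {between / within:.1f}")  # large => categories separate
```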

Three-dimensional tongue shape during vowel production is analyzed using the three-mode PARAFAC (parallel factors) model. Three-dimensional MRI images of five speakers (9 vowels) are analyzed. Sixty-five virtual fleshpoints (13 segments along the rostral-caudal dimension and 5 segments along the right-left direction) are chosen based on the interpolated tongue shape images.
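
Assuming the tensorly library is available, a three-mode PARAFAC of a speakers x length x width tensor shaped like the fleshpoint grid above can be sketched as follows; random data stand in for the MRI-derived tongue shapes.

```python
# Three-mode PARAFAC decomposition of a 5 x 13 x 5 tongue-shape tensor.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

X = tl.tensor(np.random.default_rng(4).standard_normal((5, 13, 5)))
weights, factors = parafac(X, rank=2)     # one factor matrix per mode
for name, F in zip(("speaker", "length", "width"), factors):
    print(name, F.shape)
```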
