NSP-SCD: A corpus construction protocol for child-directed print in understudied languages.

Behav Res Methods

NeuroSpin, CEA, Gif-sur-Yvette, France.

Published: April 2024

Child-directed print corpora enable systematic psycholinguistic investigations, but this research infrastructure is unavailable for many understudied languages. Moreover, researchers working on understudied languages depend on manual tagging because accurate automated parsers are not yet available. One plausible way forward is to limit the intensive work to a small corpus. However, because approaches to corpus construction have received little systematic enquiry, it is unclear how robust a small corpus can be made. The current study examines the potential of a non-sequential sampling protocol for small corpus development (NSP-SCD) through cross-corpora and within-corpus analyses. A corpus of 17,584 words was developed by applying the protocol to a larger corpus of 150,595 words from children's books for 3-to-10-year-olds. Although the larger corpus by definition contains more unique words and unique orthographic units, the selectively sampled small corpus approximated the larger corpus in lexical and orthographic diversity and was equivalent in orthographic representation and word length. Psycholinguistic complexity increased with book level and varied by part of speech. Finally, in a robustness check of lexical diversity, the non-sequentially sampled small corpus was more efficient than a same-sized corpus constructed by simply using all sentences from a few books (402 books vs. seven books). If a small corpus must be used, non-sequential sampling from books stratified by book level makes the corpus statistics better approximate those found in larger corpora. Overall, the protocol shows promise as a tool to advance the science of child language acquisition in understudied languages.
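The contrast drawn in the robustness check can be illustrated with a toy sketch. This is not the published NSP-SCD procedure, whose details the abstract does not give. It assumes, purely for illustration, that "non-sequential sampling" means drawing sentences at random from the pooled books of each level up to a per-level word budget, and that the baseline reads books in order until the same overall budget is reached. The function names, the `books` data layout, and the whitespace tokenizer are all hypothetical.

```python
import random
from collections import defaultdict


def sample_non_sequential(books, words_per_level, seed=0):
    """Assumed illustration: pool sentences by book level, shuffle each pool,
    and take sentences until that level's word budget is reached."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for book in books:
        by_level[book["level"]].extend(book["sentences"])

    sample = []
    for level in sorted(by_level):
        pool = list(by_level[level])
        rng.shuffle(pool)
        count = 0
        for sentence in pool:
            if count >= words_per_level:
                break
            sample.append(sentence)
            count += len(sentence.split())  # naive whitespace tokenization
    return sample


def sample_sequential(books, word_budget):
    """Baseline: take sentences book by book, in order, until the budget is met."""
    sample = []
    count = 0
    for book in books:
        for sentence in book["sentences"]:
            if count >= word_budget:
                return sample
            sample.append(sentence)
            count += len(sentence.split())
    return sample


def type_count(sentences):
    """Number of distinct word forms, a crude proxy for lexical diversity."""
    return len({word.lower() for s in sentences for word in s.split()})


if __name__ == "__main__":
    # Toy data: three levels, each book with its own small vocabulary.
    books = [{"level": i, "sentences": [f"a{i} b{i}"] * 4} for i in (1, 2, 3)]
    non_seq = sample_non_sequential(books, words_per_level=2)
    seq = sample_sequential(books, word_budget=6)
    print(type_count(non_seq), type_count(seq))
```

On this toy data both samples contain six words, but the stratified sample draws on every level's vocabulary while the sequential one never leaves the first book. Real corpora would need a language-appropriate tokenizer and a proper diversity measure in place of the raw type count used here.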


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11133114
DOI: http://dx.doi.org/10.3758/s13428-024-02339-x


