A corpus of Chinese word segmentation agreement.

Yiu-Kei Tsang, Ming Yan, Jinger Pan, Megan Yin Kan Chan

Behav Res Methods

Department of Education Studies, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong.

Published: December 2024

The absence of explicit word boundaries is a distinctive characteristic of Chinese script that sets it apart from most alphabetic scripts and leads to disagreement among readers about where word boundaries fall. Previous studies have examined how this feature may influence reading performance, but further investigations are required to generate more ecologically valid and generalizable findings. To advance our understanding of the impact of word boundaries in Chinese reading, we introduce the Chinese Word Segmentation Agreement (CWSA) corpus. This corpus consists of 500 sentences, comprising 9813 character tokens and 1590 character types, and provides data on word segmentation agreement at each character position. The data revealed a high level of overall segmentation agreement (92%). However, participants disagreed on the position of word boundaries in 8.96% of the cases. Moreover, about 85% of the sentences contained at least one ambiguous word boundary. The character strings with high levels of disagreement were tentatively classified into three categories, namely the morphosyntactic type (e.g., "-"), the modifier-head type (e.g., "-"), and others (e.g., "-"). Finally, the agreement scores significantly influenced reading behaviors, as evidenced by analyses with published eye movement data. Specifically, a high level of disagreement was associated with longer single fixation durations. We discuss the implications of these results and highlight how the CWSA corpus can facilitate future research on word segmentation in Chinese reading.
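As a rough illustration of what "segmentation agreement at each character position" could look like in practice, the sketch below computes, for every gap between adjacent characters, the proportion of annotators who placed a word boundary there. The abstract does not state the formula used for the CWSA corpus, so the agreement measure (`max(p, 1 - p)`), the function names, and the example segmentations are all assumptions made for illustration and are not taken from the corpus.

```python
from collections import Counter


def boundary_positions(words):
    """Return the gap indices where a word boundary falls.

    Gap i sits between character i and character i + 1 (1-based),
    so a sentence of n characters has gaps 1..n-1.
    """
    positions = set()
    offset = 0
    for word in words[:-1]:  # no boundary is counted after the last word
        offset += len(word)
        positions.add(offset)
    return positions


def gap_agreement(segmentations):
    """Per-gap boundary proportion and a simple agreement score.

    `segmentations` is a list of word lists, one per annotator, all
    covering the same sentence. The agreement score max(p, 1 - p) is an
    assumed measure, not necessarily the one used in the CWSA corpus.
    """
    sentence = "".join(segmentations[0])
    for seg in segmentations:
        if "".join(seg) != sentence:
            raise ValueError("all segmentations must cover the same sentence")

    counts = Counter()
    for seg in segmentations:
        counts.update(boundary_positions(seg))

    total = len(segmentations)
    rows = []
    for gap in range(1, len(sentence)):
        p = counts[gap] / total  # share of annotators marking a boundary here
        rows.append((gap, p, max(p, 1 - p)))
    return rows


# Hypothetical segmentations of one six-character string by four annotators.
example = [
    ["研究", "生命", "起源"],
    ["研究生", "命", "起源"],
    ["研究", "生命", "起源"],
    ["研究", "生命起源"],
]

for gap, p_boundary, agreement in gap_agreement(example):
    print(gap, p_boundary, agreement)
```

Under this assumed measure, gaps where nearly everyone agrees (boundary or no boundary) score close to 1, while contested gaps score closer to 0.5.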

DOI: http://dx.doi.org/10.3758/s13428-024-02528-8
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11682008
