Moving beyond word frequency based on tally counting: AI-generated familiarity estimates of words and phrases are an interesting additional index of language knowledge.

Behav Res Methods

ETSI de Telecomunicación, Universidad Politécnica de Madrid, Avenida Complutense, 30, 28040, Madrid, Spain.

Published: December 2024

This study investigates the potential of large language models (LLMs) to estimate the familiarity of words and multi-word expressions (MWEs). We validated LLM estimates for isolated words using existing human familiarity ratings and found strong correlations. LLM familiarity estimates performed even better in predicting lexical decision and naming performance in megastudies than the best available word frequency measures. We then applied LLM estimates to MWEs and found that they were also effective for measuring the familiarity of these expressions. We have created a list of more than 400,000 English words and MWEs with LLM-generated familiarity estimates, which we hope will be a valuable resource for researchers. There is also a cleaned-up list of nearly 150,000 entries, excluding lesser-known stimuli, to streamline stimulus selection. Our findings highlight the advantages of LLM-based familiarity estimates, including their better performance than traditional word frequency measures (particularly for predicting word recognition accuracy), their ability to generalize to MWEs, their availability for large lists of words, and the ease of obtaining new estimates for all types of stimuli.
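The validation step described in the abstract, comparing LLM estimates with human familiarity ratings, can be illustrated with a minimal sketch. It assumes the comparison is a plain Pearson correlation over the words that appear in both lists; the abstract does not specify the exact statistic. The words, the numeric values, and the names `llm_familiarity`, `human_familiarity`, and `pearson` are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for illustration only; NOT data from the study.
llm_familiarity = {
    "dog": 6.9, "kitchen": 6.7, "harbinger": 3.8,
    "quokka": 2.1, "kick the bucket": 5.6,
}
human_familiarity = {
    "dog": 6.8, "kitchen": 6.6, "harbinger": 3.2,
    "quokka": 1.9, "kick the bucket": 5.9,
}

# Compare only the items rated in both sources.
words = sorted(llm_familiarity.keys() & human_familiarity.keys())
r = pearson(
    [llm_familiarity[w] for w in words],
    [human_familiarity[w] for w in words],
)
print(f"r = {r:.2f} over {len(words)} items")
```

The same function could be pointed at a word frequency measure instead of the LLM estimates to compare how strongly each one tracks the human ratings.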

Source
http://dx.doi.org/10.3758/s13428-024-02561-7

Similar Publications

Despite being widely spoken and studied by language and cognitive scientists, Italian lacks large resources of language processing data. The Italian Crowdsourcing Project (ICP) is a dataset of word recognition times and accuracy that includes responses to 130,465 words, which makes it the largest dataset of its kind in terms of number of items. The data were collected in an online word knowledge task in which over 156,000 native speakers of Italian took part.

Contrasting genetic burden for bipolar disorder: Early onset versus late onset in an older adult bipolar disorder sample.

Eur Neuropsychopharmacol

December 2024

Bipolar and Depressive Disorders Unit, Hospital Clinic de Barcelona, Barcelona, Spain; Fundació Clínic per la Recerca Biomèdica-Institut d'Investigacions Biomèdiques August Pi i Sunyer (FCRB-IDIBAPS), Barcelona, Spain; Centro de Investigación Biomédica en Red de Salud Mental (CIBERSAM), Instituto de Salud Carlos III, Madrid, Spain.

Older Adults with Bipolar Disorder (OABD) represent a heterogeneous group, including those with early and late onset of the disorder. Recent evidence shows both groups have distinct clinical, cognitive, and medical features, tied to different neurobiological profiles. This study explored the link between polygenic risk scores (PRS) for bipolar disorder (PRS-BD), schizophrenia (PRS-SCZ), and major depressive disorder (PRS-MDD) with age of onset in OABD.

IDyOMpy: A new Python-based model for statistical analysis of musical expectations.

J Neurosci Methods

December 2024

Laboratoire des Systèmes Perceptifs, Département d'Étude Cognitive, École Normale Supérieure, PSL, Paris, France; Institute for Systems Research, Electrical and Computer Engineering, University of Maryland, College Park, USA.

Background: IDyOM (Information Dynamics of Music) is the statistical model of music most widely used in the music neuroscience community. It has been shown to yield significant correlations with EEG (Marion, 2021), ECoG (Di Liberto, 2020) and fMRI (Cheung, 2019) recordings of human music listening. The language in which IDyOM is written, Lisp, is not very familiar to the neuroscience community, which makes the model hard to use and, more importantly, hard to modify.

Multicenter Analysis of the Relationship Between Operative Team Familiarity and Safety and Efficiency Outcomes in Cardiac Surgery.

Circ Cardiovasc Qual Outcomes

December 2024

Surgical Sabermetrics Laboratory, Centre for Medical Informatics, Usher Institute, The University of Edinburgh, Scotland (S.Y.).

Background: Safety in cardiac surgical procedures is predicated on effective team dynamics. This study associated operative team familiarity (ie, the extent of clinical collaboration among surgical team members) with procedural efficiency and Society of Thoracic Surgeons (STS) adjudicated patient outcomes.

Methods: Institutional STS adult cardiac surgery registry and electronic health record data from 2014 to 2021 were evaluated across 3 quaternary hospitals.
