The word embedding association test (WEAT) is an important method for measuring linguistic biases against social groups such as ethnic minorities in large text corpora. It does so by comparing the semantic relatedness between words prototypical of the groups (e.g., names unique to those groups) and attribute words (e.g., 'pleasant' and 'unpleasant' words). We show that anti-Black WEAT estimates from geo-tagged social media data at the level of metropolitan statistical areas strongly correlate with several measures of racial animus, even when controlling for sociodemographic covariates. However, we also show that every one of these correlations is explained by a third variable: the frequency of Black names in the underlying corpora relative to White names. This occurs because word embeddings tend to group positive (negative) words and frequent (rare) words together in the estimated semantic space. As the frequency of Black names on social media is strongly correlated with Black Americans' prevalence in the population, this results in spuriously high anti-Black WEAT estimates wherever few Black Americans live. This suggests that research using the WEAT to measure bias should consider term frequency, and also demonstrates the potential consequences of using black-box models like word embeddings to study human cognition and behavior.
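
To make the comparison of group words and attribute words concrete, the following is a minimal sketch of the commonly used WEAT effect size (the two-target, two-attribute form). It is not taken from this paper, whose exact estimator may differ. The function names, the toy 2-d vectors, and the choice of the sample standard deviation (`ddof=1`) are all assumptions made for illustration; real analyses would use embeddings trained on the corpus of interest.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Mean cosine similarity of w to attribute set A minus its mean similarity to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference in mean association between target sets X and Y,
    scaled by the standard deviation of associations over X and Y combined.
    Assumption: sample standard deviation (ddof=1)."""
    s_X = np.array([association(x, A, B) for x in X])
    s_Y = np.array([association(y, A, B) for y in Y])
    pooled = np.concatenate([s_X, s_Y])
    return float((s_X.mean() - s_Y.mean()) / pooled.std(ddof=1))

# Toy vectors (hypothetical, not real embeddings).
pleasant = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
unpleasant = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]
group_x = [np.array([0.8, 0.2]), np.array([0.7, 0.3])]  # lies nearer the 'pleasant' vectors
group_y = [np.array([0.2, 0.8]), np.array([0.3, 0.7])]  # lies nearer the 'unpleasant' vectors

effect = weat_effect_size(group_x, group_y, pleasant, unpleasant)
print(f"WEAT effect size: {effect:.3f}")
```

A positive value here means the first target set sits closer to the first attribute set than the second target set does. Because the score depends only on where name vectors fall in the embedding space, anything that shifts those positions, including how often the names occur in the corpus, can move the estimate.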


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10147343
DOI: http://dx.doi.org/10.1609/icwsm.v16i1.19399


