While machine coding of data has advanced dramatically in recent years, the literature raises significant concerns about the validation of LLM classification, showing, for example, that reliability varies greatly by prompt and temperature tuning and across subject areas and tasks, especially in "zero-shot" applications. This paper contributes to the discussion of validation in several ways. To test the relative performance of supervised and semi-supervised algorithms when coding political data, we compare three models' performance with each other, over multiple iterations for each model, and with trained expert coding of the data. We also examine changes in performance resulting from prompt engineering and pre-processing of source data. To ameliorate concerns regarding LLMs' pre-training on test data, we assess performance by updating an existing dataset beyond what is publicly available. Overall, we find that only GPT-4 approaches the performance of trained expert coders when coding contexts familiar to human coders, and that it codes more consistently across contexts. We conclude by discussing some benefits and drawbacks of machine coding moving forward.
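The abstract does not say which agreement statistics the study uses, so the following is only an illustrative sketch of how a comparison between model output, repeated runs, and expert coding might be set up. The metrics chosen here (raw accuracy, Cohen's kappa, and the share of items labeled identically in every run), the function names, and the example labels are all assumptions made for illustration, not the authors' method or data.

```python
from collections import Counter


def accuracy(predicted, expert):
    """Share of items where the model's label matches the expert's label."""
    if len(predicted) != len(expert):
        raise ValueError("label lists must have the same length")
    return sum(p == e for p, e in zip(predicted, expert)) / len(expert)


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both coders pick the same label at random.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


def run_consistency(runs):
    """Share of items that received the same label in every run of one model."""
    n_items = len(runs[0])
    stable = sum(len({run[i] for run in runs}) == 1 for i in range(n_items))
    return stable / n_items


# Hypothetical labels, invented purely for illustration.
expert = ["protest", "none", "protest", "none", "none", "protest"]
runs = [
    ["protest", "none", "protest", "none", "protest", "protest"],
    ["protest", "none", "none", "none", "protest", "protest"],
    ["protest", "none", "protest", "none", "protest", "protest"],
]

for i, run in enumerate(runs, start=1):
    print(f"run {i}: accuracy={accuracy(run, expert):.3f}, "
          f"kappa={cohen_kappa(run, expert):.3f}")
print(f"consistency across runs: {run_consistency(runs):.3f}")
```

Reporting a chance-corrected statistic alongside raw accuracy is a common choice when label frequencies are unbalanced, and a separate consistency figure keeps run-to-run variation distinct from agreement with the experts.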


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11102067
DOI: http://dx.doi.org/10.1093/pnasnexus/pgae165


Similar Publications

Background: Despite significant advancements in the development of blood biomarkers for AD, challenges persist due to the complex interplay of genetic and environmental risk factors in AD pathogenesis. Epigenetic processes, including non-coding RNAs and especially microRNAs (miRs), have emerged as important players in the molecular mechanisms underlying neurodegenerative diseases. MiRs have the ability to fine-tune gene expression and proteostasis, and microRNAome profiling in liquid biopsies is gaining increasing interest since changes in miR levels can indicate the presence of multiple pathologies.


Basic Science and Pathogenesis.

Alzheimers Dement

December 2024

Chambers-Grundy Center for Transformative Neuroscience, Department of Brain Health, School of Integrated Health Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA.

Background: Although high-throughput DNA/RNA sequencing technologies have generated massive genetic and genomic data in human disease, translation of these findings into new patient treatments has not materialized for lack of effective approaches, such as Artificial Intelligence (AI) and Machine Learning (ML) tools.

Method: To address this problem, we used AI/ML approaches, Mendelian randomization (MR), and large patient genetic and functional genomic datasets to evaluate druggable targets, using Alzheimer's disease (AD) as a prototypical example. We utilized the genomic instruments from 9 expression quantitative trait loci (eQTL) and 3 protein quantitative trait loci (pQTL) datasets across five human brain regions from three biobanks.


Community-acquired pneumonia (CAP) is associated with high mortality rates and often results in prolonged hospital stays. The potential of machine learning to enhance prediction accuracy in this context is significant, yet clinicians often lack the programming skills required for effective data mining. This study aimed to assess the effectiveness of a low-code approach for assisting clinicians with data mining for mortality and length of stay (LOS) prediction in patients with CAP.


Time Course of Orientation Ensemble Representation in the Human Brain.

J Neurosci

January 2025

School of Psychological and Cognitive Sciences and Beijing Key Laboratory of Behavior and Mental Health, Peking University, Beijing 100871, People's Republic of China

Natural scenes are filled with groups of similar items. Humans employ ensemble coding to extract the summary statistical information of the environment, thereby enhancing the efficiency of information processing, something particularly useful when observing natural scenes. However, the neural mechanisms underlying the representation of ensemble information in the brain remain elusive.

