Peptide-HLA (pHLA) binding prediction is essential in screening peptide candidates for personalized peptide vaccines. Machine learning (ML) pHLA binding prediction tools are trained on vast amounts of data and are effective in screening peptide candidates. Most ML models report the ability to generalize to HLA alleles unseen during training ("pan-allele" models). However, the use of datasets with imbalanced allele content raises concerns about biased model performance. First, we examine the data bias of two ML-based pan-allele pHLA binding predictors. We find that the pHLA datasets overrepresent alleles from geographic populations of high-income countries. Second, we show that the identified data bias is perpetuated within ML models, leading to algorithmic bias and subpar performance for alleles expressed in low-income geographic populations. We draw attention to the potential therapeutic consequences of this bias, and we challenge the use of the term "pan-allele" to describe models trained with currently available public datasets.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10770483PMC
http://dx.doi.org/10.1016/j.isci.2023.108613DOI Listing

Publication Analysis

Top Keywords

phla binding
12
binding predictors
8
binding prediction
8
screening peptide
8
peptide candidates
8
data bias
8
geographic populations
8
hlaequity examining
4
examining biases
4
biases pan-allele
4

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!