AI Article Synopsis

  • The study aimed to validate a predictive model using Short Tandem Repeat (STR) profiles to determine ancestry affiliation among distinct population groups (African Americans, Asians, Caucasians, and Hispanic Americans) through machine learning.
  • A total of 360,000 genetic profiles were generated, and the XGBoost model proved the most accurate, achieving 94.24% accuracy across all four groups and higher rates on tasks with fewer, more distinct groups.
  • Results indicated that a larger training set size positively impacted accuracy, peaking at 94% with 90,000 profiles per category, while using a reduced number of markers still yielded relatively high accuracy, suggesting STR-based models could enhance forensic investigations involving ancestry.

Article Abstract

The aim of this study was to test the validity of a predictive model of ancestry affiliation based on Short Tandem Repeat (STR) profiles. Frequencies of 29 genetic markers, taken from the Promega website for four distinct population groups (African Americans, Asians, Caucasians, and Hispanic Americans), were used to generate 360,000 profiles (90,000 profiles per group). These profiles were then used to train and test a range of machine learning algorithms with the goal of identifying the best model for accurate ancestry prediction. The chosen models (Decision Trees, Support Vector Machines, and XGBoost, among others) were implemented in Python, and their performance was compared. The XGBoost model outperformed the others, achieving an accuracy of 94.24% across all four classes, 99.06% on a three-class task distinguishing Asian, African American, and Caucasian subsamples, and 98.57% when distinguishing among African Americans, Asians, and a mixed group combining Caucasians and Hispanics. Evaluating the impact of training set size showed that model accuracy peaked at 94% with 90,000 profiles per category but fell to 83% when the number of profiles per category was reduced to 500, with precision in distinguishing between the Caucasian and Hispanic subgroups particularly affected. The study also investigated the impact of the number of markers on model accuracy: using 21 markers commonly available in commercial amplification kits yielded an accuracy of 96.3% for African Americans, Asians, and Caucasians, and 88.28% for all four groups combined.
These findings underscore the potential of STR-based models in forensic analysis and hint at the broader applicability of machine learning in genetic ancestry determination, with implications for enhancing the precision and reliability of forensic investigations, particularly in heterogeneous environments where ancestral background can be a crucial piece of information.
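
The profile-generation step described in the abstract can be illustrated with a minimal Python sketch. It is not the study's pipeline: the frequency table, the population and marker names, and all function names are hypothetical placeholders (not the Promega data), and a simple likelihood-based classifier is swapped in for XGBoost so that the example runs with the standard library alone. The sketch assumes each profile carries two alleles per marker drawn independently from the population's frequencies.

```python
import random

# Hypothetical frequencies for illustration only (NOT the Promega data).
# Structure: population -> marker -> {allele (repeat count): frequency}
FREQS = {
    "pop_A": {
        "marker_1": {10: 0.6, 11: 0.3, 12: 0.1},
        "marker_2": {7: 0.5, 8: 0.4, 9: 0.1},
        "marker_3": {14: 0.7, 15: 0.2, 16: 0.1},
    },
    "pop_B": {
        "marker_1": {10: 0.1, 11: 0.3, 12: 0.6},
        "marker_2": {7: 0.1, 8: 0.4, 9: 0.5},
        "marker_3": {14: 0.1, 15: 0.2, 16: 0.7},
    },
}


def simulate_profile(pop_freqs, rng):
    """Draw two alleles per marker independently from the frequency table."""
    profile = {}
    for marker, table in pop_freqs.items():
        alleles = list(table)
        weights = list(table.values())
        pair = rng.choices(alleles, weights=weights, k=2)
        profile[marker] = tuple(sorted(pair))
    return profile


def simulate_dataset(freqs, n_per_group, seed=0):
    """Generate n_per_group labelled profiles for every population."""
    rng = random.Random(seed)
    data = []
    for pop, pop_freqs in freqs.items():
        for _ in range(n_per_group):
            data.append((simulate_profile(pop_freqs, rng), pop))
    return data


def genotype_likelihood(profile, pop_freqs):
    """Probability of a profile in one population, assuming independent markers."""
    likelihood = 1.0
    for marker, (a, b) in profile.items():
        p, q = pop_freqs[marker][a], pop_freqs[marker][b]
        likelihood *= p * p if a == b else 2 * p * q
    return likelihood


def predict(profile, freqs):
    """Stand-in classifier: pick the population with the highest likelihood."""
    return max(freqs, key=lambda pop: genotype_likelihood(profile, freqs[pop]))


dataset = simulate_dataset(FREQS, n_per_group=1000)
correct = sum(predict(profile, FREQS) == label for profile, label in dataset)
accuracy = correct / len(dataset)
print(f"accuracy on simulated profiles: {accuracy:.3f}")
```

To move closer to the setup the abstract describes, the simulated profiles would be encoded as numeric features, split into training and test sets, and passed to a learned model such as XGBoost; the number of profiles per group and the number of markers could then be varied to see how each affects accuracy.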

Source
http://dx.doi.org/10.1016/j.fsigen.2024.103183

Publication Analysis

Top Keywords

machine learning: 12
accuracy rating: 12
str profiles: 8
african americans: 8
americans asians: 8
asians caucasians: 8
90000 profiles: 8
model accuracy: 8
profiles category: 8
profiles: 6