Optimized sample selection for cost-efficient long-read population sequencing.

T Rhyker Ranallo-Benavidez, Zachary Lemmon, Sebastian Soyk, Sergey Aganezov, William J Salerno, Rajiv C McCoy, Zachary B Lippman, Michael C Schatz, Fritz J Sedlazeck

Genome Res

Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA.

Published: May 2021

An increasingly common approach in population genetics is to genotype a large cohort with a low-resolution technique and then resequence a few selected individuals with more comprehensive long-read sequencing.
SVCollector identifies an optimal subset of individuals for resequencing by analyzing population-level VCF files and ranking samples so that a subset of a given size contains as many variants as possible.
Using a fast greedy heuristic and an exact integer linear programming algorithm, SVCollector selected individuals from every subpopulation in the cohorts tested, whereas naive methods yielded unbalanced selections.

An increasingly important scenario in population genetics is when a large cohort has been genotyped using a low-resolution approach (e.g., microarrays, exome capture, short-read WGS), and a few individuals from it are then resequenced using a more comprehensive approach, especially long-read sequencing. The subset of individuals selected should ensure that the captured genetic diversity is fully representative and includes variants across all subpopulations. For example, studies of human variation have historically focused on individuals with European ancestry, but this ancestry represents a small fraction of the overall diversity. To address this, SVCollector identifies the optimal subset of individuals for resequencing by analyzing population-level VCF files from low-resolution genotyping studies. It then computes a ranked list of samples that maximizes the total number of variants present within a subset of a given size. To solve this optimization problem, SVCollector implements a fast, greedy heuristic and an exact algorithm using integer linear programming. We apply SVCollector to simulated data, 2504 human genomes from the 1000 Genomes Project, and 3024 genomes from the 3000 Rice Genomes Project, and show that the rankings it computes are more representative than those of alternative naive strategies. When selecting an optimal subset of 100 samples in these cohorts, SVCollector identifies individuals from every subpopulation, whereas naive methods yield an unbalanced selection. Finally, we show that the number of variants present in cohorts selected using this approach follows a power-law distribution that is naturally related to the population genetic concept of the allele frequency spectrum, allowing us to estimate the diversity present with increasing numbers of samples.
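The greedy heuristic mentioned above can be illustrated with a minimal sketch. This is not SVCollector's actual code: the function name `greedy_select`, the dictionary input (sample name mapped to a set of variant identifiers), the alphabetical tie-breaking, and the early stop when no sample adds a new variant are all assumptions made for illustration. The real tool reads population-level VCF files, which this sketch does not parse.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(
    sample_variants: Dict[str, Set[str]], k: int
) -> List[Tuple[str, int]]:
    """Rank up to k samples by how many not-yet-covered variants each adds.

    Hypothetical helper: at each step, pick the sample carrying the most
    variants that no previously chosen sample carries.
    """
    covered: Set[str] = set()          # variants captured so far
    remaining = dict(sample_variants)  # samples not yet chosen
    ranking: List[Tuple[str, int]] = []

    for _ in range(min(k, len(remaining))):
        # Iterate in sorted order so ties resolve alphabetically (assumption).
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            # No remaining sample adds anything new; stop early (assumption).
            break
        covered |= remaining.pop(best)
        ranking.append((best, gain))

    return ranking


# Toy cohort with made-up sample and variant identifiers.
cohort = {
    "S1": {"v1", "v2", "v3"},
    "S2": {"v2", "v3"},
    "S3": {"v4", "v5"},
    "S4": {"v1", "v5", "v6"},
}
print(greedy_select(cohort, 3))
```

In this toy cohort, S2 is never chosen because all of its variants are already carried by S1, which shows how selecting for new variants rather than raw variant count avoids redundant samples. A greedy ranking of this kind is fast but not guaranteed to be optimal, which is why the abstract also describes an exact integer linear programming algorithm.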

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8092009
DOI: http://dx.doi.org/10.1101/gr.264879.120
