Integrated approach to generate artificial samples with low tumor fraction for somatic variant calling benchmarking.

Aldo Sergi, Luca Beltrame, Sergio Marchini, Marco Masseroli

BMC Bioinformatics

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133, Milan, Italy.

Published: May 2024

Background: High-throughput sequencing (HTS) has become the gold standard approach for variant analysis in cancer research. However, somatic variants may occur at low fractions due to contamination from normal cells or tumor heterogeneity, which poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated in scenarios with minimal tumor DNA, such as circulating tumor DNA in plasma. Assessing the detection sensitivity of HTS approaches in such cases is essential, but it is time-consuming and expensive: it requires specialized experimental protocols and a sufficient quantity of samples for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed to generate artificial datasets suitable for this task. It simulates ultra-deep targeted sequencing data with low-fraction variants, and we demonstrate the effectiveness of these datasets in benchmarking low-fraction variant calling.
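To illustrate why low variant fractions are hard to detect and why ultra-deep sequencing is needed, the following minimal sketch computes the probability of observing at least a given number of variant-supporting reads at one site. It assumes reads are drawn independently (a binomial model) and ignores sequencing errors; the function name and the example depths, fractions, and thresholds are hypothetical and are not taken from the paper.

```python
from math import comb


def prob_at_least_k_alt_reads(depth: int, vaf: float, k: int) -> float:
    """Return P(X >= k) for X ~ Binomial(depth, vaf).

    depth: number of reads covering the site (assumed independent)
    vaf:   variant allele fraction, between 0 and 1
    k:     minimum number of variant-supporting reads required
    """
    # Sum the probabilities of seeing fewer than k variant reads,
    # then take the complement.
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1.0 - vaf) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Hypothetical scenario: a 0.1% variant, requiring 5 supporting reads.
    for depth in (1_000, 5_000, 10_000):
        p = prob_at_least_k_alt_reads(depth, vaf=0.001, k=5)
        print(f"depth={depth:>6}  P(>=5 alt reads)={p:.3f}")
```

Under this simplified model, the chance of collecting enough supporting reads rises with depth, which is consistent with the abstract's focus on ultra-deep targeted sequencing for low-fraction variants.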

Results: Our approach generates artificial raw reads that mimic real data without relying on pre-existing data. It does so by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple different datasets. It then incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for low-fraction variant calling benchmarking, we used them as ground truth to evaluate the performance of widely used variant calling algorithms. The datasets allowed us to define tuned parameter values for major variant callers, considerably improving their detection of very low-fraction variants.
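Using an artificial dataset as ground truth usually comes down to comparing a caller's output with the set of inserted variants. The sketch below shows one simple way to do that, assuming variants are matched exactly on chromosome, position, reference allele, and alternate allele. The names `Variant`, `Metrics`, and `benchmark` are hypothetical, and this is not the authors' evaluation code.

```python
from typing import NamedTuple, Set, Tuple

# A variant is identified by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    tp: int
    fp: int
    fn: int
    precision: float
    recall: float
    f1: float


def benchmark(truth: Set[Variant], calls: Set[Variant]) -> Metrics:
    """Compare called variants against the inserted (ground-truth) variants."""
    tp = len(truth & calls)   # inserted variants that were called
    fp = len(calls - truth)   # calls that were never inserted
    fn = len(truth - calls)   # inserted variants that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return Metrics(tp, fp, fn, precision, recall, f1)


if __name__ == "__main__":
    # Hypothetical truth set and call set, for illustration only.
    truth = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "C", "T")}
    calls = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr3", 10, "T", "G")}
    print(benchmark(truth, calls))
```

Running the same comparison for each parameter setting of a caller gives a direct way to see which settings recover more of the inserted low-fraction variants and at what cost in false positives.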

Conclusions: Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, which facilitates rapid prototyping and benchmarking of algorithms for this type of dataset, and the need to advance low-fraction variant calling techniques.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11077792
DOI: http://dx.doi.org/10.1186/s12859-024-05793-8

Publication Analysis

Top Keywords

variant calling

artificial datasets

tumor dna

low-fraction variant

low tumor

tumor fraction

calling benchmarking

minimal tumor

generation artificial

low-fraction variants

Similar Publications

Several variants on chromosome 10 are associated with coarse hair diameter in Dazu black goats (Capra hircus).

Anim Genet

February 2025

College of Animal Science and Technology, Southwest University, Chongqing, China.

Jipan Zhang, Jiabei Fang, Siyuan Zhang, Jiele Xu, Yongju Zhao

Goats typically have double coats, with the outermost coarse hairs providing protection against mechanical and radiation damage. While much attention has been paid to cashmere due to its status as a high-end textile material, there is limited information available on coarse hair. This study aimed to identify genomic variants, such as single nucleotide polymorphisms (SNPs) and insertion/deletions (indels), associated with coarse hair diameter using a genome-wide association study (GWAS).

SV-JIM, detailed pairwise structural variant calling using long-reads and genome assemblies.

Methods

January 2025

Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada.

Clarence Todd, Lingling Jin, Ian McQuillan

This paper proposes a detailed process for SV calling that uses both genome assemblies and long reads and permits a data-driven assessment of multiple SV callers. The process is implemented as a software pipeline named Structural Variant - Jaccard Index Measure, or SV-JIM, using the Snakemake [20] workflow management system. Like most state-of-the-art SV callers, SV-JIM detects variations between pairs of genomes, but it streamlines the numerous SV calling stages into a single process for user convenience. It then evaluates the multiple SV sets produced using the Jaccard index measure to identify those with the highest consistency among the included SV callers.

A next-generation sequencing-based universal target panel and algorithm for one-stop detection of copy number alterations and single-nucleotide variations in the HBB gene cluster for rapid diagnosis of β-thalassemia.

Mol Biol Rep

January 2025

Department of Zoology, The University of Burdwan, Bardhaman, West Bengal, 713104, India.

Debashis Pal, Prosanto Kumar Chowdhury, Kaustav Nayek, Nidhan K Biswas, Subrata Das

Background: This study aimed to develop and validate a targeted next-generation sequencing (NGS) panel along with a data analysis algorithm capable of detecting single-nucleotide variants (SNVs) and copy number variations (CNVs) within the beta-globin gene cluster. The aim was to reduce the turnaround time in conventional genotyping methods and provide a rapid and comprehensive solution for prenatal diagnosis, carrier screening, and genotyping of β-thalassemia patients.

Methods And Results: We devised a targeted NGS panel spanning an 80.

Protocol for mitochondrial variant enrichment from single-cell RNA sequencing using MAESTER.

STAR Protoc

January 2025

Division of Hematology, Brigham and Women's Hospital, Boston, MA, USA; Broad Institute of MIT and Harvard, Cambridge, MA, USA; Department of Medicine, Harvard Medical School, Boston, MA, USA; Ludwig Center at Harvard, Harvard Medical School, Boston, MA, USA.

Jonathan D Good, Ksenia R Safina, Tyler E Miller, Peter van Galen

Single-cell RNA sequencing (scRNA-seq) enables detailed characterization of cell states but often lacks insights into tissue clonal structures. Here, we present a protocol to probe cell states and clonal information simultaneously by enriching mitochondrial DNA (mtDNA) variants from 3'-barcoded full-length cDNA. We describe steps for input library preparation, mtDNA enrichment, PCR product cleanup, and paired-end sequencing.

TopoQual polishes circular consensus sequencing data and accurately predicts quality scores.

BMC Bioinformatics

January 2025

Auburn University, Auburn, AL, 36849, USA.

Minindu Weerakoon, Sangjin Lee, Emily Mitchell, Haynes Heaton

Background: Pacific Biosciences (PacBio) circular consensus sequencing (CCS), also known as high fidelity (HiFi) technology, has revolutionized modern genomics by producing long (10+ kb) and highly accurate reads. This is achieved by sequencing circularized DNA molecules multiple times and combining them into a consensus sequence. Currently, the accuracy and quality value estimation provided by HiFi technology are more than sufficient for applications such as genome assembly and germline variant calling.
