We compare the results of three assembler programs, Celera, Phrap and Mira2, on the same set of about one hundred thousand Sanger reads derived from an unknown bacterial genome. In contrast to previous assembly comparisons, we focus not on computation speed or the number of assembled contigs but on how well the different sequence assemblies agree in content. Genome regions assembled consistently by all three programs are identified in order to estimate a lower bound on the number of erroneously identified single nucleotide polymorphisms (SNPs) caused by nothing but the process of mathematical sequence assembly. We identified 509 sequence triplets common to all three de novo assemblies, spanning only 34% (3.3 Mb) of the bacterial genome, with 175 of these regions (~1.5 Mb) containing erroneous SNPs and insertions/deletions. Within these triplets, this leads on average to one error per 7,155 base pairs. When the assembler Mira2 is replaced by the most recent version, Mira3, this figure drops further to one error per 5,923 base pairs. Our results therefore suggest that a considerable number of erroneous SNPs may be present in current sequence data and that mathematicians should urgently take up research on the numerical stability of sequence assembly algorithms. Furthermore, even the latest versions of currently used assemblers produce erroneous SNPs that depend on the order in which reads are supplied as input. Such errors will severely hamper molecular diagnostics as well as efforts to relate genome variation to disease. This issue needs to be addressed urgently as the field is moving fast into clinical applications.
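The "one error per N base pairs" figure can be illustrated with a minimal Python sketch. It assumes each triplet is already available as three equal-length, gap-padded aligned strings, and it counts every alignment column in which the three assemblies disagree. The function names, the `-` gap symbol and the toy sequences are hypothetical choices for illustration only; the abstract does not describe how the authors aligned or scored the triplets, so this is not their pipeline.

```python
from typing import Dict, Sequence

GAP = "-"  # assumed gap symbol in the aligned input


def count_discrepancies(aligned: Sequence[str]) -> Dict[str, int]:
    """Count alignment columns where the assemblies disagree.

    A column with a gap in some but not all sequences is counted as an
    indel; any other disagreeing column is counted as a substitution.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    substitutions = 0
    indels = 0
    for column in zip(*aligned):
        bases = {b.upper() for b in column}
        if len(bases) == 1:
            continue  # all assemblies agree at this position
        if GAP in bases:
            indels += 1
        else:
            substitutions += 1
    return {"substitutions": substitutions, "indels": indels}


def bases_per_error(total_bases: int, errors: int) -> float:
    """Average number of bases per discrepancy (infinite if none)."""
    if errors == 0:
        return float("inf")
    return total_bases / errors


# Toy triplet: one region as reconstructed by three hypothetical assemblers.
triplet = [
    "ACGTACGTAC",
    "ACGTACCTAC",  # differs at one position (substitution)
    "ACG-ACGTAC",  # differs at one position (deletion)
]
counts = count_discrepancies(triplet)
total_errors = counts["substitutions"] + counts["indels"]
rate = bases_per_error(len(triplet[0]), total_errors)
```

Read the same way, and assuming the 3.3 Mb figure is the total length over which the rate was computed, one error per 7,155 bp would correspond to roughly 460 discrepancies across the 509 triplets.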

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4510600
DOI: http://dx.doi.org/10.4137/GEI.S3653

Similar Publications

Background: The allo-octoploid Fragaria x ananassa follows disomic inheritance, yet the high sequence similarity among its subgenomes can lead to misalignment of short sequencing reads (150 bp). This misalignment results in an increased number of erroneous variants during variant calling. To accurately associate traits with the appropriate subgenome, it is essential to filter out these erroneous variants.

Wild populations are increasingly threatened by human-mediated climate change and land use changes. As populations decline, the probability of inbreeding increases, along with the potential for negative effects on individual fitness. Detecting and characterizing runs of homozygosity (ROHs) is a popular strategy for assessing the extent of individual inbreeding present in a population and can also shed light on the genetic mechanisms contributing to inbreeding depression.

Comparison of commercial targeted amplicon sequencing assays for human remains identification casework.

Int J Legal Med

January 2025

National DNA Program for Unidentified and Missing Persons, Australian Federal Police, Majura, ACT, Australia.

Targeted amplicon sequencing (TAS) facilitates the genotyping of forensically informative single nucleotide polymorphisms (SNPs) using massively parallel sequencing (MPS). For human remains identification, where any extracted DNA is likely to be degraded, TAS may succeed when short tandem repeat (STR) profiling using capillary electrophoresis fails. Further, as well as yielding identity information, SNPs can provide information about ancestry, phenotype, kinship and paternal lineage (Y chromosome haplotypes).

Retention of ES cell-derived 129S genome drives NLRP1 hypersensitivity and transcriptional deregulation in Nlrp3 mice.

Cell Death Differ

December 2024

Institute of Innate Immunity, Department for Systems Immunology and Proteomics, Medical Faculty, University of Bonn, Bonn, Germany.

Article Synopsis
  • Immune response genes vary widely among humans and mice, leading to different defense responses based on genetic differences, which can complicate research outcomes.
  • This study used RNA sequencing and variant analysis to uncover significant genetic variations in a commonly used mouse model of NLRP3 deficiency, highlighting the impact of the Nlrp1 locus on macrophage activation independently of the NLRP3 status.
  • The research provides a method to identify important genetic variations and evaluate contamination in transgenic mice studies, enhancing the reliability of findings in host defense research.

An accurate haplotyping method using multiplex pyrosequencing with AS-PCR to detect ABCB1 haplotypes associated with rivaroxaban-derived hemorrhagic events.

Talanta

January 2025

School of Life Science and Technology, China Pharmaceutical University, Nanjing, 210009, China; Department of Clinical Pharmacy, Jinling Hospital, Nanjing, 210002, China.

In clinical practice, haplotypes have attracted greater attention than individual single nucleotide polymorphisms (SNPs) because of the comprehensive genetic insights they offer. Because the SNP loci lie far apart, detecting a haplotype from genomic DNA is challenging. Current haplotyping methods are either expensive and labor-intensive (high-throughput DNA sequencing) or unable to haplotype a single clinical sample (computational approaches).
