We compare the results of three different assembler programs, Celera, Phrap and Mira2, for the same set of about a hundred thousand Sanger reads derived from an unknown bacterial genome. In contrast to previous assembly comparisons, we do not focus on speed of computation and numbers of assembled contigs but on how far the different sequence assemblies agree in content. Threefold consistently assembled genome regions are identified in order to estimate a lower bound on the number of erroneously identified single nucleotide polymorphisms (SNPs) caused by nothing but the process of mathematical sequence assembly. We identified 509 sequence triplets common to all three de-novo assemblies, spanning only 34% (3.3 Mb) of the bacterial genome, with 175 of these regions (~1.5 Mb) including erroneous SNPs and insertions/deletions. Within these triplets, this amounts on average to one error per 7,155 base pairs. When the assembler Mira2 is replaced by the most recent version, Mira3, the latter number even drops to 5,923. Our results therefore suggest that a considerable number of erroneous SNPs may be present in current sequence data, and that mathematicians should urgently take up research on the numerical stability of sequence assembly algorithms. Furthermore, even the latest versions of currently used assemblers produce erroneous SNPs that depend on the order in which reads are supplied as input. Such errors will severely hamper molecular diagnostics as well as efforts to relate genome variation to disease. This issue needs to be addressed urgently, as the field is moving fast into clinical applications.
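The per-base error figure quoted in the abstract can be illustrated with a minimal Python sketch. It assumes the three assemblies of a region have already been aligned to equal length, with `-` marking gaps; the function names `count_discrepancies` and `bases_per_error` and the toy sequences are hypothetical and are not taken from the study, whose actual comparison procedure is not described in the abstract.

```python
def count_discrepancies(a: str, b: str, c: str) -> tuple[int, int]:
    """Count alignment columns where three aligned sequences disagree.

    Returns (substitution_columns, indel_columns). A column counts as an
    indel if any of the three sequences has a gap ('-') there.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("aligned sequences must have equal length")
    substitutions = 0
    indels = 0
    for x, y, z in zip(a, b, c):
        column = {x, y, z}
        if len(column) == 1:
            continue  # all three assemblies agree at this position
        if "-" in column:
            indels += 1
        else:
            substitutions += 1
    return substitutions, indels


def bases_per_error(total_bases: int, errors: int) -> float:
    """Average number of aligned bases per discrepancy."""
    if errors == 0:
        return float("inf")
    return total_bases / errors


# Toy aligned triplet (illustrative data only, not from the paper).
celera = "ACGTACGTAC"
phrap = "ACGTACGTAC"
mira = "ACCTAC-TAC"

subs, gaps = count_discrepancies(celera, phrap, mira)
rate = bases_per_error(len(celera), subs + gaps)
print(subs, gaps, rate)
```

In this toy triplet one column is a substitution and one is an indel, so the ten aligned positions carry one discrepancy per five bases. The same ratio, taken over all consistently assembled triplets, is the kind of quantity the abstract reports as one error per 7,155 base pairs.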
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4510600 | PMC |
http://dx.doi.org/10.4137/GEI.S3653 | DOI Listing |