Motivation: Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels have been shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of the indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of contiguous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased because alignment methods assume that indel lengths follow a geometric distribution.
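For illustration, the gap-counting approach criticized above can be sketched in a few lines of Python. This is a minimal sketch and not code from the article: the function names and the toy alignment are hypothetical, and `-` is assumed to be the gap character. It tallies maximal runs of gap characters per aligned row, which is exactly the quantity that need not correspond to a single indel event.

```python
from collections import Counter
from itertools import groupby


def gap_run_lengths(aligned_seq, gap_char="-"):
    """Return the lengths of maximal runs of gap characters in one aligned row."""
    return [
        sum(1 for _ in run)
        for is_gap, run in groupby(aligned_seq, key=lambda c: c == gap_char)
        if is_gap
    ]


def gap_length_histogram(alignment):
    """Count gap-block lengths over all rows of an alignment (list of strings)."""
    counts = Counter()
    for row in alignment:
        counts.update(gap_run_lengths(row))
    return counts


# Hypothetical toy alignment, for illustration only.
toy_alignment = [
    "ACG---TACGT",
    "ACGTTATA--T",
    "A-GTTATACGT",
]
hist = gap_length_histogram(toy_alignment)
```

A block of three gap characters is counted here as one event of length 3, although it could equally result from several adjacent or overlapping indels, which is why such counts are only a rough proxy for the underlying length distribution.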
Results: We aimed to determine which indel-length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them.
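The general idea of Approximate Bayesian Computation for model selection can be illustrated with a toy rejection sampler. The sketch below is not the authors' pipeline: it works on bare lists of indel lengths rather than on simulated alignments, and the truncation point, prior ranges, summary statistics, and all function names are assumptions chosen for illustration. It draws a model (geometric or Zipf) and a parameter from the prior, simulates lengths, keeps the simulations whose summary statistics are closest to the observed ones, and reports the fraction of kept simulations that came from each model.

```python
import random

MAX_LEN = 50  # assumed truncation point for both length distributions


def geometric_pmf(p):
    """Truncated geometric distribution over lengths 1..MAX_LEN."""
    weights = [(1 - p) ** (k - 1) * p for k in range(1, MAX_LEN + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def zipf_pmf(a):
    """Truncated Zipf (power-law) distribution over lengths 1..MAX_LEN."""
    weights = [k ** (-a) for k in range(1, MAX_LEN + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_lengths(pmf, n, rng):
    """Draw n indel lengths from a probability mass function."""
    return rng.choices(range(1, MAX_LEN + 1), weights=pmf, k=n)


def summary(lengths):
    """Summary statistics: fractions of length 1, length <= 3, and length >= 10."""
    n = len(lengths)
    return (
        sum(1 for x in lengths if x == 1) / n,
        sum(1 for x in lengths if x <= 3) / n,
        sum(1 for x in lengths if x >= 10) / n,
    )


def distance(s1, s2):
    """Euclidean distance between two summary-statistic vectors."""
    return sum((a - b) ** 2 for a, b in zip(s1, s2)) ** 0.5


def abc_model_choice(observed, n_sims=2000, keep=100, seed=0):
    """Rejection ABC: approximate posterior model probabilities."""
    rng = random.Random(seed)
    obs = summary(observed)
    sims = []
    for _ in range(n_sims):
        model = rng.choice(["geometric", "zipf"])  # uniform prior over models
        if model == "geometric":
            pmf = geometric_pmf(rng.uniform(0.05, 0.95))  # assumed prior on p
        else:
            pmf = zipf_pmf(rng.uniform(1.01, 3.0))  # assumed prior on exponent
        simulated = sample_lengths(pmf, len(observed), rng)
        sims.append((distance(summary(simulated), obs), model))
    sims.sort(key=lambda item: item[0])
    accepted = [model for _, model in sims[:keep]]
    return {m: accepted.count(m) / keep for m in ("geometric", "zipf")}


# Toy "observed" data drawn from a Zipf distribution, for illustration only.
observed = sample_lengths(zipf_pmf(1.5), 500, random.Random(1))
posterior = abc_model_choice(observed)
```

In this toy setting the heavy tail of the Zipf data is hard for a geometric model to reproduce, so most accepted simulations should come from the Zipf model. A realistic analysis would simulate sequence evolution along a phylogeny and compute summary statistics on alignments, as the abstract describes.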
Availability and Implementation: The data underlying this article are available on GitHub, at https://github.com/elyawy/SpartaSim and https://github.com/elyawy/SpartaPipeline.
Full text | Source
---|---
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10868340 | PMC
http://dx.doi.org/10.1093/bioinformatics/btae043 | DOI
Bioinformatics
February 2024
The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.
BMC Bioinformatics
March 2023
Department of Laboratory Medicine, AZ Delta General Hospital, Deltalaan 1, 8800, Roeselare, Belgium.
Background: DNA mismatch repair deficiency (dMMR) testing is crucial for detection of microsatellite unstable (MSI) tumors. MSI is detected by aberrant indel length distributions of microsatellite markers, either by visual inspection of PCR-fragment length profiles or by automated bioinformatic scoring on next-generation sequencing (NGS) data. The former is time-consuming and low-throughput while the latter typically relies on simplified binary scoring of a single parameter of the indel distribution.
Front Bioeng Biotechnol
January 2020
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
Although genome sequencing has become increasingly popular, the simulation of individual genomes is still important. This is because sequencing a large number of individual genomes is costly and genome data with extreme and boundary conditions, such as fatal genetic defects, are difficult to obtain. Privacy and legal barriers also prevent many applications of real data.
J Alzheimers Dis
July 2019
Department of Pathology, Hospital Universitari and Health Sciences Research Institute Germans Trias i Pujol, Universitat Autònoma de Barcelona, Spain.
Lewy body diseases (LBD) include Parkinson's disease (PD) and dementia with Lewy bodies (DLB), and together with Alzheimer's disease (AD) they show an important neuropathological and clinical overlap. The human alpha- and beta-synuclein genes (SNCA and SNCB) are key factors for the development of Lewy body diseases. Here, we aimed to analyze the genotype distribution of potentially functional SNPs in SNCA and SNCB, to perform haplotype analysis for SNCB, and to identify functional insertion and deletion (INDEL) variations within the regulatory region of SNCB which might be responsible for the drastically diminished beta-synuclein levels reported for pure DLB.
BMC Bioinformatics
August 2016
Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, 820-8502, Japan.
Background: Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of sequence evolution through indel processes. Recent indel probabilistic models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of column-to-column transitions or block-wise contributions along the alignment. However, it is not a priori clear how these models are related to any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the time axis.