Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present EDAR, a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is calculated. The read is then evaluated based on the frequency with which each of its k-mers occurs in the complete data set (the k-count). Within each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm, and each cluster is classified as an error region or a non-error region based on the k-count of its cluster center. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After error removal, the error rate decreased for all data sets tested (∼35-fold reduction, on average). EDAR has accuracy comparable to methods that correct rather than remove errors, and it performs better than those methods on simulated data sets when the error rate is greater than 3%. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic. Following error detection with error correction, rather than removal, may improve the assembly results.
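
The k-count idea described in the abstract can be illustrated with a minimal Python sketch. It is not the published implementation: it assumes a fixed k-count threshold (the hypothetical parameter `min_count`) in place of the variable-bandwidth mean-shift clustering and the heuristic error localisation, and the function names `kmers`, `k_counts`, and `split_read` are invented here for illustration only.

```python
from collections import Counter


def kmers(read, k):
    """Return every length-k substring of the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def k_counts(reads, k):
    """Count how often each k-mer occurs across the whole data set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def split_read(read, counts, k, min_count):
    """Split a read into fragments whose k-mers all reach min_count.

    Assumption: a fixed threshold stands in for the mean-shift
    clustering and error localisation described in the abstract.
    """
    fragments = []
    start = None  # index of the first k-mer in the current trusted run
    for i in range(len(read) - k + 1):
        trusted = counts[read[i:i + k]] >= min_count
        if trusted and start is None:
            start = i
        elif not trusted and start is not None:
            # The trusted run ended at k-mer i - 1, which covers bases
            # up to index i - 1 + k - 1, so the slice stops at i - 1 + k.
            fragments.append(read[start:i - 1 + k])
            start = None
    if start is not None:
        # The trusted run extends to the end of the read.
        fragments.append(read[start:])
    return fragments


if __name__ == "__main__":
    # Five copies of a correct read plus one read with a substitution.
    reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
    counts = k_counts(reads, k=3)
    print(split_read("ACGTTCGT", counts, k=3, min_count=2))
```

In this toy data set, the k-mers that overlap the substituted base occur only once, so the read is cut around that base and the surviving fragments contain only k-mers seen at least twice. With short reads such fragments quickly become very small, which mirrors the caveat about splitting raised in the abstract.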

Source
http://dx.doi.org/10.1089/cmb.2010.0127
