Background: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.

Results: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment that targets Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders of magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 achieves high numbers of true-positive corrections, comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data.
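To make the reported counts easier to compare, the following minimal sketch computes correction precision, TP / (TP + FP), from the figures quoted above. The function and variable names are illustrative assumptions and are not part of CARE. The competitor figures are bounds taken across several tools (at least 3.9M FPs, at most 814.5M TPs), so pairing them yields an optimistic upper bound on competitor precision rather than the result of any single tool.

```python
def precision(tp: float, fp: float) -> float:
    """Fraction of performed corrections that were correct: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no corrections were made")
    return tp / (tp + fp)


# Counts in millions, as quoted in the abstract for the simulated human dataset.
care_tp, care_fp = 801.4, 1.2

# Bounds over the other tools: at most 814.5M TPs, at least 3.9M FPs.
# Assumption: combining the two bounds gives a best case that no single
# competing tool necessarily reaches.
other_tp_max, other_fp_min = 814.5, 3.9

care_precision = precision(care_tp, care_fp)
other_precision_bound = precision(other_tp_max, other_fp_min)
fp_ratio = other_fp_min / care_fp  # lower bound on the FP reduction factor

print(f"CARE 2.0 precision:          {care_precision:.5f}")
print(f"Competitor precision (max):  {other_precision_bound:.5f}")
print(f"FP reduction factor (min):   {fp_ratio:.2f}x")
```

Under these assumptions, CARE 2.0 produces at least about 3.25 times fewer false positives on this dataset while giving up roughly 13M true positives relative to the competitor maximum.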

Conclusion: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of such corrections compared to other state-of-the-art programs, including BFC, Karect, Musket, Bcool, SGA, and Lighter. The resulting higher-quality datasets improve k-mer analysis and de-novo assembly on real-world data, which demonstrates the applicability of machine learning techniques to sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9195321
DOI: http://dx.doi.org/10.1186/s12859-022-04754-3