NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning.

BMC Genomics

Department of Electronic Engineering, Information School, Yunnan University, Kunming, Yunnan, China.

Published: June 2024

Background: Single-pass long reads generated by third-generation sequencing technology exhibit a high error rate, while circular consensus sequencing (CCS) produces shorter reads. It is therefore effective to reduce the error rate of long reads algorithmically, with the help of homologous, high-precision, low-cost short reads from Next Generation Sequencing (NGS) technology.

Methods: This work proposes NmTHC, a hybrid error correction method based on a generative neural machine translation model. The model automatically captures discrepancies within the aligned regions of long reads and short reads, as well as the contextual relationships within the long reads themselves. Like a natural language sequence, a long read can be regarded as a special "genetic language" and processed with generative neural networks. The algorithm builds a sequence-to-sequence (seq2seq) framework with a Recurrent Neural Network (RNN) as the core layer. The long reads before and after correction are treated as sentences in the source and target languages of a translation task, and the alignment information between long reads and short reads is used to create the training corpus. The trained model is then used to predict the corrected long read.
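
The corpus-building idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the gapped-alignment input, the fixed `window` size, the non-overlapping k-mer "words", and the function names `kmer_tokens` and `corpus_from_alignment` are all assumptions made for illustration. It only shows how an alignment between a raw long read and a short-read consensus could be turned into source/target "sentence" pairs.

```python
from typing import List, Tuple


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a sequence into non-overlapping k-mer 'words'.

    The final token may be shorter than k. (Hypothetical tokenization;
    the abstract does not specify how reads are tokenized.)
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def corpus_from_alignment(
    aligned_long: str,
    aligned_consensus: str,
    window: int = 12,
    k: int = 3,
) -> List[Tuple[str, str]]:
    """Build (source, target) sentence pairs from one gapped alignment.

    aligned_long      -- raw long read with '-' marking alignment gaps
    aligned_consensus -- short-read consensus over the same columns
    Both strings must have the same number of alignment columns.
    """
    if len(aligned_long) != len(aligned_consensus):
        raise ValueError("aligned strings must have equal length")

    pairs: List[Tuple[str, str]] = []
    for start in range(0, len(aligned_long), window):
        # Cut both rows at the same alignment columns, then drop gaps.
        src = aligned_long[start:start + window].replace("-", "")
        tgt = aligned_consensus[start:start + window].replace("-", "")
        if src and tgt:
            pairs.append((" ".join(kmer_tokens(src, k)),
                          " ".join(kmer_tokens(tgt, k))))
    return pairs


if __name__ == "__main__":
    # Toy alignment: one deletion and one insertion in the long read.
    for source, target in corpus_from_alignment(
        "ACGT-ACGTTAGC", "ACGTTACG-TAGC", window=6
    ):
        print(f"{source!r} -> {target!r}")
```

Cutting both rows at the same alignment columns keeps each source sentence paired with the target text it should translate to, even when insertions or deletions make the two ungapped strings differ in length.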

Results: NmTHC outperforms the latest mainstream hybrid error correction methods on real-world datasets from two mainstream platforms, PacBio and Nanopore. On the six benchmark datasets, NmTHC aligns more bases to the reference genome without segmenting the reads, showing that it improves alignment identity without sacrificing the length advantage of long reads.
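
As a rough illustration of the two quantities mentioned here, the Python sketch below computes the number of aligned bases and an alignment identity from a gapped pairwise alignment. The definition used (matching columns divided by all alignment columns) is one common convention and an assumption on our part; the abstract does not state which definition or alignment tool the evaluation uses, and the function names are hypothetical.

```python
def aligned_bases(aligned_read: str) -> int:
    """Count read bases that take part in the alignment (non-gap columns)."""
    return sum(1 for base in aligned_read if base != "-")


def alignment_identity(aligned_read: str, aligned_ref: str) -> float:
    """Fraction of alignment columns where read and reference agree.

    Gap columns count as mismatches. This is an assumed definition;
    other conventions exclude gaps from the denominator.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    columns = len(aligned_read)
    if columns == 0:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_read, aligned_ref) if a == b and a != "-"
    )
    return matches / columns


if __name__ == "__main__":
    reference = "ACGTTACG-TAGC"
    raw_read = "ACGT-ACGTTAGC"  # toy read with one deletion and one insertion
    print(aligned_bases(raw_read), alignment_identity(raw_read, reference))
```

Under this convention, a corrected read that removes indels and substitutions raises the identity, while keeping the read unsegmented preserves the aligned-base count.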

Conclusion: NmTHC adopts the generative Neural Machine Translation (NMT) model to recast hybrid error correction as a machine translation problem, offering a novel perspective on long-read error correction based on ideas from Natural Language Processing (NLP). Notably, the proposed method is independent of the sequencing technology and produces more precise reads.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11157743
DOI: http://dx.doi.org/10.1186/s12864-024-10446-4
