A digital twin for DNA data storage based on comprehensive quantification of errors and biases.

Nat Commun

Department of Chemistry and Applied Biosciences, ETH Zürich, Vladimir-Prelog-Weg 1-5, 8093, Zürich, Switzerland.

Published: September 2023

Archiving data in synthetic DNA offers unprecedented storage density and longevity. Handling and storage introduce errors and biases into DNA-based storage systems, necessitating the use of Error Correction Coding (ECC), which comes at the cost of added redundancy. However, insufficient data on these errors and biases, as well as a lack of modeling tools, limit data-driven ECC development and experimental design. In this study, we present a comprehensive characterisation of the error sources and biases present in the most common DNA data storage workflows, including commercial DNA synthesis, PCR, decay by accelerated aging, and sequencing-by-synthesis. Using the data from 40 sequencing experiments, we build a digital twin of the DNA data storage process, capable of simulating state-of-the-art workflows and reproducing their experimental results. We showcase the digital twin's ability to replace experiments and rationalize the design of redundancy in two case studies, highlighting opportunities for tangible cost savings and data-driven ECC development.
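
To make the idea of simulating an error-prone storage workflow concrete, the following is a minimal Python sketch of a toy error channel that applies substitutions, deletions, and insertions to oligo sequences. It is not the authors' digital twin: the function names `mutate` and `simulate_pool`, the single per-base rates, and the example values are all assumptions chosen for illustration, and real workflows show position- and process-dependent errors and coverage biases that this sketch ignores.

```python
import random

BASES = "ACGT"

def mutate(seq, p_sub, p_del, p_ins, rng):
    """Pass one sequence through a toy per-base error channel.

    p_sub, p_del, p_ins are hypothetical per-base probabilities of
    substitution, deletion, and insertion (not measured values).
    """
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            # Deletion: the base is dropped from the read.
            pass
        elif r < p_del + p_sub:
            # Substitution: replace with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        # Insertion: independently add a random base after this position.
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))
    return "".join(out)

def simulate_pool(oligos, reads_per_oligo, p_sub, p_del, p_ins, seed=0):
    """Generate noisy reads for each oligo with uniform coverage (an assumption)."""
    rng = random.Random(seed)
    return {
        oligo: [mutate(oligo, p_sub, p_del, p_ins, rng) for _ in range(reads_per_oligo)]
        for oligo in oligos
    }

if __name__ == "__main__":
    # Illustrative rates only; they are placeholders, not results from the study.
    reads = simulate_pool(["ACGTACGTACGTACGT"], reads_per_oligo=5,
                          p_sub=0.01, p_del=0.005, p_ins=0.001, seed=1)
    for original, noisy in reads.items():
        print(original, noisy)
```

A simulation of this kind lets one ask how many noisy reads per oligo, or how much ECC redundancy, would be needed to recover the original sequences under assumed error rates before running an experiment.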

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10533828
DOI: http://dx.doi.org/10.1038/s41467-023-41729-1

Publication Analysis

Top Keywords

dna data: 12
data storage: 12
errors biases: 12
digital twin: 8
twin dna: 8
data-driven ecc: 8
ecc development: 8
data: 6
storage: 6
dna: 5

Similar Publications

Successful Diagnosis of Sengers Syndrome Using a Comprehensive Genomic Analysis.

Mol Genet Genomic Med

January 2025

Diagnostics and Therapeutics of Intractable Diseases, Intractable Disease Research Center, Graduate School of Medicine, Juntendo University, Tokyo, Japan.

Background: Sengers syndrome is an autosomal recessive mitochondrial DNA depletion syndrome characterized by hypertrophic cardiomyopathy, congenital cataracts, skeletal myopathy, exercise intolerance, and lactic acidosis. Dysfunction of acylglycerol kinase (AGK) is responsible for the disease, and several AGK gene variants have been reported.

Methods: We employed a comprehensive genomic analysis approach, including whole-genome sequencing and RNA sequencing, combined with various bioinformatics tools.

Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language model (GENA-LM), a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 000 base pairs.

This review discusses the possibility of inheritance of some diseases through mutations in mitochondrial DNA. These are examples of many mitochondrial diseases that can be caused by mutations in mitochondrial DNA. Symptoms and severity can vary widely depending on the specific mutation and affected tissues.

Introduction: Schistosomiasis (Bilharzia), a neglected tropical disease caused by parasites, afflicts over 240 million people globally, disproportionately impacting Sub-Saharan Africa. Current diagnostic tests, despite their utility, suffer from limitations like low sensitivity. Polymerase chain reaction (PCR) and quantitative real-time PCR (qPCR) remain the most common and sensitive nucleic acid amplification tests.

The eastern or Tasmanian bettong ( ) is one of four extant bettong species and is listed as 'Near Threatened' by the IUCN. We sequenced short read data on the 10x system to generate a reference genome 3.46Gb in size and contig N50 of 87.
