Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual's genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications. Previous graph genome software implementations have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and using these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, improving accuracy over alignment to a linear reference and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at a gigabase scale, or at the topological complexity of de novo assemblies.
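To make the idea of a bidirected sequence graph concrete, here is a minimal Python sketch. It is an illustration only and not vg's data model or API: the class name ToyVariationGraph, its methods, and the example sequences are all hypothetical. Each node holds a DNA sequence, each step of a walk visits a node in forward or reverse-complement orientation, and alternative walks through the graph spell out alternative alleles.

```python
from dataclasses import dataclass, field

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


@dataclass
class ToyVariationGraph:
    """Hypothetical toy structure; not how vg stores graphs."""
    nodes: dict = field(default_factory=dict)  # node id -> sequence
    edges: set = field(default_factory=set)    # pairs of (node id, is_reverse) steps

    def add_node(self, node_id: int, seq: str) -> None:
        self.nodes[node_id] = seq

    def add_edge(self, src: tuple, dst: tuple) -> None:
        # Bidirected edges can be traversed from either end: going
        # src -> dst on one strand implies flipped(dst) -> flipped(src)
        # on the opposite strand, so both traversals are stored.
        self.edges.add((src, dst))
        self.edges.add(((dst[0], not dst[1]), (src[0], not src[1])))

    def spell(self, walk: list) -> str:
        """Spell the sequence of a walk given as (node id, is_reverse) steps."""
        for a, b in zip(walk, walk[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge from {a} to {b}")
        return "".join(
            reverse_complement(self.nodes[n]) if rev else self.nodes[n]
            for n, rev in walk
        )


# A SNP (A or C) flanked by shared sequence, plus one inverting edge.
graph = ToyVariationGraph()
graph.add_node(1, "GATT")
graph.add_node(2, "A")    # one allele
graph.add_node(3, "C")    # the other allele
graph.add_node(4, "ACA")
graph.add_edge((1, False), (2, False))
graph.add_edge((1, False), (3, False))
graph.add_edge((2, False), (4, False))
graph.add_edge((3, False), (4, False))
graph.add_edge((2, False), (4, True))  # enters node 4 in reverse: an inversion

print(graph.spell([(1, False), (2, False), (4, False)]))
print(graph.spell([(1, False), (3, False), (4, False)]))
print(graph.spell([(1, False), (2, False), (4, True)]))
```

The two forward walks differ only at the SNP node, so the shared flanking sequence is stored once. The third walk traverses node 4 in reverse orientation, which is how a structure of this kind can encode an inversion without duplicating sequence.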

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6126949
DOI: http://dx.doi.org/10.1038/nbt.4227
