Background: Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. When applied to diploid genome assembly, these assemblers perform poorly, owing to the violation of assumptions during both the contigging and scaffolding phases. Effective tools to overcome these problems are in growing demand. Increasing parameter stringency during contigging is an effective solution to obtaining haplotype-specific contigs; however, effective algorithms for scaffolding such contigs are lacking.

Methods: We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. The algorithm identifies homologous sequences as found in "bubble" structures in scaffold graphs. Machine learning classification is then used to label sequences in partial bubbles as homologous or non-homologous before haplotype-specific scaffolds are reconstructed. We define four new metrics for assessing diploid scaffolding accuracy: contig sequencing depth, contig homogeneity, phase group homogeneity, and heterogeneity between phase groups.
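
As an illustration of the "bubble" idea only, the following minimal Python sketch finds simple bubbles in a directed scaffold graph: a source contig that branches into exactly two contigs, both of which rejoin at the same sink. It is not the ScaffoldScaffolder implementation; the function name `find_simple_bubbles`, the edge-list input format, and the contig labels are assumptions made for this example, and real scaffold graphs also carry orientation, distance, and read-support information that is ignored here.

```python
from collections import defaultdict


def find_simple_bubbles(edges):
    """Return (source, (branch_a, branch_b), sink) for each simple bubble.

    `edges` is an iterable of (from_contig, to_contig) pairs. This is a
    hypothetical, simplified representation of a scaffold graph.
    """
    succ = defaultdict(set)  # contig -> contigs it links forward to
    pred = defaultdict(set)  # contig -> contigs that link into it
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    bubbles = []
    for source in sorted(succ):
        branches = sorted(succ[source])
        if len(branches) != 2:
            continue  # a simple bubble opens with exactly two branches
        a, b = branches
        # Each branch may only be entered from the source.
        if pred[a] != {source} or pred[b] != {source}:
            continue
        # Both branches must exit to the same single sink.
        if len(succ[a]) != 1 or succ[a] != succ[b]:
            continue
        (sink,) = succ[a]
        if sink == source:
            continue  # skip trivial cycles
        bubbles.append((source, (a, b), sink))
    return bubbles


# Toy graph: c2a and c2b are alternative paths between c1 and c3.
edges = [
    ("c1", "c2a"), ("c1", "c2b"),
    ("c2a", "c3"), ("c2b", "c3"),
    ("c3", "c4"),
]
bubbles = find_simple_bubbles(edges)
print(bubbles)  # [('c1', ('c2a', 'c2b'), 'c3')]
```

In this toy graph the two branch contigs would be treated as a candidate homologous pair. A partial bubble, where one branch lacks the closing edge, is not reported by this sketch; per the abstract, such cases are the ones handed to the classifier.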

Results: We demonstrate the viability of using bubbles to identify heterozygous homologous contigs, which we term homolotigs. We show that machine learning classification trained on these homolotig pairs can be used effectively for identifying homologous sequences elsewhere in the data with high precision (assuming error-free reads).

Conclusion: More work is required to compare this approach, on real data and with various parameters and classifiers, against other diploid genome assembly methods. However, the initial results of ScaffoldScaffolder lend support to the idea of employing machine learning in the difficult task of diploid genome assembly. Software is available at http://bioresearch.byu.edu/scaffoldscaffolder.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4423727
DOI: http://dx.doi.org/10.1186/1471-2105-16-S7-S5

Publication Analysis

Top Keywords (with counts)

genome assembly: 16
diploid genome: 12
machine learning: 12
homologous sequences: 8
learning classification: 8
homologous: 5
diploid: 5
scaffolding: 5
heterozygous genome: 4
assembly: 4
