Heterozygous genome assembly via binary classification of homologous sequence.

Paul M Bodily, M Fujimoto, Cameron Ortega, Nozomu Okuda, Jared C Price, Mark J Clement, Quinn Snell

BMC Bioinformatics

Published: October 2015

Background: Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. When applied to diploid genome assembly, these assemblers perform poorly because their assumptions are violated during both the contigging and scaffolding phases. Effective tools to overcome these problems are in growing demand. Increasing parameter stringency during contigging is an effective way to obtain haplotype-specific contigs; however, algorithms for scaffolding such contigs are lacking.

Methods: We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. The algorithm identifies homologous sequences as found in "bubble" structures in scaffold graphs. A machine learning classifier then labels sequences in partial bubbles as homologous or non-homologous before haplotype-specific scaffolds are reconstructed. We define four new metrics for assessing diploid scaffolding accuracy: contig sequencing depth, contig homogeneity, phase group homogeneity, and heterogeneity between phase groups.
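To illustrate what a "bubble" in a scaffold graph looks like, here is a minimal sketch in Python. It is not ScaffoldScaffolder's implementation: the edge-list representation, the function name `find_simple_bubbles`, and the restriction to two-arm bubbles with single-contig arms are all assumptions made for illustration.

```python
from collections import defaultdict


def find_simple_bubbles(edges):
    """Return (source, arm_a, arm_b, sink) tuples for simple two-arm bubbles.

    `edges` is an iterable of (from_contig, to_contig) links. This is a
    hypothetical, simplified model of a scaffold graph, not the paper's.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    bubbles = []
    for source in sorted(succ):
        arms = sorted(succ[source])
        if len(arms) != 2:
            continue
        a, b = arms
        # Each arm must be reachable only from the source contig.
        if pred[a] != {source} or pred[b] != {source}:
            continue
        # Each arm must lead to exactly one contig, and to the same one.
        if len(succ[a]) != 1 or succ[a] != succ[b]:
            continue
        (sink,) = succ[a]
        bubbles.append((source, a, b, sink))
    return bubbles


# Toy graph: c1 splits into two candidate haplotype arms that rejoin at c2.
edges = [
    ("c1", "h1a"), ("c1", "h1b"),
    ("h1a", "c2"), ("h1b", "c2"),
    ("c2", "c3"),
]
print(find_simple_bubbles(edges))
```

In this toy model, the two arms of each reported bubble are the candidate homologous contig pair. A partial bubble, where only one end of the structure is present, would fail the checks above and would need the classification step described in the abstract.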

Results: We demonstrate the viability of using bubbles to identify heterozygous homologous contigs, which we term homolotigs. We show that machine learning classification trained on these homolotig pairs can be used effectively for identifying homologous sequences elsewhere in the data with high precision (assuming error-free reads).
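The idea of training on bubble-derived pairs and then classifying other pairs can be sketched as follows. The abstract does not state which features or classifier were used, so everything here is an assumption: a single k-mer Jaccard similarity feature and a simple midpoint threshold stand in for the machine learning classifier, and the names `jaccard`, `fit_threshold`, and `is_homologous` are hypothetical.

```python
def kmers(seq, k=4):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=4):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def fit_threshold(pairs, labels, k=4):
    """Learn a decision threshold from labeled sequence pairs.

    The threshold is the midpoint between the mean similarity of the
    homologous pairs and that of the non-homologous pairs. This is a
    stand-in for a real classifier, chosen only for brevity.
    """
    pos = [jaccard(a, b, k) for (a, b), y in zip(pairs, labels) if y]
    neg = [jaccard(a, b, k) for (a, b), y in zip(pairs, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one pair of each class")
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def is_homologous(a, b, threshold, k=4):
    """Classify a pair as homologous if its similarity meets the threshold."""
    return jaccard(a, b, k) >= threshold


# Toy training data: one pair differing by a single base (labeled
# homologous, as if taken from a bubble) and one unrelated pair.
train_pairs = [
    ("ACGTACGTTAGC", "ACGTACCTTAGC"),
    ("AAAAAAAAAAAA", "CCCCCCCCCCCC"),
]
train_labels = [True, False]
threshold = fit_threshold(train_pairs, train_labels)
print(is_homologous("ACGTACGTTAGC", "ACGTACGTTAGC", threshold))
```

A real pipeline would likely use several features per pair and a trained model with held-out evaluation; the sketch only shows the binary homologous versus non-homologous decision structure.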

Conclusion: More work is required to compare this approach on real data, with various parameters and classifiers, against other diploid genome assembly methods. However, the initial results of ScaffoldScaffolder lend support to the idea of employing machine learning in the difficult task of diploid genome assembly. Software is available at http://bioresearch.byu.edu/scaffoldscaffolder.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4423727
DOI: http://dx.doi.org/10.1186/1471-2105-16-S7-S5

Top Keywords: genome assembly, diploid genome, machine learning, homologous sequences, learning classification, homologous, diploid, scaffolding, heterozygous genome, assembly

