Background: Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. When applied to diploid genome assembly, these assemblers perform poorly because their assumptions are violated during both the contigging and scaffolding phases. Tools that overcome these problems are in growing demand. Increasing parameter stringency during contigging is an effective way to obtain haplotype-specific contigs; however, effective algorithms for scaffolding such contigs are lacking.
Methods: We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. The algorithm identifies homologous sequences that appear as "bubble" structures in scaffold graphs. A machine learning classifier then labels sequences in partial bubbles as homologous or non-homologous before haplotype-specific scaffolds are reconstructed. We define four new metrics for assessing diploid scaffolding accuracy: contig sequencing depth, contig homogeneity, phase group homogeneity, and heterogeneity between phase groups.
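As an illustration of the "bubble" idea only, the following minimal sketch finds simple bubbles in a directed scaffold graph: a source contig with exactly two branches that rejoin at a single sink contig. It is not the ScaffoldScaffolder implementation; the function name `find_simple_bubbles`, the edge-list representation, and the contig labels are all hypothetical, and real scaffold graphs also carry orientation, distance, and link-weight information that is ignored here.

```python
from collections import defaultdict


def find_simple_bubbles(edges):
    """Return (source, branch_a, branch_b, sink) tuples for simple bubbles.

    `edges` is an iterable of (from_contig, to_contig) pairs. This is a
    hypothetical, simplified representation of a scaffold graph.
    """
    succ = defaultdict(set)  # contig -> contigs it links to
    pred = defaultdict(set)  # contig -> contigs linking into it
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    bubbles = []
    for source in sorted(succ):
        branches = sorted(succ[source])
        if len(branches) != 2:
            continue  # a simple bubble opens with exactly two branches
        a, b = branches
        # Each branch may only be reachable from the source.
        if pred[a] != {source} or pred[b] != {source}:
            continue
        # Both branches must continue to the same single contig.
        if len(succ[a]) != 1 or succ[a] != succ[b]:
            continue
        (sink,) = succ[a]
        if sink == source:
            continue  # skip degenerate cycles
        bubbles.append((source, a, b, sink))
    return bubbles


# Hypothetical contig names: c2a and c2b are the two candidate haplotypes.
example_edges = [
    ("c1", "c2a"),
    ("c1", "c2b"),
    ("c2a", "c3"),
    ("c2b", "c3"),
    ("c3", "c4"),
]
example_bubbles = find_simple_bubbles(example_edges)
print(example_bubbles)
```

In this sketch, the two branch contigs of each reported bubble are the candidate homologous pair. A partial bubble, where only one end of the structure is present, would not be reported and would need a separate decision step such as the classifier described above.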
Results: We demonstrate the viability of using bubbles to identify heterozygous homologous contigs, which we term homolotigs. We show that machine learning classification trained on these homolotig pairs can be used effectively for identifying homologous sequences elsewhere in the data with high precision (assuming error-free reads).
Conclusion: More work is required to compare this approach, on real data and with various parameters and classifiers, against other diploid genome assembly methods. However, the initial results of ScaffoldScaffolder lend validity to the idea of employing machine learning in the difficult task of diploid genome assembly. Software is available at http://bioresearch.byu.edu/scaffoldscaffolder.
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4423727 | PMC |
| http://dx.doi.org/10.1186/1471-2105-16-S7-S5 | DOI Listing |
BMC Genom Data
January 2025
Department of Applied Biosciences, College of Agriculture and Life Sciences, Kyungpook National University, Daegu, 41566, Republic of Korea.
Objectives: The data were collected to obtain the complete genome sequence of Pseudarthrobacter sp. NIBRBAC000502770, isolated from the rhizosphere of Sasamorpha in a heavy metal-contaminated coal mine in Hongcheon, Republic of Korea. The objective was to explore the strain's genetic potential for plant growth promotion and heavy metal resistance, particularly arsenate and copper.
Commun Biol
January 2025
College of Life Sciences, Capital Normal University, Haidian District, Beijing, China.
Phragmites australis is a globally distributed grass species (Poaceae) recognized for its vast biomass and exceptional environmental adaptability, making it an ideal model for studying wetland ecosystems and plant stress resilience. However, genomic resources for this species have been limited. In this study, we assembled a chromosome-level reference genome of P.
Mol Cell
January 2025
Department of Genetics and Development and Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY 10032, USA.
Cells integrate metabolic information into core molecular processes such as transcription to adapt to environmental changes. Chromatin, the physiological template of the eukaryotic genome, has emerged as a sensor and rheostat for fluctuating intracellular metabolites. In this review, we highlight the growing list of chromatin-associated metabolites that are derived from diverse sources.
Mol Cell
January 2025
Institute for Cancer Genetics and Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY 10032, USA; Department of Pediatrics and Department of Genetics and Development, Columbia University Irving Medical Center, New York, NY 10032, USA.
DNA replication, a fundamental process in all living organisms, proceeds with continuous synthesis of the leading strand by DNA polymerase ε (Pol ε) and discontinuous synthesis of the lagging strand by polymerase δ (Pol δ). This inherent asymmetry at each replication fork necessitates the development of methods to distinguish between these two nascent strands in vivo. Over the past decade, strand-specific sequencing strategies, such as enrichment and sequencing of protein-associated nascent DNA (eSPAN) and Okazaki fragment sequencing (OK-seq), have become essential tools for studying chromatin replication in eukaryotic cells.
Mol Biol Evol
January 2025
CAS Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, China.
Southwest China is characterized by high plateaus, large mountain systems, and deeply incised dry valleys formed by major rivers and their tributaries. Despite the considerable attention given to alpine plant radiations in this region, the timing and mode of diversification of the numerous dry valley plant lineages remain unknown. To address this knowledge gap, we investigated the macroevolution of Isodon (Lamiaceae), a lineage commonly distributed in the dry valleys in southwest China and wetter areas of Asia and Africa.