Background: With the rapid increase in genome sequencing projects for non-model organisms, numerous genome assemblies are in progress or available only as drafts rather than as satisfactory, usable genomes. Quality assessment of genome assemblies is becoming important not only for those who perform assembly or re-assembly, but also for those who use assemblies as maps in downstream analyses. Recent studies on quality control and quality assessment of genome assemblies have focused either on quality control of reads before assembly or on evaluation of assemblies with respect to their contiguity and correctness. However, correctness assessment depends on a reference and is therefore not applicable to de novo assembly projects. Hence, methods that provide both pre-assembly and post-assembly quality assessment reports, covering the input reads as well as the quality and correctness of de novo assemblies, are worth developing.

Results: We present SQUAT, an efficient tool for both pre-assembly and post-assembly quality assessment of de novo genome assemblies. The pre-assembly module of SQUAT computes quality statistics of reads and presents them in a portable HTML report that visualizes the distribution of high- and poor-quality reads. The post-assembly module of SQUAT provides read mapping analytics, also in HTML format. We categorized reads into several groups: uniquely mapped, multiply mapped, and unmapped reads. Uniquely mapped reads were further categorized as perfectly matched, with substitutions, containing clips, or others. We carefully defined poorly mapped (PM) reads as a set of several groups to prevent the underestimation of unmapped reads; a high PM% is a sign of a poor assembly that requires further examination or improvement before the assembly is used. Finally, we evaluated SQUAT on six datasets: the genome assemblies for eel, worm, mushroom, and three bacteria. The results show that SQUAT reports provide detailed, useful information for assessing the quality of assemblies and reads.
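The read categories described above can be illustrated with a minimal Python sketch that works on plain SAM text. This is not SQUAT's implementation, and every rule in it is an assumption made for illustration: a MAPQ of 0 is used as a proxy for multiply mapped reads, clips are detected from `S`/`H` operations in the CIGAR string, substitutions are inferred from the optional `NM` tag, and the set of categories counted toward PM% (`POORLY_MAPPED`) is hypothetical. The function names `classify` and `summarize` are likewise invented here.

```python
import re
from collections import Counter

# CIGAR made only of match/mismatch operations (no indels, clips, or skips).
ONLY_ALIGNED = re.compile(r"(\d+[M=X])+")
# Any soft (S) or hard (H) clip in the CIGAR string.
HAS_CLIP = re.compile(r"\d+[SH]")

# Assumption: categories counted as "poorly mapped" in this sketch only.
# SQUAT's own PM definition may differ.
POORLY_MAPPED = {"unmapped", "unique_clipped", "unique_other"}


def classify(sam_line):
    """Assign one SAM alignment record to an illustrative category."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq, cigar = int(fields[1]), int(fields[4]), fields[5]
    # Optional fields look like "NM:i:2"; keep tag name and value.
    tags = {t[:2]: t[5:] for t in fields[11:]}

    if flag & 0x4 or cigar == "*":
        return "unmapped"
    if mapq == 0:  # assumption: MAPQ 0 stands in for "multiply mapped"
        return "multi_mapped"
    if HAS_CLIP.search(cigar):
        return "unique_clipped"
    if ONLY_ALIGNED.fullmatch(cigar) and "NM" in tags:
        if int(tags["NM"]) == 0:
            return "unique_perfect"
        return "unique_substitutions"
    # Indels, skipped regions, or a missing NM tag end up here.
    return "unique_other"


def summarize(sam_lines):
    """Count categories over primary alignments and return (counts, PM%)."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        counts[classify(line)] += 1
    total = sum(counts.values())
    pm_percent = 100.0 * sum(counts[c] for c in POORLY_MAPPED) / total if total else 0.0
    return counts, pm_percent


if __name__ == "__main__":
    demo = [
        "r1\t0\tctg1\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    ]
    print(summarize(demo))
```

Working from SAM text keeps the sketch free of third-party dependencies; in practice an alignment library would handle BAM input and edge cases more robustly.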

Availability: The SQUAT software, with links to both its Docker image and the online manual, is freely available at https://github.com/luke831215/SQUAT.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7402383
DOI: http://dx.doi.org/10.1186/s12864-019-5445-3
