Semantic Assembly and Annotation of Draft RNAseq Transcripts without a Reference Genome.

PLoS One

Whitney Laboratory for Marine Biosciences, University of Florida; 9505 Ocean Shore Blvd, Saint Augustine, FL, 32080, United States of America; Department of Physiology and Functional Genomics, University of Florida; 9505 Ocean Shore Blvd, Saint Augustine, FL, 32080, United States of America.

Published: May 2016

Transcriptomes were among the first sources of high-throughput genomic data to benefit from the introduction of next-generation sequencing. As sequencing technology becomes more accessible, transcriptome sequencing can be applied to many organisms for which genome sequences are unavailable. Currently, all methods for de novo assembly are based on matching the overlapping nucleotide context between short fragments (reads). However, even short reads may contain biologically relevant information that can be used as hints to guide the assembly process. We propose a computational workflow for the reconstruction and functional annotation of expressed gene transcripts that does not require a reference genome sequence and can tolerate low coverage, high error rates and other issues that often lead to poor de novo assembly results in studies of non-model organisms. We start with either raw sequences or the output of a context-based de novo transcriptome assembly. Instead of mapping reads to a reference genome or creating a completely unsupervised clustering of reads, we assemble the unknown transcriptome using the nearest homologs from a public database as seeds. We consider even distant relations, indirectly linking protein-coding fragments to entire gene families in multiple distantly related genomes. The proposed method is intended as an additional step of semantic scaffolding (based on relations between protein-coding fragments) that follows traditional de novo assembly (based on sequence overlap). The method was effective in the analysis of the jellyfish Cyanea capillata transcriptome and may be applicable in other studies of gene expression in species lacking a high-quality reference genome sequence. Our algorithms are implemented in C and designed for parallel computation on a high-performance computer. The software is available free of charge under an open source license.
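
The seed-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' C implementation: it assumes that each fragment has already been assigned a best hit against a protein database by some external search tool, and the `Hit` record, its field names, and the functions `scaffold_by_seed` and `seed_coverage` are all hypothetical names introduced here for illustration. Fragments that share the same seed are grouped together and ordered by where they align on it, even when they do not overlap each other.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One fragment-to-seed alignment (hypothetical record layout)."""
    fragment_id: str  # contig or read identifier
    seed_id: str      # identifier of the nearest homolog used as a seed
    seed_start: int   # first aligned position on the seed (1-based)
    seed_end: int     # last aligned position on the seed (inclusive)


def scaffold_by_seed(hits):
    """Group fragments by shared seed and order them along that seed."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit.seed_id].append(hit)
    scaffolds = {}
    for seed_id, members in groups.items():
        # Sort by position on the seed, not by fragment overlap.
        members.sort(key=lambda h: (h.seed_start, h.seed_end))
        scaffolds[seed_id] = [h.fragment_id for h in members]
    return scaffolds


def seed_coverage(hits, seed_id):
    """Count seed positions covered by at least one fragment."""
    intervals = sorted(
        (h.seed_start, h.seed_end) for h in hits if h.seed_id == seed_id
    )
    covered = 0
    current_end = 0  # last seed position already counted
    for start, end in intervals:
        if end <= current_end:
            continue  # interval lies entirely inside the counted region
        covered += end - max(start, current_end + 1) + 1
        current_end = end
    return covered


# Made-up example data: three fragments hit seed P1, one hits seed P2.
example_hits = [
    Hit("c3", "P1", 200, 250),
    Hit("c1", "P1", 1, 100),
    Hit("c2", "P1", 90, 150),
    Hit("c4", "P2", 5, 60),
]
example_scaffolds = scaffold_by_seed(example_hits)
```

In this toy input, fragments `c2` and `c3` share no sequence overlap, yet both are placed on the same scaffold because they align to the same seed. That is the sense in which the grouping relies on relations between protein-coding fragments rather than on read overlap.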

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4578894 (PMC)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0138006 (PLOS)

Similar Publications

Blood-based epigenome-wide association study and prediction of alcohol consumption.

Clin Epigenetics

January 2025

Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK.

Alcohol consumption is an important risk factor for multiple diseases. It is typically assessed via self-report, which is open to measurement error through recall bias. Instead, molecular data such as blood-based DNA methylation (DNAm) could be used to derive a more objective measure of alcohol consumption by incorporating information from cytosine-phosphate-guanine (CpG) sites known to be linked to the trait.

The α-globin super-enhancer acts in an orientation-dependent manner.

Nat Commun

January 2025

Gene Regulation Laboratory, MRC Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, OX3 9DS, Oxford, UK.

Individual enhancers are defined as short genomic regulatory elements, bound by transcription factors, and able to activate cell-specific gene expression at a distance, in an orientation-independent manner. Within mammalian genomes, enhancer-like elements may be found individually or within clusters referred to as locus control regions or super-enhancers (SEs). While these behave similarly to individual enhancers with respect to cell specificity, distribution and distance, their orientation-dependence has not been formally tested.

Classification of Fibro-osseous Tumors in the Craniofacial Bones using DNA Methylation and Copy Number Alterations.

Mod Pathol

January 2025

Department of Pathology and Medical Biology, University Medical Center Groningen, Groningen, the Netherlands; Department of Pathology, Amsterdam University Medical Center, Amsterdam, the Netherlands.

Fibro-osseous tumors of the craniofacial bones are a heterogeneous group of lesions comprising cemento-osseous dysplasia (COD), cemento-ossifying fibroma (COF), juvenile trabecular ossifying fibroma (JTOF), psammomatoid ossifying fibroma (PsOF), fibrous dysplasia (FD), and low-grade osteosarcoma (LGOS) with overlapping clinicopathological features. However, their clinical behavior and treatment differ significantly, underlining the need for accurate diagnosis. Molecular diagnostic markers exist for subsets of these tumors, including GNAS mutations in FD, SATB2 fusions in PsOF, mutations involving the RAS-MAPK signaling pathway in COD, and MDM2 amplification in LGOS.

Whole genome sequencing characterization of Clostridioides difficile from Bulgaria during the COVID-19 pandemic.

Diagn Microbiol Infect Dis

January 2025

National Reference Laboratory of Control and Monitoring of Antibiotic Resistance (NRL-CMAR), Department Microbiology, National Center of Infectious and Parasitic Diseases (NCIPD), 26 Yanko Sakazov Blvd., Sofia, Bulgaria.

An increased incidence of Clostridioides difficile infections was documented in Bulgarian hospitals during COVID-19. Whole genome sequencing (WGS) was performed on 39 isolates collected from seven hospitals during 2015-2022. Antimicrobial resistance and toxin genes were inferred from the genomes.

Genomic sources from China are underrepresented in the population-specific reference database. We performed whole-genome sequencing or genome-wide genotyping on 1,207 individuals from four linguistically diverse groups (1,081 Sinitic, 56 Mongolic, 40 Turkic, and 30 Tibeto-Burman people) living in North China included in the 10K Chinese People Genomic Diversity Project (10K_CPGDP) to characterize the genetic architecture and adaptative history of ethnic groups in the Silk Road Region of China. We observed a population split between Northwest Chinese minorities (NWCMs) and Han Chinese since the Upper Paleolithic and later Neolithic genetic differentiation within NWCMs.
