Background: The main challenge in de novo genome assembly of DNA-seq data is dealing with repeats that are longer than the reads. In de novo transcriptome assembly of RNA-seq reads, on the other hand, this problem has so far been underestimated. Even though repeated sequences are fewer and shorter in transcriptomics, they still create ambiguities and confuse assemblers if not addressed properly. Most short-read transcriptome assemblers are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them.
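To make the ambiguity concrete, the following is a minimal sketch of how a sequence shared by two transcripts shows up in a DBG. It is an illustration under simplifying assumptions, not any assembler's implementation: the graph is node-centric (nodes are k-mers), reverse complements are ignored, and the two toy transcripts, the shared sequence GATTACA, and the function names `build_dbg` and `branching_nodes` are all hypothetical.

```python
def build_dbg(reads, k):
    """Build a simplified node-centric de Bruijn graph.

    Nodes are the k-mers found in the reads; an arc u -> v exists when the
    last k-1 bases of u equal the first k-1 bases of v.
    Simplification: reverse complements are not considered.
    """
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    succ = {km: set() for km in kmers}
    pred = {km: set() for km in kmers}
    for km in kmers:
        for base in "ACGT":
            nxt = km[1:] + base
            if nxt in kmers:
                succ[km].add(nxt)
                pred[nxt].add(km)
    return kmers, succ, pred


def branching_nodes(kmers, succ, pred):
    """Return the k-mers with more than one successor or predecessor."""
    return {km for km in kmers if len(succ[km]) > 1 or len(pred[km]) > 1}


# Two hypothetical transcripts that share the sequence GATTACA.
t1 = "CGCAGATTACATCGG"
t2 = "TTGCGATTACACCTG"

kmers, succ, pred = build_dbg([t1, t2], k=4)
print(sorted(branching_nodes(kmers, succ, pred)))  # ['GATT', 'TACA']
```

The two paths merge at GATT and split again at TACA. Once inside the shared stretch, the graph alone no longer records which entry pairs with which exit, which is the kind of ambiguity an assembler has to resolve.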

Results: The results of this work are threefold. First, we introduce a formal model for representing high copy-number and low-divergence repeats in RNA-seq data and exploit its properties to infer a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NP-complete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99-111, [1]), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644-652, [2]), and Oases (Schulz et al. in Bioinformatics 28(8):1086-1092, [3]) for the specific task of calling AS events. Third, we turn our focus to full-length transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Based on the observation that repeats create complicated regions in a DBG, and that assemblers can infer erroneous transcripts when they try to traverse these regions, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and neither read nor coverage information. We show that our simple method gives better results than Rsem-Eval (Li et al. in Genome Biol 15(12):553, [4]) and TransRate (Smith-Unna et al. in Genome Res 26(8):1134-1144, [5]) on both real and simulated datasets for detecting chimeras, and is therefore able to capture assembly errors missed by these methods.
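The idea of scoring transcripts from graph topology alone can be sketched as follows. This is a hypothetical proxy, not the measure proposed in the paper: it assumes that the fraction of a transcript's k-mers that are branching nodes in the DBG is a rough indicator of a troublesome region. The function names, the threshold, and the toy sequences are assumptions made for illustration, and reverse complements are ignored.

```python
def kmers_of(seq, k):
    """List the k-mers of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def branching_kmers(reads, k):
    """Return the k-mers with in-degree or out-degree above 1 in the DBG.

    Only graph topology is used: no coverage or read-mapping information.
    Simplification: reverse complements are not considered.
    """
    nodes = {km for read in reads for km in kmers_of(read, k)}
    branching = set()
    for km in nodes:
        out_deg = sum((km[1:] + b) in nodes for b in "ACGT")
        in_deg = sum((b + km[:-1]) in nodes for b in "ACGT")
        if out_deg > 1 or in_deg > 1:
            branching.add(km)
    return branching


def branching_fraction(transcript, branching, k):
    """Fraction of the transcript's k-mers that are branching nodes."""
    kms = kmers_of(transcript, k)
    if not kms:
        return 0.0
    return sum(km in branching for km in kms) / len(kms)


def flag_transcripts(transcripts, reads, k, threshold):
    """Map each transcript name to True when its score exceeds threshold."""
    branching = branching_kmers(reads, k)
    return {
        name: branching_fraction(seq, branching, k) > threshold
        for name, seq in transcripts.items()
    }


# Hypothetical toy data: two reads share GATTACA, a third is unrelated.
reads = ["CGCAGATTACATCGG", "TTGCGATTACACCTG", "GGGTTTCCCAAA"]
transcripts = {"through_repeat": reads[0], "clean": reads[2]}
print(flag_transcripts(transcripts, reads, k=4, threshold=0.1))
```

In this toy case the first transcript crosses two branching nodes out of twelve k-mers and is flagged, while the unrelated one is not. A practical score would presumably also look at the neighbourhood of the path rather than only the nodes on it, so the threshold here carries no meaning beyond the example.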


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5322684
DOI: http://dx.doi.org/10.1186/s13015-017-0091-2


