Sensitive and error-tolerant annotation of protein-coding DNA with BATH.

bioRxiv

R. Ken Coit College of Pharmacy, University of Arizona, Tucson, Arizona, USA.

Published: January 2024

We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base and simplifies the pHMM-based annotation workflow by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 when annotating sequences that contain no errors, and is more accurate than all tested tools when annotating sequences that contain nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is the case with long-read sequencing data and with pseudogenes.
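
To illustrate why frameshift-inducing indels matter for translated alignment, the following is a minimal Python sketch. It is not BATH's algorithm and does not use BATH or HMMER3; the function name `translate` and the example sequence are hypothetical, chosen only for illustration. It shows that deleting a single nucleotide scrambles every downstream codon in the original reading frame, and that resuming translation in the shifted frame recovers the original downstream residues, which is the kind of signal a frameshift-aware method can exploit.

```python
# Illustrative only: a toy translation routine, not part of BATH or HMMER3.
# Standard genetic code (NCBI translation table 1), codons ordered by T, C, A, G.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate DNA from its first base, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical error-free coding sequence.
dna = "ATGGCCAAGCTGACCGGTTAA"
intact = translate(dna)            # "MAKLTG*"

# Simulate a sequencing error: delete the nucleotide at index 4.
mutated = dna[:4] + dna[5:]
shifted = translate(mutated)       # "MAS*PV" (premature stop, wrong residues)

# Resuming one base earlier (index 5 instead of 6) restores the original frame.
recovered = translate(mutated[5:]) # "KLTG*"

print(intact, shifted, recovered)
```

A tool that translates each frame independently would see only short, unrelated fragments on either side of the deletion; modeling the frame change explicitly lets the two fragments be reported as one hit.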

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10802276
DOI: http://dx.doi.org/10.1101/2023.12.31.573773

Similar Publications

Comprehensive analysis of the multi-rings mitochondrial genome of Populus tomentosa.

BMC Genomics

January 2025

State Key Laboratory of Tree Genetics and Breeding, National Engineering Research Center of Tree Breeding and Ecological Restoration, Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, 100083, China.

Background: Populus tomentosa, known as Chinese white poplar, is indigenous to China and distributed across large areas of the country, where it plays multiple important roles in forestry, agriculture, conservation, and urban horticulture. However, limited accessibility to the mitochondrial (mt) genome of P. tomentosa impedes phylogenetic and population genetic analyses and restricts functional gene research in the Salicaceae family.

Evaluating the accuracy of protein-coding sequences in genome annotations is a challenging problem for which there is no broadly applicable solution. In this manuscript, we introduce PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to help assess the quality of protein-coding gene annotations. Utilizing a machine learning model trained on a diverse dataset from over 1000 plant and animal genomes, PSAURON assigns a score to coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein-coding region.

Long non-coding RNAs (lncRNAs) are essential components of innate immunity, maintaining the functionality of immune systems that control virus infection. However, how lncRNAs engage immune responses during influenza A virus (IAV) infection remains unclear. Here, we show that lncRNA USP30-AS1 is up-regulated by infection of multiple different IAV subtypes and is required for tuning inflammatory and antiviral response in IAV infection.

Thunb. (1784) is primarily distributed in eastern Asia, has a total length of 152,778 bp and consists of a large single copy (LSC) region of 84,517 bp, a small single copy (SSC) region of 18,277 bp, and two inverted repeat (IR) regions of 24,992 bp each. The GC content is 37.

The complete chloroplast genome of Hemsl. 1889 (Ericaceae).

Mitochondrial DNA B Resour

December 2024

Jiangsu Key Laboratory for the Research and Utilization of Plant Resources, Institute of Botany, Jiangsu Province and Chinese Academy of Sciences (Nanjing Botanical Garden Mem. Sun Yat-Sen), Nanjing, China.

Hemsl. 1889 is an endemic deciduous shrub in China, belonging to the family Ericaceae. In this study, the first complete chloroplast genome of was assembled and annotated.

