Background: While eukaryotic noncoding RNAs have recently received intense scrutiny, it is becoming clear that bacterial transcription is at least as pervasive. Bacterial small RNAs and antisense RNAs (sRNAs) are often assumed to be noncoding because they lack long open reading frames (ORFs). However, there are numerous examples of sRNAs that encode small proteins, whether or not they also have a regulatory role at the RNA level.
Methods: Here, we apply flexible machine learning techniques based on sequence features and comparative genomics to quantify the prevalence of sRNA ORFs under natural selection to maintain protein-coding function in 14 phylogenetically diverse bacteria. Importantly, we quantify uncertainty in our predictions, and follow up on them using mass spectrometry proteomics and comparison to datasets including ribosome profiling.
Results: A majority of annotated sRNAs have at least one ORF between 10 and 50 amino acids long, and we conservatively predict that 409±191.7 unannotated sRNA ORFs are under selection to maintain coding (mean estimate and 95% confidence interval), an average of 29 per species considered here. This implies that overall at least 10.3±0.5% of sRNAs have a coding ORF, and in some species around 20% do. 165±69 of these novel coding ORFs have some antisense overlap to annotated ORFs. As experimental validation, many of our predictions are translated in published ribosome profiling data and are identified via mass spectrometry shotgun proteomics. B. subtilis sRNAs with coding ORFs are enriched for high expression in biofilms and confluent growth, and S. pneumoniae sRNAs with coding ORFs are involved in virulence. sRNA coding ORFs are enriched for transmembrane domains and many are predicted novel components of type I toxin/antitoxin systems.
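The first step behind a statement such as "at least one ORF between 10 and 50 amino acids long" is enumerating candidate ORFs in each sRNA. The following is a minimal sketch of that enumeration only, not the authors' pipeline: the start codon set (ATG/GTG/TTG), the stop codons, the convention of counting length in amino acids without the stop codon, and the function names are all assumptions made here for illustration. It does not attempt the selection or coding-potential scoring described in the Methods.

```python
# Minimal sketch: enumerate short ORFs (10-50 aa) on both strands of an sRNA.
# Assumptions (not from the source): ATG/GTG/TTG start codons, the standard
# stop codons, and ORF length counted in amino acids excluding the stop codon.

START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_short_orfs(seq, min_aa=10, max_aa=50):
    """Return (strand, start, end, n_aa) tuples for ORFs in the length window.

    start/end are 0-based, end-exclusive coordinates on the scanned strand
    and include the stop codon. For each stop codon, only the longest ORF
    (the most upstream in-frame start) is reported.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # open an ORF at the first in-frame start
                elif start is not None and codon in STOP_CODONS:
                    n_aa = (i - start) // 3  # codons before the stop
                    if min_aa <= n_aa <= max_aa:
                        hits.append((strand, start, i + 3, n_aa))
                    start = None  # close the ORF and keep scanning
    return hits
```

For example, `find_short_orfs("ATG" + "GCT" * 11 + "TAA")` returns `[("+", 0, 39, 12)]`: a single 12-codon ORF on the forward strand. Candidates found this way would still need the comparative and statistical filtering the abstract describes before being called coding.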
Conclusions: We predict over two dozen new protein-coding genes per bacterial species, and crucially also quantify the uncertainty in this estimate. Our predictions for sRNA coding ORFs, along with predicted novel type I toxins and tools for sorting and visualizing genomic context, are freely available in a user-friendly format at http://disco-bac.web.pasteur.fr. We expect these easily accessible predictions to be a valuable tool for the study not only of bacterial sRNAs and type I toxin-antitoxin systems, but also of bacterial genetics and genomics.
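As a quick arithmetic check on the per-species figure, the numbers reported in the Results (409 ± 191.7 ORFs across 14 species) can be divided through directly. Spreading the interval evenly across species is a simplifying assumption made here, not a calculation the authors report:

```latex
\frac{409}{14} \approx 29.2, \qquad
\frac{409 - 191.7}{14} \approx 15.5, \qquad
\frac{409 + 191.7}{14} \approx 42.9
```

The central value matches the stated average of 29 per species.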
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5521070 | PMC |
http://dx.doi.org/10.1186/s12864-017-3932-y | DOI Listing |
Mol Ther Methods Clin Dev
March 2025
Department of Hematology, Leiden University Medical Center, Leiden, the Netherlands.
T cell-based immunotherapies targeting antigens on tumor cells have shown efficacy as anti-cancer treatments. While neoantigens are created by somatic mutations acquired during tumorigenesis, allogeneic stem cell transplantation as treatment for hematological malignancies exploits minor histocompatibility antigens encoded by genetic differences between patients and donors. Screening methods to predict neoantigens and minor histocompatibility antigens typically consider only conventional antigens created by nonsynonymous mutations or polymorphisms coding for amino acid changes in canonical open reading frames (ORFs).
Biol Lett
January 2025
Département de sciences biologiques, Université de Montréal, Montréal, QC, Canada.
Strict maternal inheritance of mitochondria is known to be the rule in animals, but over 100 species across six orders of bivalves possess doubly uniparental inheritance (DUI) of mitochondria. Under DUI, two distinctive sex-specific mitogenomes coexist. In marine and freshwater mussels, each mitogenome has an additional protein-coding gene, called the female-specific or male-specific open reading frame.
NAR Genom Bioinform
March 2025
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.
Evaluating the accuracy of protein-coding sequences in genome annotations is a challenging problem for which there is no broadly applicable solution. In this manuscript, we introduce PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to help assess the quality of protein-coding gene annotations. Utilizing a machine learning model trained on a diverse dataset from over 1000 plant and animal genomes, PSAURON assigns a score to coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein-coding region.
Int J Mol Sci
December 2024
School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China.
Long non-coding RNA (lncRNA) is a non-coding RNA longer than 200 nucleotides, crucial for functions like cell cycle regulation and gene transcription. Accurate localization prediction from sequence information is vital for understanding lncRNA's biological roles. Computational methods offer an effective alternative to traditional experimental methods for annotating lncRNA subcellular positions.
Int J Mol Sci
December 2024
College of Life Science, Shaanxi Normal University, Xi'an 710119, China.
Functional divergences of coding genes can be caused by divergences in their coding sequences and expression. However, whether and how expression divergences and coding sequence divergences coevolve is not clear. Gene expression divergences in differentiated cells and tissues recapitulate developmental models within a species, while gene expression divergences between analogous cells and tissues resemble traditional phylogenies in different species, suggesting that gene expression divergences are molecular traits that can be used for evolutionary studies.