Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task.

PLoS Comput Biol

School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America.

Published: October 2023

Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
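The three-nucleotide periodicity mentioned above can be illustrated with the classical Fourier approach to gene discovery. The sketch below is not LFNet and is not taken from the paper. It is a minimal NumPy example, under the assumption that a sequence is encoded as four binary indicator tracks (one per nucleotide), and it compares the spectral power at period 3 with the mean power across all nonzero frequencies. The function name `period3_signal_to_noise` is hypothetical.

```python
import numpy as np


def period3_signal_to_noise(seq: str) -> float:
    """Ratio of spectral power at period 3 to the mean power over nonzero frequencies.

    Illustrative sketch only: values well above 1 suggest the codon-like
    periodicity typical of coding sequences.
    """
    seq = seq.upper().replace("U", "T")
    # Trim to a multiple of 3 so that frequency n/3 falls exactly on a DFT bin.
    n = len(seq) - len(seq) % 3
    if n < 3:
        raise ValueError("sequence must contain at least 3 nucleotides")
    seq = seq[:n]

    total_power = np.zeros(n // 2 + 1)
    for base in "ACGT":
        # Binary indicator track: 1 where this base occurs, 0 elsewhere.
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        total_power += np.abs(np.fft.rfft(indicator)) ** 2

    # Skip bin 0 (the DC term), which reflects base composition, not periodicity.
    return float(total_power[n // 3] / total_power[1:].mean())
```

Calling `period3_signal_to_noise("ATG" * 50)` on a perfectly periodic toy sequence gives a ratio far above 1, whereas a random sequence of similar length typically stays near 1. A learned architecture such as LFNet would presumably replace this fixed statistic with trainable frequency-domain filters, but that detail is an assumption here rather than something stated in the abstract.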

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10597526
DOI: http://dx.doi.org/10.1371/journal.pcbi.1011526

Similar Publications

Idiopathic pulmonary fibrosis (IPF) is a rapidly progressive interstitial lung disease of unknown pathogenesis with no effective treatment currently available. Given the regulatory roles of lncRNAs (TP53TG1, LINC00342, H19, MALAT1, DNM3OS, MEG3), miRNAs (miR-218-5p, miR-126-3p, miR-200a-3p, miR-18a-5p, miR-29a-3p), and their target protein-coding genes (PTEN, TGFB2, FOXO3, KEAP1) in the TGF-β/SMAD3, Wnt/β-catenin, focal adhesion, and PI3K/AKT signaling pathways, we investigated the expression levels of selected genes in peripheral blood mononuclear cells (PBMCs) and lung tissue from patients with IPF. Lung tissue and blood samples were collected from 33 newly diagnosed, treatment-naive patients and 70 healthy controls.

Alzheimer's disease (AD), a progressive neurodegenerative disorder, is frequently associated with musculoskeletal complications, including sarcopenia and osteoporosis, which substantially impair patient quality of life. Despite these clinical observations, the molecular mechanisms linking AD to bone loss remain insufficiently explored. In this study, we examined the femoral bone microarchitecture and transcriptomic profiles of APP/PS1 transgenic mouse models of AD to elucidate the disease's impact on bone pathology and identify potential gene candidates associated with bone deterioration.

Organismal complexity strongly correlates with the number of protein families and domains.

Proc Natl Acad Sci U S A

February 2025

Duncan and Nancy MacMillan Cancer Immunology and Metabolism Center of Excellence, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ 08901.

In the pregenomic era, scientists were puzzled by the observation that haploid genome size (the C-value) did not correlate well with organismal complexity. This phenomenon, called the "C-value paradox," is mostly explained by the fact that protein-coding genes occupy only a small fraction of eukaryotic genomes. When the first genome sequences became available, scientists were even more surprised by the fact that the number of genes (G-value) was also a poor predictor of complexity, which gave rise to the "G-value paradox."

MAI-TargetFisher: A proteome-wide drug target prediction method synergetically enhanced by artificial intelligence and physical modeling.

Acta Pharmacol Sin

January 2025

Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, 201210, China.

Computational target identification plays a pivotal role in the drug development process. With the significant advancements of deep learning methods for protein structure prediction, the structural coverage of the human proteome has increased substantially. This progress inspired the development of the first genome-wide method for scanning small-molecule targets.

The complete mitochondrial genomes of and .

Mitochondrial DNA B Resour

January 2025

Department of Wildlife, Fisheries and Aquaculture, Mississippi State University, Mississippi State, Mississippi, USA.

We present a novel mitogenome assembly of the Redlip Shiner, , and assemblies for the Greenhead Shiner, (Cypriniformes: Leuciscidae). Both are charismatic minnows in the taxonomic group and are endemic to the eastern United States. The genome contains 16,711bp and 16,706bp each comprising a total of 13 protein coding genes, 22 tRNAs, two rRNAs, and a control region.
