Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
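The three-nucleotide periodicity mentioned above can be illustrated with a classical Fourier measurement. The following is a minimal sketch and not the LFNet architecture: it assumes a sequence is split into four binary base-indicator tracks and sums the spectral power of each track at frequency 1/3. The function name `period3_power` and the toy sequences are hypothetical and are used only for illustration.

```python
import cmath
import random


def period3_power(seq):
    """Length-normalized spectral power at frequency 1/3, summed over A, C, G, T.

    Each base is treated as a binary indicator track (1 where the base occurs,
    0 elsewhere), and the Fourier coefficient of that track is evaluated at a
    period of exactly three nucleotides.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Sum the complex phases at the positions where this base occurs.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / n


# Toy comparison (hypothetical data): a perfectly periodic sequence versus
# the same nucleotides in shuffled order.
periodic = "GCA" * 100
rng = random.Random(0)
chars = list(periodic)
rng.shuffle(chars)
shuffled = "".join(chars)

print("periodic:", period3_power(periodic))
print("shuffled:", period3_power(shuffled))
```

Shuffling preserves nucleotide composition but destroys the positional pattern, so the periodic sequence should score much higher. Real coding sequences show a weaker version of the same signal because codon usage biases base frequencies by codon position.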
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10597526 | PMC |
| http://dx.doi.org/10.1371/journal.pcbi.1011526 | DOI Listing |
Biochem Genet
January 2025
Bashkir State Medical University, Lenina Str. 3, Ufa, 450008, Russian Federation.
Idiopathic pulmonary fibrosis (IPF) is a rapidly progressive interstitial lung disease of unknown pathogenesis with no effective treatment currently available. Given the regulatory roles of lncRNAs (TP53TG1, LINC00342, H19, MALAT1, DNM3OS, MEG3), miRNAs (miR-218-5p, miR-126-3p, miR-200a-3p, miR-18a-5p, miR-29a-3p), and their target protein-coding genes (PTEN, TGFB2, FOXO3, KEAP1) in the TGF-β/SMAD3, Wnt/β-catenin, focal adhesion, and PI3K/AKT signaling pathways, we investigated the expression levels of selected genes in peripheral blood mononuclear cells (PBMCs) and lung tissue from patients with IPF. Lung tissue and blood samples were collected from 33 newly diagnosed, treatment-naive patients and 70 healthy controls.
Geroscience
January 2025
Division of Endocrinology, Department of Medicine, Augusta University, Augusta, GA, USA.
Alzheimer's disease (AD), a progressive neurodegenerative disorder, is frequently associated with musculoskeletal complications, including sarcopenia and osteoporosis, which substantially impair patient quality of life. Despite these clinical observations, the molecular mechanisms linking AD to bone loss remain insufficiently explored. In this study, we examined the femoral bone microarchitecture and transcriptomic profiles of APP/PS1 transgenic mouse models of AD to elucidate the disease's impact on bone pathology and identify potential gene candidates associated with bone deterioration.
Proc Natl Acad Sci U S A
February 2025
Duncan and Nancy MacMillan Cancer Immunology and Metabolism Center of Excellence, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ 08901.
In the pregenomic era, scientists were puzzled by the observation that haploid genome size (the C-value) did not correlate well with organismal complexity. This phenomenon, called the "C-value paradox," is mostly explained by the fact that protein-coding genes occupy only a small fraction of eukaryotic genomes. When the first genome sequences became available, scientists were even more surprised by the fact that the number of genes (G-value) was also a poor predictor of complexity, which gave rise to the "G-value paradox."
Acta Pharmacol Sin
January 2025
Shanghai Institute for Advanced Immunochemical Studies and School of Life Science and Technology, ShanghaiTech University, Shanghai, 201210, China.
Computational target identification plays a pivotal role in the drug development process. With the significant advancements of deep learning methods for protein structure prediction, the structural coverage of human proteome has increased substantially. This progress inspired the development of the first genome-wide small molecule targets scanning method.
Mitochondrial DNA B Resour
January 2025
Department of Wildlife, Fisheries and Aquaculture, Mississippi State University, Mississippi State, Mississippi, USA.
We present a novel mitogenome assembly of the Redlip Shiner and assemblies for the Greenhead Shiner (Cypriniformes: Leuciscidae). Both are charismatic minnows in the same taxonomic group and are endemic to the eastern United States. The genomes contain 16,711 bp and 16,706 bp, respectively, each comprising a total of 13 protein-coding genes, 22 tRNAs, two rRNAs, and a control region.