Lysine glutarylation is a post-translational modification (PTM) that plays a regulatory role in various physiological and biological processes. Identifying glutarylated peptides with proteomic techniques is expensive and time-consuming, so computational predictors are useful for rapid identification of glutarylation sites. In this study, we propose ProtTrans-Glutar, a model that classifies a site in a protein sequence as a positive or negative glutarylation site by combining traditional sequence-based features with features derived from a pre-trained transformer-based protein language model. The feature set combines the distribution feature (from composition/transition/distribution encoding), enhanced amino acid composition (EAAC), and embeddings from the ProtT5-XL-UniRef50 model. Together with random under-sampling and an XGBoost classifier, our model obtained recall, specificity, and AUC scores of 0.7864, 0.6286, and 0.7075, respectively, on an independent test set. The recall and AUC scores were notably higher than those of previous glutarylation prediction models evaluated on the same dataset. This high recall suggests that our method has the potential to identify new glutarylation sites and facilitate further research on the glutarylation process.
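As a rough illustration of the kind of pipeline this abstract describes (not the authors' released code), the sketch below pools per-residue ProtT5-XL-UniRef50 embeddings into a fixed-length feature vector, balances the classes with random under-sampling, and trains an XGBoost classifier; the pooling choice, hyperparameters, and function names are assumptions.

```python
# Hypothetical sketch: ProtT5 embeddings + random under-sampling + XGBoost.
# Hyperparameters and pooling are illustrative, not the paper's exact setup.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel
from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

def prott5_features(sequence: str) -> torch.Tensor:
    """Mean-pool the last hidden state over the residues of one peptide window."""
    # ProtT5 expects space-separated residues with rare amino acids mapped to X.
    prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
    batch = tokenizer(prepared, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (1, len + 1, 1024)
    return hidden[0, :-1].mean(dim=0)  # drop the trailing special token, average

def train(X, y):
    """X: stacked feature vectors (ProtT5 + sequence encodings); y: 0/1 labels."""
    X_bal, y_bal = RandomUnderSampler(random_state=42).fit_resample(X, y)
    clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
    clf.fit(X_bal, y_bal)
    return clf
```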
Full text: PMC http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9194472 | DOI http://dx.doi.org/10.3389/fgene.2022.885929
Nucleic Acids Res
January 2025
London Institute for Mathematical Sciences Royal Institution, 21 Albemarle St, London W1S 4BS, UK.
Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language model (GENA-LM), a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 000 base pairs.
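For orientation only, the snippet below sketches how a GENA-LM checkpoint could be used to embed a long DNA sequence through the Hugging Face transformers API; the hub identifier and the need for trust_remote_code are assumptions about how the model is distributed, not details stated in the abstract.

```python
# Illustrative sketch (assumed checkpoint name and loading options).
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "AIRI-Institute/gena-lm-bert-base"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

dna = "ACGT" * 2000  # toy 8 kb sequence; BPE tokenization compresses long inputs
inputs = tokenizer(dna, return_tensors="pt", truncation=True)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, n_tokens, hidden_size)
print(embeddings.shape)
```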
Appl Clin Inform
January 2025
Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany.
Objective: Commercially available large language models such as Chat Generative Pre-Trained Transformer (ChatGPT) cannot be applied to real patient data for data protection reasons. At the same time, de-identification of clinical unstructured data is a tedious and time-consuming task when done manually. Since transformer models can efficiently process and analyze large amounts of text data, our study aims to explore the impact of a large training dataset on the performance of this task.
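A minimal sketch of the task framing (transformer-based de-identification as token classification) is shown below; the checkpoint name is a hypothetical placeholder, and the study's own model, label set, and redaction rules may differ.

```python
# Sketch: replace detected PHI spans with their entity labels.
# "my-org/clinical-deid-model" is a hypothetical fine-tuned checkpoint.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="my-org/clinical-deid-model",  # placeholder, not a real model id
    aggregation_strategy="simple",
)

def deidentify(text: str) -> str:
    """Replace each detected PHI span with a bracketed entity tag."""
    redacted, offset = text, 0
    for ent in sorted(ner(text), key=lambda e: e["start"]):
        tag = f"[{ent['entity_group']}]"
        start, end = ent["start"] + offset, ent["end"] + offset
        redacted = redacted[:start] + tag + redacted[end:]
        offset += len(tag) - (end - start)
    return redacted

print(deidentify("Patient John Doe was admitted on 03/02/2024."))
```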
bioRxiv
December 2024
Simons Machine Learning Center, New York Structural Biology Center, New York, United States.
It is now possible to generate large volumes of high-quality images of biomolecules at near-atomic resolution and in near-native states using cryogenic electron microscopy/electron tomography (Cryo-EM/ET). However, the precise annotation of structures like filaments and membranes remains a major barrier towards applying these methods at high throughput. To address this, we present TARDIS (Transformer-based Rapid Dimensionless Instance Segmentation), a machine-learning framework for fast and accurate annotation of micrographs and tomograms.
Neural Netw
December 2024
University of British Columbia, ICICS/CS Building 201-2366 Main Mall, Vancouver, BC, Canada.
Many Transformer-based pre-trained models for code have been developed and applied to code-related tasks. In this paper, we analyze 519 papers published on this topic during 2017-2023, examine the suitability of model architectures for different tasks, summarize their resource consumption, and look at the generalization ability of models on different datasets. We examine three representative pre-trained models for code: CodeBERT, CodeGPT, and CodeT5, and conduct experiments on the four most frequently targeted software engineering tasks in the literature: Bug Fixing, Bug Detection, Code Summarization, and Code Search.
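As a hedged usage sketch for one of the surveyed models, the snippet below runs CodeT5 on code summarization; the checkpoint name is assumed to be a publicly released fine-tuned model, and decoding settings are illustrative rather than the paper's experimental configuration.

```python
# Sketch: CodeT5 for code summarization (assumed checkpoint name).
from transformers import RobertaTokenizer, T5ForConditionalGeneration

model_id = "Salesforce/codet5-base-multi-sum"  # assumed summarization checkpoint
tokenizer = RobertaTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(snippet, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=32)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```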
BioData Min
December 2024
Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, 60200, Czech Republic.
Background: Long terminal repeats (LTRs) represent important parts of LTR retrotransposons and retroviruses found in high copy numbers in a majority of eukaryotic genomes. LTRs contain regulatory sequences essential for the life cycle of the retrotransposon. Previous experimental and sequence studies have provided only limited information about LTR structure and composition, mostly from model systems.