Lysine glutarylation is a post-translational modification (PTM) that plays a regulatory role in various physiological and biological processes. Identifying glutarylated peptides with proteomic techniques is expensive and time-consuming, so computational models and predictors are useful for rapid identification of glutarylation. In this study, we propose a model called ProtTrans-Glutar that classifies a protein sequence as containing a positive or negative glutarylation site by combining traditional sequence-based features with features derived from a pre-trained transformer-based protein model. The feature vector was constructed by combining several feature sets, namely the distribution descriptor from composition/transition/distribution (CTD) encoding, enhanced amino acid composition (EAAC), and embeddings from the ProtT5-XL-UniRef50 model. Combined with random under-sampling and an XGBoost classifier, our model obtained recall, specificity, and AUC scores of 0.7864, 0.6286, and 0.7075, respectively, on an independent test set. The recall and AUC scores were notably higher than those of previous glutarylation prediction models evaluated on the same dataset. This high recall suggests that our method has the potential to identify new glutarylation sites and facilitate further research on the glutarylation process.
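To give a concrete picture of the pipeline described above, the sketch below outlines the feature-fusion and classification steps in Python. It is a minimal illustration under stated assumptions, not the authors' code: the helper names (eaac_features, prott5_embedding, build_features, train_and_evaluate), the library choices (numpy, imbalanced-learn, xgboost, scikit-learn), and the omission of the CTD distribution descriptor are all simplifications made for brevity.

```python
# Minimal sketch of a ProtTrans-Glutar-style pipeline (illustrative only).
import numpy as np
from imblearn.under_sampling import RandomUnderSampler  # random under-sampling
from xgboost import XGBClassifier
from sklearn.metrics import recall_score, roc_auc_score

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def eaac_features(seq: str, window: int = 5) -> np.ndarray:
    """Enhanced amino acid composition: amino-acid frequencies computed in a
    sliding window over a fixed-length peptide centered on the candidate lysine."""
    feats = []
    for start in range(len(seq) - window + 1):
        sub = seq[start:start + window]
        feats.extend(sub.count(aa) / window for aa in AMINO_ACIDS)
    return np.array(feats)

def prott5_embedding(seq: str) -> np.ndarray:
    """Placeholder for a per-sequence embedding from ProtT5-XL-UniRef50
    (e.g., mean-pooled residue embeddings obtained via the ProtTrans models)."""
    raise NotImplementedError("obtain embeddings from the ProtT5-XL-UniRef50 model")

def build_features(sequences):
    # Concatenate sequence-based features with the transformer embedding,
    # mirroring the feature fusion in the abstract; all peptides are assumed
    # to be fixed-length windows so the feature vectors align.
    return np.vstack([
        np.concatenate([eaac_features(s), prott5_embedding(s)])
        for s in sequences
    ])

def train_and_evaluate(X_train, y_train, X_test, y_test):
    # Balance the training set with random under-sampling, then fit XGBoost.
    X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
    clf = XGBClassifier(eval_metric="logloss")
    clf.fit(X_bal, y_bal)
    proba = clf.predict_proba(X_test)[:, 1]
    pred = (proba >= 0.5).astype(int)
    return {
        "recall": recall_score(y_test, pred),
        "specificity": recall_score(y_test, pred, pos_label=0),
        "auc": roc_auc_score(y_test, proba),
    }
```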


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9194472
DOI: http://dx.doi.org/10.3389/fgene.2022.885929

Publication Analysis

Top Keywords

pre-trained transformer-based (8); glutarylation sites (8); features derived (8); auc scores (8); glutarylation (7); features (5); model (5); prottrans-glutar incorporating (4); incorporating features (4); features pre-trained (4)

Similar Publications

Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language model (GENA-LM), a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 000 base pairs.


Objective:  Commercially available large language models such as Chat Generative Pre-Trained Transformer (ChatGPT) cannot be applied to real patient data for data protection reasons. At the same time, de-identification of clinical unstructured data is a tedious and time-consuming task when done manually. Since transformer models can efficiently process and analyze large amounts of text data, our study aims to explore the impact of a large training dataset on the performance of this task.


It is now possible to generate large volumes of high-quality images of biomolecules at near-atomic resolution and in near-native states using cryogenic electron microscopy/electron tomography (Cryo-EM/ET). However, the precise annotation of structures like filaments and membranes remains a major barrier towards applying these methods in high-throughput. To address this, we present TARDIS (Transformer-based Rapid Dimensionless Instance Segmentation), a machine-learning framework for fast and accurate annotation of micrographs and tomograms.


Many Transformer-based pre-trained models for code have been developed and applied to code-related tasks. In this paper, we analyze 519 papers published on this topic during 2017-2023, examine the suitability of model architectures for different tasks, summarize their resource consumption, and look at the generalization ability of models on different datasets. We examine three representative pre-trained models for code: CodeBERT, CodeGPT, and CodeT5, and conduct experiments on the four topmost targeted software engineering tasks from the literature: Bug Fixing, Bug Detection, Code Summarization, and Code Search.


Background: Long terminal repeats (LTRs) represent important parts of LTR retrotransposons and retroviruses found in high copy numbers in a majority of eukaryotic genomes. LTRs contain regulatory sequences essential for the life cycle of the retrotransposon. Previous experimental and sequence studies have provided only limited information about LTR structure and composition, mostly from model systems.

