Graph-Based Bidirectional Transformer Decision Threshold Adjustment Algorithm for Class-Imbalanced Molecular Data.

Nicole Hayes Ekaterina Merkurjev Guo-Wei Wei

ArXiv

Published: September 2024

Data sets with imbalanced class sizes, in which one class is much smaller than the others, occur frequently in many applications, including those with biological foundations such as disease diagnosis and drug discovery. It is therefore important to be able to identify data elements of classes of all sizes, as failure to do so can incur heavy costs. Nonetheless, many data classification procedures perform poorly on imbalanced data sets because they often fail to detect elements of underrepresented classes. In this work, we propose the BTDT-MBO algorithm, which combines Merriman-Bence-Osher (MBO) approaches, a bidirectional transformer, distance correlation, and decision threshold adjustments for data classification on highly imbalanced molecular data sets. The proposed technique integrates adjustments to the classification threshold of the MBO algorithm to help deal with class imbalance, and uses a bidirectional transformer procedure based on an attention mechanism for self-supervised learning. In addition, the model implements distance correlation as the weight function of the similarity graph on which the adjusted MBO algorithm operates. The proposed method is validated on six molecular data sets and compared to other related techniques. The computational experiments show that the proposed technique outperforms competing approaches even at high class imbalance ratios.
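The abstract names distance correlation as the weight function of the similarity graph but gives no formula. The sketch below shows the standard sample distance correlation (double-centered pairwise distances) and one way it could fill a graph weight matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and it assumes that the entries of two molecular feature vectors are treated as paired one-dimensional observations.

```python
import numpy as np


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences of a 1-D vector."""
    v = np.asarray(v, dtype=float)
    d = np.abs(v[:, None] - v[None, :])
    row_mean = d.mean(axis=1, keepdims=True)
    col_mean = d.mean(axis=0, keepdims=True)
    return d - row_mean - col_mean + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D vectors, in [0, 1]."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()      # squared sample distance covariance
    dvar_x = (a * a).mean()     # squared sample distance variance of x
    dvar_y = (b * b).mean()     # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        # A constant vector has zero distance variance; define the result as 0.
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def weight_matrix(features):
    """Hypothetical helper: symmetric graph weights w_ij = dCor(f_i, f_j).

    `features` is an (n_samples, n_features) array; the diagonal is left at 0.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w[i, j] = w[j, i] = distance_correlation(features[i], features[j])
    return w
```

Unlike Pearson correlation, distance correlation is non-negative, so it can be used directly as an edge weight without taking absolute values. The double loop costs O(n² d²) for n samples with d features, so a large data set would likely need a sparsified (for example, nearest-neighbor) graph; that choice is outside what the abstract specifies.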

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11213158	PMC

Similar Publications

SAMURAI: shallow analysis of copy number alterations using a reproducible and integrated bioinformatics pipeline.

Brief Bioinform

November 2024

Department of Biology, University of Padova, Via U. Bassi 58/B, 35131, Italy.

Sara Potente Diego Boscarino Dino Paladin Sergio Marchini Luca Beltrame

Shallow whole-genome sequencing (sWGS) offers a cost-effective approach to detect copy number alterations (CNAs). However, there remains a gap for a standardized workflow specifically designed for sWGS analysis. To address this need, in this work we present SAMURAI, a bioinformatics pipeline specifically designed for analyzing CNAs from sWGS data in a standardized and reproducible manner.

Heterogeneity analysis provides evidence for a genetically homogeneous subtype of bipolar-disorder.

PLoS One

January 2025

Department of Psychiatry, University of California San Diego, La Jolla, CA, United States of America.

Caroline C McGrouther Aaditya V Rangan Arianna Di Florio Jeremy A Elman Nicholas J Schork

Background: Bipolar Disorder (BD) is a complex disease. It is heterogeneous, both at the phenotypic and genetic level, although the extent and impact of this heterogeneity is not fully understood. One way to assess this heterogeneity is to look for patterns in the subphenotype data.

Exploration of the genetic landscape of bacterial dsDNA viruses reveals an ANI gap amid extensive mosaicism.

mSystems

January 2025

Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland.

Wanangwa Ndovie Jan Havránek Jade Leconte Janusz Koszucki Leonid Chindelevitch

Average nucleotide identity (ANI) is a widely used metric to estimate genetic relatedness, especially in microbial species delineation. While ANI calculation has been well optimized for bacteria and closely related viral genomes, accurate estimation of ANI below 80%, particularly in large reference data sets, has been challenging due to a lack of accurate and scalable methods. To bridge this gap, we introduce MANIAC, an efficient computational pipeline optimized for estimating ANI and alignment fraction (AF) in viral genomes with divergence around ANI of 70%.

pLM4CPPs: Protein Language Model-Based Predictor for Cell Penetrating Peptides.

J Chem Inf Model

January 2025

Department of Grain Science and Industry, Kansas State University, Manhattan, Kansas 66506, United States.

Nandan Kumar Zhenjiao Du Yonghui Li

Cell-penetrating peptides (CPPs) are short peptides capable of penetrating cell membranes, making them valuable for drug delivery and intracellular targeting. Accurate prediction of CPPs can streamline experimental validation in the lab. This study aims to assess pretrained protein language models (pLMs) for their effectiveness in representing CPPs and develop a reliable model for CPP classification.

MutualDTA: An Interpretable Drug-Target Affinity Prediction Model Leveraging Pretrained Models and Mutual Attention.

J Chem Inf Model

January 2025

School of Information Science & Engineering, Lanzhou University, Lanzhou 730000, China.

Yongna Yuan Siming Chen Rizhen Hu Xin Wang

Efficient and accurate drug-target affinity (DTA) prediction can significantly accelerate the drug development process. Recently, deep learning models have been widely applied to DTA prediction and have achieved notable success. However, existing methods often encounter several common issues: first, the data representations lack sufficient information; second, the extracted features are not comprehensive; and third, most methods lack interpretability when modeling drug-target binding.
