Data sets with imbalanced class sizes, where one class size is much smaller than that of others, occur exceedingly often in many applications, including those with biological foundations, such as disease diagnosis and drug discovery. Therefore, it is extremely important to be able to identify data elements of classes of various sizes, as a failure to do so can result in heavy costs. Nonetheless, many data classification procedures do not perform well on imbalanced data sets as they often fail to detect elements belonging to underrepresented classes. In this work, we propose the BTDT-MBO algorithm, incorporating Merriman-Bence-Osher (MBO) approaches and a bidirectional transformer, as well as distance correlation and decision threshold adjustments, for data classification tasks on highly imbalanced molecular data sets, where the sizes of the classes vary greatly. The proposed technique not only integrates adjustments in the classification threshold for the MBO algorithm in order to help deal with the class imbalance, but also uses a bidirectional transformer procedure based on an attention mechanism for self-supervised learning. In addition, the model implements distance correlation as a weight function for the similarity graph-based framework on which the adjusted MBO algorithm operates. The proposed method is validated using six molecular data sets and compared to other related techniques. The computational experiments show that the proposed technique is superior to competing approaches even in the case of a high class imbalance ratio.
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11213158 | PMC |
Brief Bioinform
November 2024
Department of Biology, University of Padova, Via U.Bassi 58/ B, 35131, Italy.
Shallow whole-genome sequencing (sWGS) offers a cost-effective approach to detect copy number alterations (CNAs). However, there remains a gap for a standardized workflow specifically designed for sWGS analysis. To address this need, in this work we present SAMURAI, a bioinformatics pipeline specifically designed for analyzing CNAs from sWGS data in a standardized and reproducible manner.
View Article and Find Full Text PDFPLoS One
January 2025
Department of Psychiatry, University of California San Diego, La Jolla, CA, United States of America.
Background: Bipolar Disorder (BD) is a complex disease. It is heterogeneous, both at the phenotypic and genetic level, although the extent and impact of this heterogeneity is not fully understood. One way to assess this heterogeneity is to look for patterns in the subphenotype data.
View Article and Find Full Text PDFmSystems
January 2025
Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland.
Average nucleotide identity (ANI) is a widely used metric to estimate genetic relatedness, especially in microbial species delineation. While ANI calculation has been well optimized for bacteria and closely related viral genomes, accurate estimation of ANI below 80%, particularly in large reference data sets, has been challenging due to a lack of accurate and scalable methods. To bridge this gap, we introduce MANIAC, an efficient computational pipeline optimized for estimating ANI and alignment fraction (AF) in viral genomes with divergence around ANI of 70%.
View Article and Find Full Text PDFJ Chem Inf Model
January 2025
Department of Grain Science and Industry, Kansas State University, Manhattan, Kansas 66506, United States.
Cell-penetrating peptides (CPPs) are short peptides capable of penetrating cell membranes, making them valuable for drug delivery and intracellular targeting. Accurate prediction of CPPs can streamline experimental validation in the lab. This study aims to assess pretrained protein language models (pLMs) for their effectiveness in representing CPPs and develop a reliable model for CPP classification.
View Article and Find Full Text PDFJ Chem Inf Model
January 2025
School of Information Science & Engineering, Lanzhou University, Lanzhou 730000, China.
Efficient and accurate drug-target affinity (DTA) prediction can significantly accelerate the drug development process. Recently, deep learning models have been widely applied to DTA prediction and have achieved notable success. However, existing methods often encounter several common issues: first, the data representations lack sufficient information; second, the extracted features are not comprehensive; and third, most methods lack interpretability when modeling drug-target binding.
View Article and Find Full Text PDFEnter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!