Improving ncRNA family prediction using multi-modal contrastive learning of sequence and structure.

Ruiting Xu, Dan Li, Wen Yang, Guohua Wang, Yang Li

Bioinformatics

College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China.

Published: November 2024

Recent advances in high-throughput sequencing have increased interest in non-coding RNA (ncRNA) research, but the functions of many ncRNAs remain unclear.
Traditional biological methods for predicting ncRNA families are resource-intensive, and most existing computational methods use either sequence data or secondary structure alone rather than both.
The proposed MM-ncRNAFP framework combines sequence data and secondary structure information through multi-modal contrastive learning, and the authors report that this significantly improves ncRNA family prediction performance.

Motivation: Recent advances in high-throughput sequencing technology have significantly increased the focus on non-coding RNA (ncRNA) research within the life sciences. Despite this, the functions of many ncRNAs remain poorly understood. Research suggests that ncRNAs within the same family typically share similar functions, so assigning an ncRNA to its family is an important step toward understanding its role. There are two primary approaches to predicting ncRNA families: biological and computational. Traditional biological methods are not suitable for large-scale prediction because they require substantial human effort and resources. Meanwhile, most existing computational methods rely either solely on ncRNA sequence data or exclusively on the secondary structure of ncRNA molecules. These methods fail to fully utilize the rich multimodal information available from ncRNAs, which prevents them from learning more comprehensive and in-depth feature representations.

Results: To tackle these problems, we propose MM-ncRNAFP, a multi-modal contrastive learning framework for ncRNA family prediction. We first used a pre-trained language model to encode the primary sequences of a large mammalian ncRNA dataset. Then, we adopted a contrastive learning framework with an attention mechanism to fuse these sequence representations with the secondary structure information obtained by graph neural networks. Experimental comparisons with several competitive baselines demonstrated that MM-ncRNAFP achieves more comprehensive representations of ncRNA features by integrating both sequence and structural information, and that this integration significantly enhances the performance of ncRNA family prediction. Ablation experiments and qualitative analyses were performed to verify the effectiveness of each component of the model. Moreover, since the model is pre-trained on a large amount of ncRNA data, it has the potential to bring significant improvements to other ncRNA-related tasks.
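
To make the idea of contrastive alignment between the two modalities concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss, a common choice for multi-modal contrastive learning. The abstract does not state the exact objective used by MM-ncRNAFP, so this should be read as an illustration under assumptions rather than the authors' implementation: the function names, the temperature value, and the random arrays standing in for language-model and graph-neural-network embeddings are all hypothetical, and the attention-based fusion step is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(seq_emb, struct_emb, temperature=0.1):
    """Symmetric InfoNCE loss between two views of the same batch.

    Assumption: row i of seq_emb and row i of struct_emb describe the same
    ncRNA (a positive pair); every other row in the batch is a negative.
    """
    z_seq = l2_normalize(seq_emb)
    z_struct = l2_normalize(struct_emb)
    logits = z_seq @ z_struct.T / temperature  # (batch, batch) similarities
    idx = np.arange(logits.shape[0])
    # Sequence -> structure: each row should score its own column highest.
    loss_seq_to_struct = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Structure -> sequence: each column should score its own row highest.
    loss_struct_to_seq = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_seq_to_struct + loss_struct_to_seq)


# Hypothetical stand-ins for encoder outputs (not real ncRNA embeddings).
rng = np.random.default_rng(0)
seq_emb = rng.normal(size=(8, 16))                   # "sequence" view
aligned = seq_emb + 0.01 * rng.normal(size=(8, 16))  # matching "structure" view
shuffled = aligned[::-1]                             # deliberately mismatched pairs

loss_aligned = symmetric_info_nce(seq_emb, aligned)
loss_shuffled = symmetric_info_nce(seq_emb, shuffled)
```

In this toy setup the loss is lower when each sequence embedding is paired with its own structure embedding than when the pairing is reversed, which is the signal a contrastive objective uses to pull the two modalities of the same ncRNA together.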

Availability and Implementation: MM-ncRNAFP and the datasets are available at https://github.com/xuruiting2/MM-ncRNAFP.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11639665
DOI: http://dx.doi.org/10.1093/bioinformatics/btae640
