Classifying molecules is of particular importance to drug discovery and several other use cases. Data in this domain can be partitioned into structural data and sequence/text data. Techniques such as deep learning can classify molecules and predict their functions using both types of data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of a molecule. However, using a molecule's structural information typically requires large amounts of computational power, with deep learning models that take a long time to train. In this study, we present an alternative approach to molecule classification that addresses these limitations. The approach uses natural language processing techniques, namely count vectorisation, term frequency-inverse document frequency (TF-IDF), word2vec and Latent Dirichlet Allocation (LDA), to feature engineer molecular text data. Through this approach, we aim to produce a robust and easily reproducible embedding that is fast to implement and depends solely on chemical (text) data such as the sequence of a protein. Further, we investigate the usefulness of these embeddings for machine learning models. We apply the techniques to two types of molecular text data: FASTA sequence data and Simplified Molecular Input Line Entry System (SMILES) data. We show that these embeddings provide excellent performance for classification.
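As a rough illustration of the count vectorisation and TF-IDF steps named in the abstract, the sketch below treats each sequence as a "document" and its overlapping character k-mers as "words". This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how sequences are tokenised, so the k-mer tokenisation, the function names (`kmers`, `count_vectorise`, `tfidf`), the smoothed IDF formula and the toy sequences are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (assumed tokenisation)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_vectorise(sequences, k=3):
    """Return a sorted k-mer vocabulary and a per-sequence count matrix."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    matrix = [[doc[term] for term in vocab] for doc in docs]
    return vocab, matrix


def tfidf(matrix):
    """Weight counts by a smoothed IDF, then L2-normalise each row."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: number of sequences containing each k-mer.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF (one common convention; an assumption here).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in matrix:
        scaled = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled)) or 1.0
        weighted.append([v / norm for v in scaled])
    return weighted


# Made-up protein fragments, used only to exercise the functions.
toy_sequences = ["MKTAYIAKQR", "MKTAYLAKQR", "GSHMLEDPVA"]
vocab, counts = count_vectorise(toy_sequences, k=3)
embedding = tfidf(counts)
print(len(vocab), "k-mers;", len(embedding), "embedded sequences")
```

Each row of `embedding` is a fixed-length vector that could be passed to a conventional classifier. The same functions accept SMILES strings unchanged, although a chemistry-aware tokeniser would likely be a better choice there than raw character k-mers.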

Source
http://dx.doi.org/10.1016/j.compbiolchem.2024.108056
