Publications by authors named "Jurgen Bajorath"

Here, we present a protocol to generate dual-target compounds (DT-CPDs) interacting with two distinct target proteins using a transformer-based chemical language model. We describe steps for installing software, preparing data, and pre-training the model on pairs of single-target compounds (ST-CPDs), which bind to an individual protein, and DT-CPDs. We then detail procedures for assembling ST- and corresponding DT-CPD data for specific protein pairs and evaluating the model's performance on hold-out test sets.

Explaining the predictions of machine learning models is of critical importance for integrating predictive modeling in drug discovery projects. We have generated a test system for predicting isoform selectivity of phosphoinositide 3-kinase (PI3K) inhibitors and systematically analyzed correct predictions of selective inhibitors using a new methodology termed MolAnchor, which is based on the "anchors" concept from explainable artificial intelligence. The approach is designed to generate chemically intuitive explanations of compound predictions.

Analogue series (AS) are generated during compound optimization in medicinal chemistry and are the major source of structure-activity relationship (SAR) information. Pairs of active AS consisting of compounds with corresponding substituents and comparable potency progression represent SAR transfer events for the same target or across different targets. We report a new computational approach to systematically search for SAR transfer series that combines an AS alignment algorithm with context-dependent similarity assessment based on vector embeddings adapted from natural language processing.

While data curation principles and practices are a major topic in data science, they are often not explicitly considered in machine learning (ML) applications in chemistry. We have been interested in evaluating the potential effects of data curation on the performance of molecular ML models. Therefore, a sequential curation scheme was developed for compounds and activity data, and different ML classification models were generated at increasing data confidence levels and evaluated.

Article Synopsis
  • Compound optimization in medicinal chemistry involves creating series of analogues to study structure-activity relationships (SARs), with a focus on improving potency.
  • A new computational method integrates a transformer chemical language model (CLM) with a SAR matrix (SARM) to generate potent analogues with modifications at various sites.
  • This methodology demonstrated its effectiveness by accurately predicting known potent compounds and producing diverse series through structural and substituent adjustments.

The Shapley value formalism from cooperative game theory was adapted to explain predictions of machine learning models. Here, we present a protocol to calculate and compare exact Shapley values for support vector machine models with commonly used kernels and binary input features. We describe steps for installing software, preparing data, and calculating Shapley values with customizable Python scripts.
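
To make the Shapley value definition concrete, the following is a minimal sketch that computes exact Shapley values by enumerating all feature coalitions. It is not the protocol's SVM-specific calculation; it is a brute-force illustration that is only feasible for a handful of features. The names `exact_shapley` and `toy_value`, and the toy weights, are hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values via full coalition enumeration (small n only)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i to this coalition
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


def toy_value(coalition):
    """Hypothetical value function standing in for a model output:
    additive feature weights plus one interaction between features 0 and 1."""
    weights = {0: 1.0, 1: 2.0, 2: 0.5}
    total = sum(weights[j] for j in coalition)
    if 0 in coalition and 1 in coalition:
        total += 1.0
    return total


phi = exact_shapley(toy_value, 3)
```

In this toy case the interaction term is split evenly between features 0 and 1, and the values sum to the difference between the full-coalition and empty-coalition outputs, which is the efficiency property of Shapley values.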

Over the past ~25 years, chemoinformatics has evolved as a scientific discipline, with a strong foundation in pharmaceutical research and scientific roots that can be traced back to the late 1950s. It covers a wide methodological spectrum and is perhaps best positioned in the greater context of chemical information science. Herein, the chemoinformatics discipline is delineated, characteristic (and partly problematic) features are discussed, and a global view of the field is provided, emphasizing key developments.

In drug discovery, human protein kinases (PKs) represent one of the major target classes due to their central role in cellular signaling, implication in various diseases as a consequence of deregulated signaling, and notable druggability. Individual PKs and their disease biology have been explored to different degrees, giving rise to heterogeneous functional knowledge and disease associations across the human kinome. The U.

Given their central role in signal transduction, protein kinases (PKs) were first implicated in cancer development, caused by aberrant intracellular signaling events. Since then, PKs have become major targets in different therapeutic areas. The preferred approach to therapeutic intervention of PK-dependent diseases is the use of small molecules to inhibit their catalytic phosphate group transfer activity.

Generating potent compounds for evolving analogue series (AS) is a key challenge in medicinal chemistry. The versatility of chemical language models (CLMs) makes it possible to formulate this challenge as an off-the-beaten-path prediction task. In this work, we have devised a coding and tokenization scheme for evolving AS with multiple substitution sites (multi-site AS) and implemented a bidirectional transformer to predict new potent analogues for such series.

The growing number of scientific papers and document sources underscores the need for methods capable of evaluating the quality of publications. Researchers who are looking for relevant papers for their studies need ways to assess the scientific value of these documents. One approach involves using semantic search engines that can automatically extract important knowledge from the growing body of text.

Most machine learning (ML) methods produce predictions that are hard or impossible to understand. The black box nature of predictive models obscures potential learning bias and makes it difficult to recognize and trace problems. Moreover, the inability to rationalize model decisions causes reluctance to accept predictions for experimental design.

Article Synopsis
  • This research explores using deep learning models, originally from natural language processing, to predict active compounds by translating sequential molecular data, focusing on chemical language models for compound transformations.
  • A unique dual-component language model was created that combines a protein language model to generate sequence embeddings and a conditional transformer to predict new active compounds based on desired potency values.
  • The model showed success by reproducing known compounds with various potencies and generated a diverse array of candidate compounds, suggesting its potential for practical applications in compound design and development.

The continued growth of data from biological screening and medicinal chemistry provides opportunities for data-driven experimental design and decision making in early-phase drug discovery. Approaches adopted from data science help to integrate internal and public domain data and extract knowledge from historical in-house data. Protein kinase (PK) drug discovery is an exemplary area where large amounts of data are accumulating, providing a valuable knowledge base for discovery projects.

Shapley values from cooperative game theory are adapted for explaining machine learning predictions. For large feature sets used in machine learning, Shapley values are approximated. We present a protocol for two techniques for explaining support vector machine predictions with exact Shapley value computation.

Protein kinases (PKs) are involved in many intracellular signal transduction pathways through phosphorylation cascades and have become intensely investigated pharmaceutical targets over the past two decades. Inhibition of PKs using small-molecule inhibitors is a premier strategy for the treatment of diseases in different therapeutic areas that are caused by uncontrolled PK-mediated phosphorylation and aberrant signaling. Most PK inhibitors (PKIs) are directed against the ATP cofactor binding site that is largely conserved across the human kinome comprising 518 wild-type PKs (and many mutant forms).

The assessment of prediction variance or uncertainty contributes to the evaluation of machine learning models. In molecular machine learning, uncertainty quantification is an evolving area of research where currently no standard approaches or general guidelines are available. We have carried out a detailed analysis of deep neural network variants and simple control models for compound potency prediction to study relationships between prediction accuracy and uncertainty.
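
As one illustration of what a per-compound uncertainty estimate can look like, the sketch below uses the spread of predictions across an ensemble of models. This is an assumption chosen for illustration, not necessarily the approach analyzed in the study; the function name `ensemble_prediction` and the example potency values are hypothetical.

```python
from statistics import mean, pstdev


def ensemble_prediction(member_predictions):
    """Return (mean prediction, population standard deviation) for one compound.

    The standard deviation across ensemble members serves as a simple
    uncertainty estimate: larger spread means less agreement among models.
    """
    return mean(member_predictions), pstdev(member_predictions)


# Hypothetical potency predictions (log units) from four ensemble members
members = [6.1, 6.3, 5.9, 6.1]
mu, sigma = ensemble_prediction(members)
```

Comparing `sigma` with the absolute prediction error over many compounds is one way to check whether higher uncertainty actually tracks lower accuracy.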

In drug discovery, chemical language models (CLMs) originating from natural language processing offer new opportunities for molecular design. CLMs have been developed using recurrent neural network (RNN) or transformer architectures. For the predictive performance of RNN-based encoder-decoder frameworks and transformers, attention mechanisms play a central role.

The use of black box machine learning models whose decisions cannot be understood limits the acceptance of predictions in interdisciplinary research and camouflages artificial learning characteristics leading to predictions for other than anticipated reasons. Consequently, there is increasing interest in explainable artificial intelligence to rationalize predictions and uncover potential pitfalls. Among others, relevant approaches include feature attribution methods to identify molecular structures determining predictions and counterfactuals (CFs) or contrastive explanations.

Machine learning (ML) algorithms are extensively used in pharmaceutical research. Most ML models have black-box character, thus preventing the interpretation of predictions. However, rationalizing model decisions is of critical importance if predictions are to aid in experimental design.

Potency predictions are popular in compound design and optimization but are complicated by intrinsic limitations. Moreover, even for nonlinear methods, activity cliffs (ACs, formed by structural analogues with large potency differences) represent challenging test cases for compound potency predictions. We have devised a new test system for potency predictions, including AC compounds, that is based on partitioned matched molecular pairs (MMP) and makes it possible to monitor prediction accuracy at the level of analogue pairs with increasing potency differences.
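
The idea of monitoring prediction accuracy for analogue pairs with increasing potency differences can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' test system: the pairs are assumed to be given already (no MMP fragmentation is performed), the bin edges of 1 and 2 log units are arbitrary choices, and `error_by_potency_difference` and the example values are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def error_by_potency_difference(pairs, bin_edges):
    """Mean absolute prediction error per potency-difference bin.

    pairs: iterable of (potency_a, potency_b, predicted_a, predicted_b)
    bin_edges: ascending thresholds on |potency_a - potency_b| (log units)
    Returns {bin_index: mean absolute error over both compounds of each pair}.
    """
    errors = defaultdict(list)
    for pot_a, pot_b, pred_a, pred_b in pairs:
        # Bin index 0 holds pairs below the first edge, and so on
        bin_index = bisect_right(bin_edges, abs(pot_a - pot_b))
        errors[bin_index].append(abs(pred_a - pot_a))
        errors[bin_index].append(abs(pred_b - pot_b))
    return {b: sum(v) / len(v) for b, v in sorted(errors.items())}


# Hypothetical analogue pairs: (measured A, measured B, predicted A, predicted B)
pairs = [
    (6.0, 6.5, 6.2, 6.4),  # small potency difference
    (5.0, 6.5, 5.5, 6.0),  # intermediate potency difference
    (5.0, 8.0, 6.0, 7.0),  # large difference, activity cliff-like pair
]
mae_per_bin = error_by_potency_difference(pairs, bin_edges=[1.0, 2.0])
```

Reporting the error per bin rather than one global value shows whether accuracy degrades as the potency difference within pairs grows, which is where activity cliffs are found.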

Compound potency predictions play a major role in computational drug discovery. Predictive methods are typically evaluated and compared in benchmark calculations that are widely applied. Previous studies have revealed intrinsic limitations of potency prediction benchmarks including very similar performance of increasingly complex machine learning methods and simple controls and narrow error margins separating machine learning from randomized predictions.

Aim: Generation of high-quality data sets of protein kinase inhibitors (PKIs).

Methodology: Publicly available PKIs with reliable activity data were curated. PKIs with very weak activity were classified as inactive.

For many machine learning applications in drug discovery, only limited amounts of training data are available. This typically applies to compound design and activity prediction and often restricts machine learning, especially deep learning. For low-data applications, specialized learning strategies can be considered to limit required training data.
