Publications by authors named "Roger Sayle"

Identifying and purchasing new small molecules to test in biological assays enables ligand discovery, but as purchasable chemical space continues to grow into the tens of billions of inexpensive make-on-demand compounds, simply searching this space becomes a major challenge. We have therefore developed ZINC20, a new version of ZINC with two major new features: billions of new molecules and new methods to search them. As a fully enumerated database, ZINC can be searched precisely using explicit atomic-level graph-based methods, such as SmallWorld for similarity and Arthor for pattern and substructure search, as well as 3D methods such as docking.

The symbols for the new IUPAC elements named in November 2016 can introduce subtle ambiguities within cheminformatics software. The ambiguities are described and demonstrated by highlighting inconsistencies between software when handling existing element symbols.

Aim: The assumption in scaffold hopping is that changing the scaffold does not change the binding mode and the same structure-activity relationships (SARs) are seen for substituents decorating each scaffold. Results/methodology: We present the use of matched series analysis, an extension of matched molecular pair analysis, to automate the analysis of a project's data and detect the presence or absence of comparable SAR between chemical series.

Conclusion: The presence of SAR transfer can confirm the perceived binding mode overlay of different chemotypes or suggest new arrangements between scaffolds that may have gone unnoticed.

Background: The concept of molecular similarity is one of the central ideas in cheminformatics, despite the fact that it is ill-defined and rather difficult to assess objectively. Here we propose a practical definition of molecular similarity in the context of drug discovery: molecules A and B are similar if a medicinal chemist would be likely to synthesise and test them around the same time as part of the same medicinal chemistry program. The attraction of such a definition is that it matches one of the key uses of similarity measures in early-stage drug discovery.

Awareness of the adverse effects of chemicals is important in biomedical research and healthcare. Text mining can allow timely and low-cost extraction of this knowledge from the biomedical literature. We extended our text mining solution, LeadMine, to identify diseases and chemical-induced disease relationships (CIDs).

Multiple recent studies have focused on unraveling the content of the medicinal chemist's toolbox. Here, we present an investigation of chemical reactions and molecules retrieved from U.S.

Finding a canonical ordering of the atoms in a molecule is a prerequisite for generating a unique representation of the molecule. The canonicalization of a molecule is usually accomplished by applying some sort of graph relaxation algorithm, the most common of which is the Morgan algorithm. There are known issues with that algorithm that lead to noncanonical atom orderings as well as problems when it is applied to large molecules like proteins.
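
The relaxation step mentioned above can be illustrated with a minimal sketch of the classic Morgan-style extended-connectivity refinement. This is not the improved method the abstract refers to; the function name `morgan_values` and the adjacency-dictionary input are assumptions made for illustration only.

```python
def morgan_values(adjacency):
    """Refine per-atom invariants by repeatedly summing neighbour values.

    adjacency: dict mapping atom index -> list of neighbouring atom indices
    (a hypothetical input format chosen for this sketch).
    """
    # Initial invariant: the number of neighbours (degree) of each atom.
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # Extended connectivity: each atom takes the sum of its neighbours' values.
        new_values = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        new_classes = len(set(new_values.values()))
        # Stop as soon as an iteration no longer separates more atom classes.
        if new_classes <= n_classes:
            break
        values, n_classes = new_values, new_classes
    return values


# Heavy-atom graph of n-pentane: a chain 0-1-2-3-4.
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
pentane_values = morgan_values(pentane)
```

For the pentane chain, symmetry-equivalent atoms (0 and 4, 1 and 3) end up with equal values, so a further tie-breaking step is still needed to obtain a full ordering. That residual ambiguity is one reason simple relaxation alone does not guarantee a canonical numbering.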

Background: Chemical entity recognition has traditionally been performed by machine learning approaches. Here we describe an approach using grammars and dictionaries. This approach has the advantage that the entities found can be directly related to a given grammar or dictionary, which allows the type of an entity to be known and, if an entity is misannotated, indicates which resource should be corrected.

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents.

Fingerprint methods applied to molecules have proven to be useful for similarity determination and as inputs to machine-learning models. Here, we present the development of a new fingerprint for chemical reactions and validate its usefulness in building machine-learning models and in similarity assessment. Our final fingerprint is constructed as the difference of the atom-pair fingerprints of products and reactants and includes agents via calculated physicochemical properties.
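
The "difference of atom-pair fingerprints" idea can be sketched in a few lines. The sketch below is a simplified illustration, not the published fingerprint: atoms are typed by element symbol only, molecules are passed as hypothetical `(atoms, bonds)` tuples, and the agent contribution via physicochemical properties is omitted entirely.

```python
from collections import Counter, deque
from itertools import combinations


def topological_distances(n_atoms, bonds):
    """Breadth-first search from every atom; returns dist[src][dst] in bonds."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    dist = {}
    for src in range(n_atoms):
        seen = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen[nxt] = seen[cur] + 1
                    queue.append(nxt)
        dist[src] = seen
    return dist


def atom_pair_counts(atoms, bonds):
    """Count (type_a, type_b, distance) triples over all connected atom pairs."""
    dist = topological_distances(len(atoms), bonds)
    counts = Counter()
    for i, j in combinations(range(len(atoms)), 2):
        if j in dist[i]:
            a, b = sorted((atoms[i], atoms[j]))
            counts[(a, b, dist[i][j])] += 1
    return counts


def reaction_difference_fp(reactants, products):
    """Product atom-pair counts minus reactant atom-pair counts."""
    diff = Counter()
    for atoms, bonds in products:
        diff.update(atom_pair_counts(atoms, bonds))
    for atoms, bonds in reactants:
        diff.subtract(atom_pair_counts(atoms, bonds))
    # Keep only the features that actually change across the reaction.
    return {key: value for key, value in diff.items() if value != 0}


# Toy ether formation on heavy atoms only: C-O + C-Cl -> C-O-C + Cl
reactants = [(["C", "O"], [(0, 1)]), (["C", "Cl"], [(0, 1)])]
products = [(["C", "O", "C"], [(0, 1), (1, 2)]), (["Cl"], [])]
toy_fp = reaction_difference_fp(reactants, products)
```

Features that are unchanged by the reaction cancel out, so the remaining signed counts describe only what was made (positive) and what was lost (negative).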

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources.

A matched molecular series is the general form of a matched molecular pair and refers to a set of two or more molecules with the same scaffold but different R groups at the same position. We describe Matsy, a knowledge-based method that uses matched series to predict R groups likely to improve activity given an observed activity order for some R groups. We compare the Matsy predictions based on activity data from ChEMBLdb to the recommendations of the Topliss tree and carry out a large scale retrospective test to measure performance.
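
As a rough illustration of predicting R groups from an observed activity order, the sketch below scans a collection of matched series for those that reproduce the observed order and tallies how often each additional R group beats the best observed one. The data layout, the function name `predict_r_groups`, and the scoring rule are assumptions for this example and should not be read as the statistics Matsy actually uses.

```python
from collections import Counter


def predict_r_groups(observed_order, database_series):
    """Rank unseen R groups by how often they improve on the best observed one.

    observed_order: R groups listed from most to least active.
    database_series: list of dicts mapping R group -> activity (higher is better);
    each dict stands for one matched series (hypothetical input format).
    """
    improved = Counter()
    seen = Counter()
    for series in database_series:
        # The series must contain every observed R group...
        if not all(r in series for r in observed_order):
            continue
        activities = [series[r] for r in observed_order]
        # ...and show them in the same strictly decreasing activity order.
        if any(a <= b for a, b in zip(activities, activities[1:])):
            continue
        best = activities[0]
        for r_group, activity in series.items():
            if r_group in observed_order:
                continue
            seen[r_group] += 1
            if activity > best:
                improved[r_group] += 1
    # Report (R group, fraction of series improved, number of series seen).
    ranked = [(r, improved[r] / seen[r], seen[r]) for r in seen]
    return sorted(ranked, key=lambda item: (-item[1], -item[2], item[0]))


# Invented activity values, for illustration only.
toy_database = [
    {"H": 5.0, "F": 5.5, "Cl": 6.0, "Br": 6.5, "Me": 5.2},
    {"H": 6.0, "F": 6.2, "Cl": 7.0, "Br": 6.8},
    {"H": 7.0, "F": 6.0, "Cl": 5.0, "Br": 8.0},  # different order, ignored
    {"H": 4.0, "Cl": 5.0, "Br": 6.0},            # lacks F, ignored
]
suggestions = predict_r_groups(["Cl", "F", "H"], toy_database)
```

Longer observed series match fewer database series but constrain the prediction more tightly, which is the trade-off any such lookup has to balance.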

In protein crystallization, as well as in many other fields, it is known that the pH at which experiments are performed is often the key factor in the success or failure of the trials. With the trend towards plate-based high-throughput experimental techniques, measuring the pH values of solutions one by one becomes prohibitively time- and reagent-expensive. As part of an HT crystallization facility, a colour-based pH assay that is rapid, uses very little reagent and is suitable for 96-well or higher density plates has been developed.

When crystallization screening is conducted, many outcomes are observed, but typically the only trial recorded in the literature is the condition that yielded the crystal(s) used for subsequent diffraction studies. The initial hit that was optimized and the results of all the other trials are lost. These missing results contain information that would be useful for an improved general understanding of crystallization.

The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc.

It appears so simple at first glance: "tautomers are isomers of organic compounds that readily interconvert, usually by the migration of hydrogen from one atom to another". If a chemist can describe the problem so succinctly, one might question why the complication of tautomerism remains a considerable challenge to cheminformatics and computer-assisted drug design. With a half-century of experience with representing molecules in computers, and almost limitless modern computational power, the problem should have been solved by now.

Chemical compound names remain the primary method for conveying molecular structures between chemists and researchers. In research articles, patents, chemical catalogues, government legislation, and textbooks, the use of IUPAC and traditional compound names is universal, despite efforts to introduce more machine-friendly representations such as identifiers and line notations. Fortunately, advances in computing power now allow chemical names to be parsed and generated (read and written) with almost the same ease as conventional connection tables.

We apply a recently published method of text-based molecular similarity searching (LINGO) to standard data sets for the purpose of quantifying the accuracy of the approach. Our implementation is based on a pattern-matching finite state machine (FSM) which results in fast search times. The accuracy of LINGO is demonstrated to be comparable to that of a path-based fingerprint and offers a simple yet effective method for similarity searching.
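
The core idea of text-based similarity can be sketched as comparing the overlapping substrings of two SMILES strings. The version below is a simplified variant: it uses a plain multiset Tanimoto over 4-character substrings, skips any SMILES normalisation, and does not use the finite state machine described in the abstract. The function names are assumptions for this example.

```python
from collections import Counter


def lingos(smiles, q=4):
    """Multiset of all overlapping substrings of length q in a SMILES string."""
    if len(smiles) < q:
        # Strings shorter than q contribute themselves as a single fragment.
        return Counter([smiles])
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_similarity(smiles_a, smiles_b, q=4):
    """Multiset Tanimoto coefficient between two substring profiles."""
    profile_a = lingos(smiles_a, q)
    profile_b = lingos(smiles_b, q)
    shared = sum((profile_a & profile_b).values())  # per-fragment minimum counts
    total = sum((profile_a | profile_b).values())   # per-fragment maximum counts
    return shared / total if total else 0.0


identical = lingo_similarity("c1ccccc1O", "c1ccccc1O")
related = lingo_similarity("CCCCO", "CCCCN")
```

Because the comparison works directly on strings, the result depends on how the SMILES were written, so in practice both inputs would need to come from the same canonicalisation procedure.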

A method is presented for enumerating a large number of isosteric analogues of a ligand from a known protein-ligand complex structure and then rapidly calculating an estimate of their binding energies. This approach takes full advantage of the observed crystal structure, by reusing the atomic co-ordinates determined experimentally for one ligand, to approximate those of similar compounds that have approximately the same shape. By assuming that compounds with similar shapes adopt similar binding poses, and that entropic and protein flexibility effects are approximately constant across such an isosteric series ("the frozen ligand approximation"), it is possible to order their binding affinities relatively accurately.
