While data curation principles and practices are a major topic in data science, they are often not explicitly considered in machine learning (ML) applications in chemistry. We have been interested in evaluating the potential effects of data curation on the performance of molecular ML models. Therefore, a sequential curation scheme was developed for compounds and activity data, and different ML classification models were generated at increasing data confidence levels and evaluated.
The Shapley value formalism from cooperative game theory was adapted to explain predictions of machine learning models. Here, we present a protocol to calculate and compare exact Shapley values for support vector machine models with commonly used kernels and binary input features. We describe steps for installing software, preparing data, and calculating Shapley values with customizable Python scripts.
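To make the notion of an exact Shapley value concrete, the following is a minimal sketch that enumerates every feature coalition for a small binary input. It is not the protocol's own kernel-specific method or its scripts; the function `exact_shapley`, the toy model `toy_value`, and the example vector and weights are all hypothetical, and features outside a coalition are assumed to be treated as absent.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values by brute-force enumeration of all coalitions.

    value_fn maps a frozenset of feature indices to a model output.
    The cost grows as O(2^n), so this is only practical for small n.
    """
    phi = [0.0] * n_features
    n_fact = factorial(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight: |S|! * (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n_features - size - 1) / n_fact
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical binary fingerprint and toy model (assumptions for illustration)
x = [1, 0, 1, 1]
weights = [0.5, -1.0, 2.0, 0.25]


def toy_value(coalition):
    # Features not in the coalition are treated as absent (contribute nothing)
    linear = sum(weights[j] * x[j] for j in coalition)
    # Pairwise interaction that only appears when features 0 and 2 are both present
    interaction = 1.5 if {0, 2} <= coalition else 0.0
    return linear + interaction


shapley_values = exact_shapley(toy_value, len(x))
```

In this toy setting the interaction term is shared equally between features 0 and 2, and the values sum to the difference between the output for the full feature set and the output for the empty set, which is a quick sanity check for any exact implementation.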
Over the past ~25 years, chemoinformatics has evolved as a scientific discipline, with a strong foundation in pharmaceutical research and scientific roots that can be traced back to the late 1950s. It covers a wide methodological spectrum and is perhaps best positioned in the greater context of chemical information science. Herein, the chemoinformatics discipline is delineated, characteristic (and partly problematic) features are discussed, and a global view of the field is provided, emphasizing key developments.
In drug discovery, human protein kinases (PKs) represent one of the major target classes due to their central role in cellular signaling, implication in various diseases as a consequence of deregulated signaling, and notable druggability. Individual PKs and their disease biology have been explored to different degrees, giving rise to heterogeneous functional knowledge and disease associations across the human kinome. The U.