Scalable partitioning and exploration of chemical spaces using geometric hashing.

J Chem Inf Model

Department of Computational Biology, University of Southern California, Los Angeles, 90089, USA.

Published: April 2006

Virtual screening (VS) has become a preferred tool to augment high-throughput screening(1) and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm(2) called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249 071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
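The hashing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which hash family or parameters the authors used, so the code below assumes a generic random-projection (p-stable) scheme for Euclidean distance over real-valued descriptor vectors. The class name `EuclideanLSH` and the parameters `n_tables`, `n_projections`, `bucket_width`, and `min_neighbors` are hypothetical choices made for illustration, not values from the paper.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Toy random-projection LSH index (illustrative; not the paper's code)."""

    def __init__(self, dim, n_tables=8, n_projections=4, bucket_width=4.0, seed=0):
        # All defaults here are assumed values chosen only for the example.
        rng = random.Random(seed)
        self.bucket_width = bucket_width
        self.points = []
        self.tables = []
        for _ in range(n_tables):
            # Each table gets its own Gaussian directions and random offsets.
            directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                          for _ in range(n_projections)]
            offsets = [rng.uniform(0.0, bucket_width) for _ in range(n_projections)]
            self.tables.append((directions, offsets, defaultdict(list)))

    def _key(self, directions, offsets, x):
        # Project onto each direction and quantize; points that are close in
        # descriptor space tend to receive the same tuple of bucket numbers.
        return tuple(
            math.floor((sum(a * v for a, v in zip(d, x)) + b) / self.bucket_width)
            for d, b in zip(directions, offsets)
        )

    def add(self, x):
        """Store a descriptor vector and return its integer index."""
        idx = len(self.points)
        self.points.append(list(x))
        for directions, offsets, buckets in self.tables:
            buckets[self._key(directions, offsets, x)].append(idx)
        return idx

    def candidates(self, x):
        """Indices of stored points sharing a bucket with x in any table."""
        found = set()
        for directions, offsets, buckets in self.tables:
            found.update(buckets.get(self._key(directions, offsets, x), ()))
        return found

    def query(self, x, k=1):
        """Approximate k nearest neighbors: rank only the colliding points."""
        ranked = sorted(self.candidates(x),
                        key=lambda i: math.dist(self.points[i], x))
        return ranked[:k]

    def is_sparse(self, x, min_neighbors=1):
        """Outlier heuristic: few collisions suggests a sparse region.

        If x itself is stored in the index, it counts as one collision.
        """
        return len(self.candidates(x)) < min_neighbors
```

A query computes exact distances only for the points that collide with it, which is where the speedup over an exhaustive k-nearest-neighbor scan comes from. Because each bucket is an independent partition of the data, the same structure also suggests how neighbor classification could be split across workers. The trade-off is that true neighbors falling into different buckets in every table are missed, so results are approximate.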


Source
http://dx.doi.org/10.1021/ci050403o


