Approximate nearest neighbor graph provides fast and efficient embedding with applications for large-scale biological data.

NAR Genom Bioinform

Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, 225 North Avenue NW, Atlanta, GA, 30332, USA.

Published: December 2024

Dimension reduction (DR or embedding) algorithms such as t-SNE and UMAP have many applications in big data visualization but remain slow for large datasets. Here, we further improve the UMAP-like algorithms by (i) combining several aspects of t-SNE and UMAP to create a new DR algorithm; (ii) replacing its rate-limiting step, the K-nearest neighbor graph (K-NNG), with a Hierarchical Navigable Small World (HNSW) graph; and (iii) extending the functionality to DNA/RNA sequence data by combining HNSW with locality sensitive hashing algorithms (e.g. MinHash) for distance estimations among sequences. We also provide additional features including computation of local intrinsic dimension and hubness, which can reflect structures and properties of the underlying data that strongly affect the K-NNG accuracy, and thus the quality of the resulting embeddings. Our library, called annembed, is implemented, and fully parallelized in Rust and shows competitive accuracy compared to the popular UMAP-like algorithms. Additionally, we showcase the usefulness and scalability of our library with three real-world examples: visualizing a large-scale microbial genomic database, visualizing single-cell RNA sequencing data and metagenomic contig (or population) binning. Therefore, annembed can facilitate DR for several tasks for biological data analysis where distance computation is expensive or when there are millions to billions of data points to process.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655291PMC
http://dx.doi.org/10.1093/nargab/lqae172DOI Listing

Publication Analysis

Top Keywords

neighbor graph
8
biological data
8
t-sne umap
8
umap-like algorithms
8
data
7
approximate nearest
4
nearest neighbor
4
graph fast
4
fast efficient
4
efficient embedding
4

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!