Massive data clustering by multi-scale psychological observations.

Natl Sci Rev

National Engineering Laboratory of Big Data Analytics, Xi'an Jiaotong University, Xi'an 710049, China.

Published: February 2022

Clustering, the discovery of latent group structure in data, is a fundamental problem in artificial intelligence and a vital procedure in data-driven scientific research across all disciplines. Yet existing methods have various limitations, especially weak cognitive interpretability and poor computational scalability, when clustering the massive datasets that are increasingly available in all domains. Here, by simulating the multi-scale cognitive observation process of humans, we design a scalable algorithm to detect clusters hierarchically hidden in massive datasets. The observation scale changes following the Weber-Fechner law, so as to capture the gradually emerging meaningful grouping structure. We validated our approach on real datasets with up to a billion records and 2000 dimensions, including taxi trajectories, single-cell gene expression, face images, computer logs and audio recordings. Our approach outperformed popular methods in usability, efficiency, effectiveness and robustness across different domains.
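The abstract does not spell out the algorithm, so the following is only a toy illustration of the general idea of observing data at scales that follow the Weber-Fechner law, not the authors' method. It assumes that "following the Weber-Fechner law" can be read as raising the observation scale by a constant relative step (Weber's law: the just-noticeable change is proportional to the current magnitude), which gives a geometric schedule of scales. The 1-D gap-based grouping rule and all names (`weber_fechner_scales`, `clusters_at_scale`, `multi_scale_hierarchy`, `weber_fraction`) are hypothetical and chosen for illustration.

```python
def weber_fechner_scales(sigma0, weber_fraction, sigma_max):
    """Geometric scale schedule (assumption): each step raises the scale by a
    constant relative amount, i.e. delta_sigma / sigma = weber_fraction."""
    scales = []
    sigma = sigma0
    while sigma <= sigma_max:
        scales.append(sigma)
        sigma *= 1.0 + weber_fraction
    return scales


def clusters_at_scale(sorted_points, sigma):
    """Toy grouping rule for sorted 1-D points: consecutive points whose gap
    is smaller than sigma are placed in the same cluster."""
    if not sorted_points:
        return []
    clusters = [[sorted_points[0]]]
    for prev, cur in zip(sorted_points, sorted_points[1:]):
        if cur - prev < sigma:
            clusters[-1].append(cur)   # close enough: same cluster
        else:
            clusters.append([cur])     # gap too large: start a new cluster
    return clusters


def multi_scale_hierarchy(points, sigma0=0.1, weber_fraction=0.5, sigma_max=10.0):
    """Return [(scale, clusters)] from the finest to the coarsest scale.
    Clusters can only merge as the scale grows, so the result is nested."""
    pts = sorted(points)
    return [
        (sigma, clusters_at_scale(pts, sigma))
        for sigma in weber_fechner_scales(sigma0, weber_fraction, sigma_max)
    ]


if __name__ == "__main__":
    data = [0.0, 0.25, 0.5, 5.0, 5.25, 9.0]
    for sigma, clusters in multi_scale_hierarchy(data, weber_fraction=1.0, sigma_max=8.0):
        print(f"scale={sigma:g}: {len(clusters)} cluster(s)")
```

In this sketch, small scales see every point as its own cluster, and groups merge as the scale grows, which mirrors the "gradually emerging" grouping structure the abstract describes. A geometric schedule needs only a logarithmic number of scales to span a wide range, which is one plausible reason such a schedule would suit massive datasets.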

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8889001
DOI: http://dx.doi.org/10.1093/nsr/nwab183

Similar Publications

The use of single-cell combinatorial indexing sequencing via droplet microfluidics presents an attractive approach for balancing cost, scalability, robustness and accessibility. However, existing methods often require tailored protocols for individual modalities, limiting their automation potential and clinical applicability. To address this, we introduce UDA-seq, a universal workflow that integrates a post-indexing step to enhance throughput and systematically adapt existing droplet-based single-cell multimodal methods.

Protocol for genetic analysis of population-scale ultra-low-depth sequencing data.

STAR Protoc

January 2025

BGI Research, Shenzhen 518083, China; Shenzhen Key Laboratory of Transomics Biotechnologies, BGI Research, Shenzhen 518083, China.

Non-invasive prenatal testing (NIPT) not only enables the detection of chromosomal anomalies in fetuses but also generates vast amounts of ultra-low-depth sequencing data, which can be leveraged for population genomic studies. Here, we present a protocol designed for massive ultra-low-depth sequencing datasets. We detail the steps for data processing, quality control, and genotype imputation, followed by genome-wide association study (GWAS) and post-GWAS analyses.

Genome-wide association studies (GWAS) of melanoma risk have identified 68 independent signals at 54 loci. For most loci, specific functional variants and their respective target genes remain to be established. Capture-HiC is an assay that links fine-mapped risk variants to candidate target genes by comprehensively mapping cell-type specific chromatin interactions.

PbsNRs: predict the potential binders and scaffolds for nuclear receptors.

Brief Bioinform

November 2024

Institute of Clinical Science, Zhongshan Hospital, Shanghai Medical College, Shanghai Institute of Infectious Disease and Biosecurity, Intelligent Medicine Institute, School of Life Sciences, Fudan University, No. 180 Fenglin Road, Shanghai 200032, China.

Nuclear receptors (NRs) are a class of essential proteins that regulate the expression of specific genes and are associated with multiple diseases. In silico methods for prescreening potential NR binders with predictive binding ability are highly desired for NR-related drug development but are rarely reported. Here, we present the PbsNRs (Predicting binders and scaffolds for Nuclear Receptors), a user-friendly web server designed to predict the potential NR binders and scaffolds through proteochemometric modeling.

EM-AUC: A Novel Algorithm for Evaluating Anomaly Based Network Intrusion Detection Systems.

Sensors (Basel)

December 2024

Department of Engineering Management and Systems Engineering, George Washington University, Washington, DC 20052, USA.

Effective network intrusion detection using anomaly scores from unsupervised machine learning models depends on the performance of the models. Although unsupervised models do not require labels during the training and testing phases, the assessment of their performance metrics during the evaluation phase still requires comparing anomaly scores against labels. In real-world scenarios, the absence of labels in massive network datasets makes it infeasible to calculate performance metrics.
