Unlabelled: There is an increasing body of work exploring the integration of random projection into algorithms for numerical linear algebra. The primary motivation is to reduce the overall computational cost of processing large datasets. A suitably chosen random projection can be used to embed the original dataset in a lower-dimensional space such that key properties of the original dataset are retained. These algorithms are often referred to as sketching algorithms, as the projected dataset can be used as a compressed representation of the full dataset. We show that random matrix theory, in particular the Tracy-Widom law, is useful for describing the operating characteristics of sketching algorithms in the tall-data regime when the sample size is much greater than the number of variables . Asymptotic large sample results are of particular interest as this is the regime where sketching is most useful for data compression. In particular, we develop asymptotic approximations for the success rate in generating random subspace embeddings and the convergence probability of iterative sketching algorithms. We test a number of sketching algorithms on real large high-dimensional datasets and find that the asymptotic expressions give accurate predictions of the empirical performance.

Supplementary Information: The online version contains supplementary material available at 10.1007/s11222-022-10148-5.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9852177PMC
http://dx.doi.org/10.1007/s11222-022-10148-5DOI Listing

Publication Analysis

Top Keywords

sketching algorithms
20
tracy-widom law
8
random projection
8
original dataset
8
algorithms
7
sketching
5
randomized sketching
4
algorithms tracy-widom
4
law unlabelled
4
unlabelled increasing
4

Similar Publications

Unlabelled: The reflexive translation of symbols in one chemical language to another defined genetics. Yet, the co-linearity of codons and amino acids is so commonplace an idea that few even ask how it arose. Readout is done by two distinct sets of proteins, called aminoacyl-tRNA synthetases (AARS).

View Article and Find Full Text PDF

Background: Patient experience data from social media offer patient-centered perspectives on disease, treatments, and health service delivery. Current guidelines typically rely on systematic reviews, while qualitative health studies are often seen as anecdotal and nongeneralizable. This study explores combining personal health experiences from multiple sources to create generalizable evidence.

View Article and Find Full Text PDF

Cyto-Safe: A Machine Learning Tool for Early Identification of Cytotoxic Compounds in Drug Discovery.

J Chem Inf Model

December 2024

Laboratory for Molecular Modeling and Drug Design (LabMol), Faculdade de Farmácia, Universidade Federal de Goiás, Goiânia, Goiás 74605-220, Brazil.

Cytotoxicity is essential in drug discovery, enabling early evaluation of toxic compounds during screenings to minimize toxicological risks. assays support high-throughput screening, allowing for efficient detection of toxic substances while considerably reducing the need for animal testing. Additionally, AI-based Quantitative Structure-Activity Relationship (AI-QSAR) models enhance early stage predictions by assessing the cytotoxic potential of molecular structures, which helps prioritize low-risk compounds for further validation.

View Article and Find Full Text PDF

Tensor ring rank determination using odd-dimensional unfolding.

Neural Netw

November 2024

School of Automation, Guangdong University of Technology, Guangzhou 510006, China; Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo 103-0027, Japan. Electronic address:

While tensor ring (TR) decomposition methods have been extensively studied, the determination of TR-ranks remains a challenging problem, with existing methods being typically sensitive to the determination of the starting rank (i.e., the first rank to be optimized).

View Article and Find Full Text PDF

Often, bioinformatics uses summary sketches to analyze next-generation sequencing data, but most sketches are not well understood statistically. Under a simple mutation model, Blanca et al. analyzed complete sketches, that is, the complete set of unassembled -mers, from two closely related sequences.

View Article and Find Full Text PDF

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!