J Am Stat Assoc
December 2023
Independence testing is of fundamental importance in modern data analysis, with broad applications in variable selection, graphical models, and causal inference. When the data are high dimensional and the potential dependence signal is sparse, independence testing becomes very challenging without distributional or structural assumptions. In this paper, we propose a general framework for independence testing by first fitting a classifier that distinguishes the joint and product distributions, and then testing the significance of the fitted classifier.
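The two-step idea in this abstract (fit a classifier that separates observed pairs from shuffled pairs, then test whether it does better than chance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the feature map `features`, the small logistic model, the half/half sample split, and the permutation p-value are all hypothetical choices made for the example, and the data are assumed to be scalar.

```python
import math
import random


def features(x, y):
    # Hypothetical feature map: the x*y term lets a linear model detect correlation.
    return [1.0, x, y, x * y]


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, steps=300, lr=0.1):
    # Plain batch gradient descent on the logistic loss (illustrative classifier).
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, label in zip(rows, labels):
            p = sigmoid(sum(wi * ri for wi, ri in zip(w, row)))
            for j, rj in enumerate(row):
                grad[j] += (p - label) * rj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def independence_test(xs, ys, n_perm=200, seed=0):
    """Return a permutation p-value for H0: X and Y are independent."""
    rng = random.Random(seed)
    half = len(xs) // 2
    x_tr, y_tr = list(xs[:half]), list(ys[:half])
    x_te, y_te = list(xs[half:]), list(ys[half:])

    # Step 1: label 1 = observed pairs (joint distribution),
    # label 0 = pairs with y shuffled (approximates the product distribution).
    y_shuffled = y_tr[:]
    rng.shuffle(y_shuffled)
    rows = [features(x, y) for x, y in zip(x_tr, y_tr)]
    rows += [features(x, y) for x, y in zip(x_tr, y_shuffled)]
    labels = [1.0] * len(x_tr) + [0.0] * len(x_tr)
    w = fit_logistic(rows, labels)

    # Step 2: test significance on the held-out half. The statistic is the
    # mean classifier score; under H0 it is invariant to permuting y.
    def mean_score(y_list):
        total = sum(sigmoid(sum(wi * fi for wi, fi in zip(w, features(x, y))))
                    for x, y in zip(x_te, y_list))
        return total / len(x_te)

    observed = mean_score(y_te)
    exceed = 0
    for _ in range(n_perm):
        permuted = y_te[:]
        rng.shuffle(permuted)
        if mean_score(permuted) >= observed:
            exceed += 1
    return (1 + exceed) / (n_perm + 1)
```

Because the classifier is fitted on one half and evaluated on the other, the permutation step only has to recompute scores, not refit the model.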
Ensemble methods such as bagging and random forests are ubiquitous in various fields, from finance to genomics. Despite their prevalence, the question of the efficient tuning of ensemble parameters has received relatively little attention. This paper introduces a cross-validation method, ECV (Extrapolated Cross-Validation), for tuning the ensemble and subsample sizes in randomized ensembles.
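To make the word "extrapolated" concrete, here is a minimal sketch of one way risk can be extrapolated across ensemble sizes. It assumes squared-error risk and an ensemble that averages M exchangeable randomized predictors, in which case the risk has the form risk(M) = a + b / M and is determined by its values at M = 1 and M = 2. The function names and the tolerance rule are hypothetical, and the abstract does not confirm that this is the estimator the paper uses.

```python
def extrapolate_risk(risk_1, risk_2, ensemble_size):
    """Extrapolate squared-error risk to size M from risks at M=1 and M=2.

    Assumes risk(M) = a + b / M (an assumption of this sketch).
    """
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    limit = 2.0 * risk_2 - risk_1      # a: risk of the infinite ensemble
    slope = 2.0 * (risk_1 - risk_2)    # b: excess risk of a single predictor
    return limit + slope / ensemble_size


def smallest_adequate_size(risk_1, risk_2, tolerance, max_size=10000):
    """Smallest M whose extrapolated risk is within `tolerance` of the limit."""
    slope = 2.0 * (risk_1 - risk_2)
    for m in range(1, max_size + 1):
        if slope / m <= tolerance:
            return m
    return max_size
```

In practice `risk_1` and `risk_2` would themselves be estimates, for example from out-of-bag errors, so the extrapolated values inherit their noise.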
Neuropsychiatric genome-wide association studies (GWASs), including those for autism spectrum disorder and schizophrenia, show strong enrichment for regulatory elements in the developing brain. However, prioritizing risk genes and mechanisms is challenging without a unified regulatory atlas. Across 672 diverse developing human brains, we identified 15,752 genes harboring gene, isoform, and/or splicing quantitative trait loci, mapping 3739 to cellular contexts.
Single-cell CRISPR screens (perturb-seq) link genetic perturbations to phenotypic changes in individual cells. The most fundamental task in perturb-seq analysis is to test for association between a perturbation and a count outcome, such as gene expression. We conduct the first-ever comprehensive benchmarking study of association testing methods for low multiplicity-of-infection (MOI) perturb-seq data, finding that existing methods produce excess false positives.
Ongoing climate change has increased temperatures and the frequency of droughts in many parts of the world, potentially intensifying the desiccation risk for insects. Because resisting desiccation becomes more difficult at higher temperatures and lower humidity, avoiding water loss is a key challenge facing terrestrial insects. However, few studies have examined the interactive effects of temperature and environmental humidity on desiccation resistance in insects.
CRISPR genome engineering and single-cell RNA sequencing have accelerated biological discovery. Single-cell CRISPR screens unite these two technologies, linking genetic perturbations in individual cells to changes in gene expression and illuminating regulatory networks underlying diseases. Despite their promise, single-cell CRISPR screens present considerable statistical challenges.
Background: Single-cell RNA-sequencing (scRNA) datasets are becoming increasingly popular in clinical and cohort studies, but there is a lack of methods to investigate differentially expressed (DE) genes among such datasets with numerous individuals. While numerous methods exist to find DE genes for scRNA data from limited individuals, differential-expression testing for large cohorts of case and control individuals using scRNA data poses unique challenges due to substantial effects of human variation, i.e.
In genomics studies, the investigation of gene relationships often brings important biological insights. Today's large, heterogeneous datasets pose new challenges for statisticians because gene relationships are often local: they change from one sample point to another, may exist only in a subset of the sample, and can be nonlinear or even nonmonotone.
Polygenic scores (PGSs) are quantitative metrics for predicting phenotypic values, such as human height or disease status. Some PGS methods require only summary statistics of a relevant genome-wide association study (GWAS) for their score. One such method is Lassosum, which inherits the model selection advantages of Lasso to select a meaningful subset of the GWAS single-nucleotide polymorphisms as predictors from their association statistics.
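As a rough illustration of how a Lasso-type penalty selects a subset of SNPs and how the resulting weights form a score, consider the sketch below. It is not Lassosum itself: it makes the simplifying assumption that SNPs are uncorrelated, in which case the Lasso solution reduces to soft-thresholding each marginal effect, and it ignores the reference-panel correlations a real summary-statistics method has to handle. All names and numbers are hypothetical.

```python
def soft_threshold(value, penalty):
    # Shrink toward zero by `penalty`; effects smaller than the penalty become 0.
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def select_weights(marginal_effects, penalty):
    # Under the (assumed) uncorrelated-SNP simplification, Lasso weights are
    # the soft-thresholded marginal GWAS effects.
    return [soft_threshold(effect, penalty) for effect in marginal_effects]


def polygenic_score(dosages, weights):
    # A polygenic score is a weighted sum of allele dosages (0, 1, or 2 per SNP).
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


# Hypothetical example: three SNPs, one of which is dropped by the penalty.
example_weights = select_weights([0.5, -0.1, -0.4], penalty=0.2)
example_score = polygenic_score([2, 1, 1], example_weights)
```

A larger `penalty` zeroes out more SNPs, which is the model-selection behaviour the abstract attributes to Lasso.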
Using a representative survey with 1317 individuals and 12,815 moral decisions, we elicit Swedish citizens' preferences on how algorithms for self-driving cars should be programmed in cases of unavoidable harm to humans. Participants' choices in different dilemma situations (treatments) show that, at the margin, the average respondent values the lives of passengers and pedestrians equally when both groups are homogeneous and no group is to blame for the dilemma. In comparison, the respondent values the lives of passengers more when the pedestrians violate a social norm, and less when the pedestrians are children.
Polygenic scores (PGS) are quantitative metrics for predicting phenotypic values, such as human height or disease status. Some PGS methods require only summary statistics of a relevant genome-wide association study (GWAS) for their score. One such method is Lassosum, which inherits the model selection advantages of Lasso to select a meaningful subset of the GWAS single nucleotide polymorphisms as predictors from their association statistics.
Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects.
Temperature is one of the most important environmental conditions affecting physiological processes in ectothermic organisms like ants. Yet, we often lack information on how certain physiological traits covary with temperature across time. Here, we test predictions on how one trait, lipid content, covaries with temperature using a conspicuous, ground-dwelling harvester ant.
Sparse principal component analysis is an important technique for simultaneous dimensionality reduction and variable selection with high-dimensional data. In this work we combine the unique geometric structure of the sparse principal component analysis problem with recent advances in convex optimization to develop novel gradient-based sparse principal component analysis algorithms. These algorithms enjoy the same global convergence guarantee as the original alternating direction method of multipliers, and can be more efficiently implemented with the rich toolbox developed for gradient methods from the deep learning literature.
Most variants associated with complex traits and diseases identified by genome-wide association studies (GWAS) map to noncoding regions of the genome with unknown effects. Using ancestrally diverse, biobank-scale GWAS data, massively parallel CRISPR screens, and single-cell transcriptomic and proteomic sequencing, we discovered 124 cis-target genes of 91 noncoding blood trait GWAS loci. Using precise variant insertion through base editing, we connected specific variants with gene expression changes.
Biol Psychiatry Cogn Neurosci Neuroimaging
August 2023
Background: Integrating multiple neuroimaging modalities to identify clusters of individuals and then associating these clusters with psychopathology is a promising approach for understanding neurobiological mechanisms that underlie psychopathology and the extent to which these features are associated with clinical symptoms.
Methods: We leveraged neuroimaging data from T1-weighted, diffusion-weighted, and resting-state functional magnetic resonance images from the Adolescent Brain Cognitive Development (ABCD) Study (N = 8035) and used similarity network fusion and spectral clustering to identify subgroups of participants. We examined neuroimaging measures as a function of clustering profiles using 1, 2, or 3 imaging modalities (i.
Genomic regulatory elements active in the developing human brain are notably enriched in genetic risk for neuropsychiatric disorders, including autism spectrum disorder (ASD), schizophrenia, and bipolar disorder. However, prioritizing the specific risk genes and candidate molecular mechanisms underlying these genetic enrichments has been hindered by the lack of a single unified large-scale gene regulatory atlas of human brain development. Here, we uniformly process and systematically characterize gene, isoform, and splicing quantitative trait loci (xQTLs) in 672 fetal brain samples from unique subjects across multiple ancestral populations.
Proc Natl Acad Sci U S A
December 2022
Recent advances in single-cell technologies enable joint profiling of multiple omics. These profiles can reveal the complex interplay of different regulatory layers in single cells; still, new challenges arise when integrating datasets with some features shared across experiments and others exclusive to a single source; combining information across these sources is called mosaic integration. The difficulties lie in imputing missing molecular layers to build a self-consistent atlas, finding a common latent space, and transferring learning to new data sources robustly.
Posttranscriptional RNA modifications by adenosine-to-inosine (A-to-I) editing are abundant in the brain, yet elucidating functional sites remains challenging. To bridge this gap, we investigate spatiotemporal and genetically regulated A-to-I editing sites across prenatal and postnatal stages of human brain development. More than 10,000 spatiotemporally regulated A-to-I sites were identified that occur predominantly in 3' UTRs and introns, as well as 37 sites that recode amino acids in protein coding regions with precise changes in editing levels across development.
Western corn rootworm, Diabrotica virgifera virgifera, is one of the most economically important crop pests in the world, with estimated damage and control costs exceeding $1 billion USD annually. Despite an abundance of research devoted to studying rootworm biology in the central Corn Belt of the United States, key aspects of their thermal ecology remain poorly understood. Here we address this knowledge gap by measuring critical thermal limits, knock-down resistance, and chill coma recovery.
Testing the significance of predictors in a regression model is one of the most important topics in statistics. This problem is especially difficult without any parametric assumptions on the data. This paper aims to test the null hypothesis that given confounding variables Z, X does not significantly contribute to the prediction of Y under the model-free setting, where X and Z are possibly high dimensional.