Publications by Peter Bickel | LitMetric

Publications by authors named "Peter Bickel"

Dissecting gene expression heterogeneity: generalized Pearson correlation squares and the K-lines clustering algorithm.

Jingyi Jessica Li Heather J Zhou Peter J Bickel Xin Tong

J Am Stat Assoc

May 2024

Motivated by the pressing needs for dissecting heterogeneous relationships in gene expression data, here we generalize the squared Pearson correlation to capture a mixture of linear dependences between two real-valued variables, with or without an index variable that specifies the line memberships. We construct the generalized Pearson correlation squares by focusing on three aspects: variable exchangeability, no parametric model assumptions, and inference of population-level parameters. To compute the generalized Pearson correlation square from a sample without a line-membership specification, we develop a K-lines clustering algorithm to find clusters that exhibit distinct linear dependences, where K can be chosen in a data-adaptive way.
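
As a rough illustration of the clustering idea (not the authors' implementation), a K-lines style procedure can alternate between fitting one line per cluster and reassigning each point to its nearest line. The sketch below assumes perpendicular distance and a total-least-squares fit, which treat the two variables symmetrically. The function names, the user-supplied initial labels, and the fixed number of lines `k` are all assumptions made for this example.

```python
import numpy as np

def fit_line(points):
    """Total-least-squares line fit: returns (center, unit direction)."""
    center = points.mean(axis=0)
    # The first right-singular vector of the centered data is the
    # direction of largest variance, i.e. the fitted line's direction.
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    return center, vt[0]

def perp_dist(points, center, direction):
    """Perpendicular distance from each point to the given line."""
    diff = points - center
    along = diff @ direction
    return np.linalg.norm(diff - np.outer(along, direction), axis=1)

def k_lines(points, init_labels, k, max_iter=100):
    """Alternate line fitting and nearest-line reassignment (hypothetical sketch)."""
    labels = np.asarray(init_labels).copy()
    for _ in range(max_iter):
        dists = np.full((len(points), k), np.inf)
        for j in range(k):
            members = points[labels == j]
            if len(members) >= 2:  # at least two points are needed to define a line
                dists[:, j] = perp_dist(points, *fit_line(members))
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels
```

Calling `k_lines(points, init_labels, k)` with an `(n, 2)` array returns one cluster label per point. Like other alternating schemes, the result depends on the initial labels, so several starts may be needed in practice.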

Two provably consistent divide-and-conquer clustering algorithms for large networks.

Soumendu Sundar Mukherjee Purnamrita Sarkar Peter J Bickel

Proc Natl Acad Sci U S A

November 2021

In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms that perform clustering on several small subgraphs and finally patch the results into a single clustering. The main advantage of these algorithms is that they significantly bring down the computational cost of traditional algorithms, including spectral clustering, semidefinite programs, modularity-based methods, likelihood-based methods, etc.

David Blackwell, 1919-2010: An explorer in mathematics and statistics.

Proc Natl Acad Sci U S A

November 2020

NETWORK MODELLING OF TOPOLOGICAL DOMAINS USING HI-C DATA.

Y X Rachel Wang Purnamrita Sarkar Oana Ursu Anshul Kundaje Peter J Bickel

Ann Appl Stat

September 2019

Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs.

GeneFishing to reconstruct context specific portraits of biological processes.

Ke Liu Elizabeth Theusch Yun Zhou Tal Ashuach Andrea C Dose Peter J Bickel

Proc Natl Acad Sci U S A

September 2019

Rapid advances in genomic technologies have led to a wealth of diverse data, from which novel discoveries can be gleaned through the application of robust statistical and computational methods. Here, we describe GeneFishing, a semisupervised computational approach to reconstruct context-specific portraits of biological processes by leveraging gene-gene coexpression information. GeneFishing incorporates multiple high-dimensional statistical ideas, including dimensionality reduction, clustering, subsampling, and results aggregation, to produce robust results.

Metalearners for estimating heterogeneous treatment effects using machine learning.

Sören R Künzel Jasjeet S Sekhon Peter J Bickel Bin Yu

Proc Natl Acad Sci U S A

March 2019

There is growing interest in estimating and analyzing heterogeneous treatment effects in experimental and observational studies. We describe a number of metaalgorithms that can take advantage of any supervised learning or regression method in machine learning and statistics to estimate the conditional average treatment effect (CATE) function. Metaalgorithms build on base algorithms, such as random forests (RFs), Bayesian additive regression trees (BARTs), or neural networks, to estimate the CATE, a function that the base algorithms are not designed to estimate directly.
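
To make the notion of a metaalgorithm concrete, here is a minimal sketch of one simple variant, often called a T-learner: fit one outcome model on treated units and another on control units, then take the difference of their predictions as the CATE estimate. The ordinary least squares base learner and all function names are assumptions chosen for illustration; any regression method could be plugged in instead, and this is not presented as the paper's own code.

```python
import numpy as np

def fit_linear(X, y):
    """Base learner assumed for illustration: OLS with an intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Return a prediction function for new covariate rows.
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def t_learner(X, y, treated, base_fit=fit_linear):
    """Estimate the CATE as mu1(x) - mu0(x) using two separately fitted models.

    X: (n, d) covariates; y: (n,) outcomes; treated: (n,) boolean mask.
    """
    mu1 = base_fit(X[treated], y[treated])    # outcome model for treated units
    mu0 = base_fit(X[~treated], y[~treated])  # outcome model for control units
    return lambda X_new: mu1(X_new) - mu0(X_new)
```

The base learner is passed in as `base_fit`, which mirrors the point of the abstract: the metaalgorithm is agnostic to the underlying regression method.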

Exploiting regulatory heterogeneity to systematically identify enhancers with high accuracy.

Hamutal Arbel Sumanta Basu William W Fisher Ann S Hammonds Kenneth H Wan Peter J Bickel

Proc Natl Acad Sci U S A

January 2019

Identifying functional enhancer elements in metazoan systems is a major challenge. Large-scale validation of enhancers predicted by ENCODE reveals false-positive rates of at least 70%. We used the pregastrula-patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held-out data results from heterogeneity of functional signatures in enhancer elements.

Projection pursuit in high dimensions.

Peter J Bickel Gil Kur Boaz Nadler

Proc Natl Acad Sci U S A

September 2018

Projection pursuit is a classical exploratory data analysis method to detect interesting low-dimensional structures in multivariate data. Originally, projection pursuit was applied mostly to data of moderately low dimension. Motivated by contemporary applications, we here study its properties in high-dimensional settings.

Extensive cross-regulation of post-transcriptional regulatory networks in Drosophila.

Marcus H Stoiber Sara Olson Gemma E May Michael O Duff Jan Manent Peter J Bickel

Genome Res

November 2015

In eukaryotic cells, RNAs exist as ribonucleoprotein particles (RNPs). Despite the importance of these complexes in many biological processes, including splicing, polyadenylation, stability, transportation, localization, and translation, their compositions are largely unknown. We affinity-purified 20 distinct RNA-binding proteins (RBPs) from cultured Drosophila melanogaster cells under native conditions and identified both the RNA and protein compositions of these RNP complexes.

Comparative analysis of regulatory information and circuits across distant species.

Alan P Boyle Carlos L Araya Cathleen Brdlik Philip Cayting Chao Cheng Peter J Bickel

Nature

August 2014

Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.

Comparative analysis of the transcriptome across distant species.

Mark B Gerstein Joel Rozowsky Koon-Kiu Yan Daifeng Wang Chao Cheng Peter J Bickel

Nature

August 2014

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly.

Comparative validation of the D. melanogaster modENCODE transcriptome annotation.

Zhen-Xia Chen David Sturgill Jiaxin Qu Huaiyang Jiang Soo Park Peter J Bickel

Genome Res

July 2014

Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data.

Comparison of D. melanogaster and C. elegans developmental stages, tissues, and cells by modENCODE RNA-seq data.

Jingyi Jessica Li Haiyan Huang Peter J Bickel Steven E Brenner

Genome Res

July 2014

We report a statistical study to discover transcriptome similarity of developmental stages from D. melanogaster and C. elegans using modENCODE RNA-seq data.

System wide analyses have underestimated protein abundances and the importance of transcription in mammals.

Jingyi Jessica Li Peter J Bickel Mark D Biggin

PeerJ

April 2014

Large scale surveys in mammalian tissue culture cells suggest that the protein expressed at the median abundance is present at 8,000-16,000 molecules per cell and that differences in mRNA expression between genes explain only 10-40% of the differences in protein levels. We find, however, that these surveys have significantly underestimated protein abundances and the relative importance of transcription. Using individual measurements for 61 housekeeping proteins to rescale whole proteome data from Schwanhausser et al.

Diversity and dynamics of the Drosophila transcriptome.

James B Brown Nathan Boley Robert Eisman Gemma E May Marcus H Stoiber Peter J Bickel

Nature

August 2014

Animal transcriptomes are dynamic, with each cell type, tissue and organ system expressing an ensemble of transcript isoforms that give rise to substantial diversity. Here we have identified new genes, transcripts and proteins using poly(A)+ RNA sequencing from Drosophila melanogaster in cultured cell lines, dissected organ systems and under environmental perturbations. We found that a small set of mostly neural-specific genes has the potential to encode thousands of transcripts each through extensive alternative promoter usage and RNA splicing.

Navigating and mining modENCODE data.

Nathan Boley Kenneth H Wan Peter J Bickel Susan E Celniker

Methods

June 2014

modENCODE was a 5-year, NHGRI-funded project (2007-2012) to map the function of every base in the genomes of worms and flies, characterizing positions of modified histones and other chromatin marks, origins of DNA replication, RNA transcripts, and the transcription factor binding sites that control gene expression. Here we describe the Drosophila modENCODE datasets and how best to access and use them for genome-wide and individual gene studies.

Genome-guided transcript assembly by integrative analysis of RNA sequence data.

Nathan Boley Marcus H Stoiber Benjamin W Booth Kenneth H Wan Roger A Hoskins Peter J Bickel

Nat Biotechnol

April 2014

The identification of full length transcripts entirely from short-read RNA sequencing data (RNA-seq) remains a challenge in the annotation of genomes. Here we describe an automated pipeline for genome annotation that integrates RNA-seq and gene-boundary data sets, which we call Generalized RNA Integration Tool, or GRIT. Applying GRIT to Drosophila melanogaster short-read RNA-seq, cap analysis of gene expression (CAGE) and poly(A)-site-seq data collected for the modENCODE project, we recovered the vast majority of previously annotated transcripts and doubled the total number of transcripts cataloged.

On robust regression with high-dimensional predictors.

Noureddine El Karoui Derek Bean Peter J Bickel Chinghway Lim Bin Yu

Proc Natl Acad Sci U S A

September 2013

We study regression M-estimates in the setting where $p$, the number of covariates, and $n$, the number of observations, are both large, but $p \le n$. We find an exact stochastic representation for the distribution of $\hat{\beta} = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho(Y_i - X_i'\beta)$ at fixed $p$ and $n$ under various assumptions on the objective function $\rho$ and our statistical model. A scalar random variable whose deterministic limit $r_\rho(\kappa)$ can be studied when $p/n \to \kappa > 0$ plays a central role in this representation.
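
For readers unfamiliar with M-estimates, the sketch below shows one standard way to compute the minimizer of a sum of losses applied to residuals: iteratively reweighted least squares with the Huber loss. It illustrates only the definition of the estimator, not the paper's high-dimensional analysis. The choice of the Huber loss, the threshold `delta`, the absence of any residual scale estimate, and the function name are all assumptions made for this example.

```python
import numpy as np

def huber_m_estimate(X, y, delta=1.345, n_iter=50, tol=1e-10):
    """Minimize sum_i rho(y_i - x_i' beta) for the Huber loss via IRLS.

    X: (n, p) design matrix (include a column of ones for an intercept);
    y: (n,) responses; delta: residual size where the loss turns linear.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from least squares
    for _ in range(n_iter):
        r = y - X @ beta
        # Huber weights psi(r)/r: 1 for small residuals, delta/|r| for large ones.
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        # Weighted least squares step, solved as OLS on rescaled rows.
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    return beta
```

With a very large `delta` every weight equals 1 and the procedure reduces to ordinary least squares, which is the special case of a squared-error objective.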

Optimal M-estimation in high-dimensional regression.

Derek Bean Peter J Bickel Noureddine El Karoui Bin Yu

Proc Natl Acad Sci U S A

September 2013

We consider, in the modern setting of high-dimensional statistics, the classic problem of optimizing the objective function in regression using M-estimates when the error distribution is assumed to be known. We propose an algorithm to compute this optimal objective function that takes into account the dimensionality of the problem. Although optimality is achieved under assumptions on the design matrix that will not always be satisfied, our analysis reveals generally interesting families of dimension-dependent objective functions.

DNA regions bound at low occupancy by transcription factors do not drive patterned reporter gene expression in Drosophila.

William W Fisher Jingyi Jessica Li Ann S Hammonds James B Brown Barret D Pfeiffer Peter J Bickel

Proc Natl Acad Sci U S A

December 2012

In animals, each sequence-specific transcription factor typically binds to thousands of genomic regions in vivo. Our previous studies of 20 transcription factors show that most genomic regions bound at high levels in Drosophila blastoderm embryos are known or probable functional targets, but genomic regions occupied only at low levels have characteristics suggesting that most are not involved in the cis-regulation of transcription. Here we use transgenic reporter gene assays to directly test the transcriptional activity of 104 genomic regions bound at different levels by the 20 transcription factors.

ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.

Stephen G Landt Georgi K Marinov Anshul Kundaje Pouya Kheradpour Florencia Pauli Peter Bickel

Genome Res

September 2012

Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment.

Long noncoding RNAs are rarely translated in two human cell lines.

Balázs Bánfai Hui Jia Jainab Khatun Emily Wood Brian Risk Peter Bickel

Genome Res

September 2012

Data from the Encyclopedia of DNA Elements (ENCODE) project show over 9640 human genome loci classified as long noncoding RNAs (lncRNAs), yet only ~100 have been deeply characterized to determine their role in the cell. To measure the protein-coding output from these RNAs, we jointly analyzed two recent data sets produced in the ENCODE project: tandem mass spectrometry (MS/MS) data mapping expressed peptides to their encoding genomic loci, and RNA-seq data generated by ENCODE in long polyA+ and polyA- fractions in the cell lines K562 and GM12878. We used the machine-learning algorithm RuleFit3 to regress the peptide data against RNA expression data.

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors.

Kevin Y Yip Chao Cheng Nitin Bhardwaj James B Brown Jing Leng Peter Bickel

Genome Biol

September 2012

Background: Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.

Systematic evaluation of factors influencing ChIP-seq fidelity.

Yiwen Chen Nicolas Negre Qunhua Li Joanna O Mieczkowska Matthew Slattery Peter J Bickel

Nat Methods

June 2012

We evaluated how variations in sequencing depth and other parameters influence interpretation of chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. Using Drosophila melanogaster S2 cells, we generated ChIP-seq data sets for a site-specific transcription factor (Suppressor of Hairy-wing) and a histone modification (H3K36me3). We detected a chromatin-state bias: open chromatin regions yielded higher coverage, which led to false positives if not corrected.

Nonparametric variable selection and modeling for spatial and temporal regulatory networks.

Anil Aswani Mark D Biggin Peter Bickel Claire Tomlin

Methods Cell Biol

July 2012

Because of the increasing diversity of data sets and measurement techniques in biology, a growing spectrum of modeling methods is being developed. It is generally recognized that it is critical to pick the appropriate method to exploit the amount and type of biological data available for a given system. Here, we describe a method for use in situations where temporal data from a network is collected over multiple time points, and in which little prior information is available about the interactions, mathematical structure, and statistical distribution of the network.
