Publications by authors named "Jeffrey Leek"

Computational data-centric research techniques play a prevalent and multi-disciplinary role in life science research. In the past, scientists in wet labs generated the data, and computational researchers focused on creating tools for the analysis of those data. Computational researchers are now becoming more independent and taking leadership roles within biomedical projects, leveraging the increased availability of public data.

Motivation: Software is vital for the advancement of biology and medicine. Impact evaluations of scientific software have primarily emphasized traditional citation metrics of associated papers, despite these metrics inadequately capturing the dynamic picture of impact and despite challenges with improper citation.

Results: To understand how software developers evaluate their tools, we conducted a survey of participants in the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI).

Data science education provides tremendous opportunities but remains inaccessible to many communities. Increasing the accessibility of data science to these communities not only benefits the individuals entering data science, but also increases the field's innovation and potential impact as a whole. Education is the most scalable solution to meet these needs, but many data science educators lack formal training in education.

Tens of thousands of RNA-sequencing experiments comprising hundreds of thousands of individual samples have now been performed. These data represent a broad range of experimental conditions, sequencing technologies, and hypotheses under study. The Recount project has aggregated and uniformly processed hundreds of thousands of publicly available RNA-seq samples.

Software is vital for the advancement of biology and medicine. Through analysis of usage and impact metrics of software, developers can help determine user and community engagement. These metrics can be used to justify additional funding, encourage additional use, and identify unanticipated use cases.

Data science and informatics tools are developing at a blistering rate, but their users often lack the educational background or resources to efficiently apply the methods to their research. Training resources and vignettes that accompany these tools often fall out of date because funding does not prioritize their maintenance, leaving teams little time to devote to them. Our group has developed Open-source Tools for Training Resources (OTTR) to offer greater efficiency and flexibility for creating and maintaining these training resources.

The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts.

We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features.

Article Synopsis
  • Researchers aimed to create a detailed dataset of genes and proteins in megakaryocytes (MKs) derived from induced pluripotent stem cells (iPSCs) to better understand their biology.
  • They successfully derived MKs from iPSCs taken from individuals of diverse backgrounds and confirmed that these cells expressed known markers important for platelet function, although expression levels varied by individual.
  • Findings revealed that certain genes and proteins linked to platelet function were associated with higher MK marker expression, with differences noted based on sex and race, suggesting that individual-specific factors influence MK differentiation from iPSCs.

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference.
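
The concern raised here can be illustrated with a small simulation. The sketch below is not the authors' correction method; it only shows, under assumed simulated data, how treating predictions as if they were observed outcomes can understate uncertainty. The data-generating model, the "pre-trained" predictor, and all names (`ols_slope_and_se`, `y_pred`, and so on) are hypothetical.

```python
import numpy as np


def ols_slope_and_se(x, y):
    """Slope and its standard error from a simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    xc = x - x.mean()
    sxx = np.sum(xc ** 2)
    slope = np.sum(xc * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    sigma2 = np.sum(resid ** 2) / (n - 2)  # residual variance estimate
    return slope, np.sqrt(sigma2 / sxx)


# Assumed simulation: the outcome depends on two covariates plus noise.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
z = rng.normal(size=n)
y_obs = 1.0 * x + 1.0 * z + rng.normal(scale=2.0, size=n)

# Hypothetical prediction model that recovers the signal but none of the noise.
y_pred = 1.0 * x + 1.0 * z

slope_obs, se_obs = ols_slope_and_se(x, y_obs)
slope_pred, se_pred = ols_slope_and_se(x, y_pred)

print(f"observed outcomes:  slope={slope_obs:.3f}, se={se_obs:.3f}")
print(f"predicted outcomes: slope={slope_pred:.3f}, se={se_pred:.3f}")
```

Because the predictions carry less variability than the real outcomes, the naive standard error from the second regression is smaller, which would make downstream tests look more certain than they should.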

Genome-wide association studies have identified common variants associated with platelet-related phenotypes, but because these variants are largely intronic or intergenic, their link to platelet biology is unclear. In 290 normal subjects from the GeneSTAR Research Study (110 African Americans [AAs] and 180 European Americans [EAs]), we generated whole-genome sequence data from whole blood and RNA sequence data from extracted nonribosomal RNA from 185 induced pluripotent stem cell-derived megakaryocyte (MK) cell lines (platelet precursor cells) and 290 blood platelet samples from these subjects. Using eigenMT software to select the peak single-nucleotide polymorphism (SNP) for each expressed gene, and meta-analyzing the results of AAs and EAs, we identify (q-value < 0.

Breakthroughs in artificial intelligence (AI) hold enormous potential, as AI can automate complex tasks and even surpass human performance. In their study, McKinney et al. showed the high potential of AI for breast cancer screening.

Article Synopsis
  • Long noncoding RNAs (lncRNAs) make up most of the transcripts in mammalian genomes, but their functions are still not well understood.
  • The FANTOM6 project systematically knocked down 285 lncRNAs in human dermal fibroblasts and analyzed changes in cell growth, shape, and gene expression using CAGE techniques.
  • This study provides a comprehensive lncRNA knockdown data set (over 1000 CAGE sequencing libraries) and reveals important findings about their roles and impact on various cellular pathways.

Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study.

We performed an empirical study of the perceived quality of scientific graphics produced by beginning R users in two plotting systems: the base graphics package ("base R") and the ggplot2 add-on package. In our experiment, students taking a data science course on the Coursera platform were randomized to complete identical plotting exercises using either base R or ggplot2. This exercise involved creating two plots: one bivariate scatterplot and one plot of a multivariate relationship that necessitated using color or panels.

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Article Synopsis
  • Gene co-expression networks help identify the relationships between genes, which is vital for predicting their functions and understanding diseases.
  • Technical and biological artifacts in gene expression data can interfere with common methods used to reconstruct these networks, leading to inaccurate results.
  • By applying principal component correction to gene expression data before analyzing networks, we can significantly reduce false discoveries, as demonstrated using data from the GTEx project across various tissues.
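
The idea of principal component correction summarized above can be sketched as follows. This is a minimal illustration under assumed simulated data, not the authors' pipeline: the number of components removed, the simulated "batch" artifact, and the helper names (`remove_top_pcs`, `coexpression`, `mean_abs_offdiag`) are all hypothetical choices made for the example.

```python
import numpy as np


def remove_top_pcs(expr, n_pcs):
    """Return expression residuals after removing the top principal components.

    expr is a samples x genes matrix; columns are centered before the SVD.
    """
    x = np.asarray(expr, dtype=float)
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s_kept = s.copy()
    s_kept[:n_pcs] = 0.0  # zero out the leading components
    return (u * s_kept) @ vt


def coexpression(expr):
    """Gene-by-gene Pearson correlation matrix (genes are columns)."""
    return np.corrcoef(expr, rowvar=False)


def mean_abs_offdiag(corr):
    """Average absolute correlation over all distinct gene pairs."""
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.mean(np.abs(corr[mask])))


# Assumed simulation: 20 independent genes sharing one strong technical artifact.
rng = np.random.default_rng(1)
batch = rng.normal(size=(200, 1))
expr = 3.0 * batch + rng.normal(size=(200, 20))

raw_level = mean_abs_offdiag(coexpression(expr))
corrected_level = mean_abs_offdiag(coexpression(remove_top_pcs(expr, 1)))

print(f"mean |correlation| before correction: {raw_level:.3f}")
print(f"mean |correlation| after correction:  {corrected_level:.3f}")
```

In this toy setting the genes are unrelated by construction, so the strong correlations in the uncorrected data come entirely from the shared artifact, and removing the leading component shrinks them. How many components to remove on real data is a separate question that this sketch does not address.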

Modern scientific studies from many diverse areas of research abound with multiple hypothesis testing concerns. The false discovery rate (FDR) is one of the most commonly used approaches for measuring and controlling error rates when performing multiple tests. Adaptive FDRs rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested.
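
As a rough illustration of how an adaptive FDR procedure uses the estimated proportion of nulls, the sketch below pairs the standard Benjamini-Hochberg adjustment with a Storey-type estimate of that proportion. It is a generic textbook version, not the estimator proposed in the paper; the tuning value `lam=0.5` and the function names are assumptions made for the example.

```python
import numpy as np


def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses.

    Null p-values are roughly uniform, so the share above lam, rescaled
    by 1 / (1 - lam), approximates the null proportion (capped at 1).
    """
    p = np.asarray(pvalues, dtype=float)
    return min(1.0, float(np.mean(p > lam)) / (1.0 - lam))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def adaptive_adjust(pvalues, lam=0.5):
    """Adaptive adjustment: scale BH-adjusted values by the estimated null proportion."""
    return storey_pi0(pvalues, lam) * bh_adjust(pvalues)


pvals = [0.001, 0.5, 0.9, 0.2]
print("estimated null proportion:", storey_pi0(pvals))
print("BH adjusted:", bh_adjust(pvals))
print("adaptive adjusted:", adaptive_adjust(pvals))
```

When the estimated null proportion is below 1, the adaptive values are smaller than the plain Benjamini-Hochberg ones, which is where the extra power of adaptive procedures comes from.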

Most researchers do not deliberately claim causal results in an observational study. But do we lead our readers to draw a causal conclusion unintentionally by explaining why significant correlations and relationships may exist? Here we perform a randomized controlled experiment in a massive open online course run in 2013 that teaches data analysis concepts to test the hypothesis that explaining an analysis will lead readers to interpret an inferential analysis as causal. We test this hypothesis with a single example of an observational study on the relationship between smoking and cancer.

Genome-wide association studies have identified 108 schizophrenia risk loci, but biological mechanisms for individual loci are largely unknown. Using developmental, genetic and illness-based RNA sequencing expression analysis in human brain, we characterized the human brain transcriptome around these loci and found enrichment for developmentally regulated genes with novel examples of shifting isoform usage across pre- and postnatal life. We found widespread expression quantitative trait loci (eQTLs), including many with transcript specificity and previously unannotated sequence that were independently replicated.

Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data.

Within the statistics community, a number of guiding principles for sharing data have emerged; however, these principles are not always made clear to collaborators generating the data. To bridge this divide, we have established a set of guidelines for sharing data. In these, we highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of giving the statistician all essential experimental information along with any pre-processing steps that were carried out.
