Publications by authors named "Jeffrey Leek"

Computational data-centric research techniques play a prevalent and multi-disciplinary role in life science research. In the past, scientists in wet labs generated the data, and computational researchers focused on creating tools for the analysis of those data. Computational researchers are now becoming more independent and taking leadership roles within biomedical projects, leveraging the increased availability of public data.

Motivation: Software is vital for the advancement of biology and medicine. Impact evaluations of scientific software have primarily emphasized traditional citation metrics of associated papers, despite these metrics inadequately capturing the dynamic picture of impact and despite challenges with improper citation.

Results: To understand how software developers evaluate their tools, we conducted a survey of participants in the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI).

Data science education provides tremendous opportunities but remains inaccessible to many communities. Increasing the accessibility of data science to these communities not only benefits the individuals entering data science, but also increases the field's innovation and potential impact as a whole. Education is the most scalable solution to meet these needs, but many data science educators lack formal training in education.

Tens of thousands of RNA-sequencing experiments comprising hundreds of thousands of individual samples have now been performed. These data represent a broad range of experimental conditions, sequencing technologies, and hypotheses under study. The Recount project has aggregated and uniformly processed hundreds of thousands of publicly available RNA-seq samples.

Software is vital for the advancement of biology and medicine. Through analysis of usage and impact metrics of software, developers can help determine user and community engagement. These metrics can be used to justify additional funding, encourage additional use, and identify unanticipated use cases.

Data science and informatics tools are developing at a blistering rate, but their users often lack the educational background or resources to efficiently apply the methods to their research. Training resources and vignettes that accompany these tools often fall out of date because their maintenance is not prioritized by funding, giving teams little time to devote to such endeavors. Our group has developed Open-source Tools for Training Resources (OTTR) to offer greater efficiency and flexibility for creating and maintaining these training resources.

The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts.

We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features.

Article Synopsis
  • Researchers aimed to create a detailed dataset of genes and proteins in megakaryocytes (MKs) derived from induced pluripotent stem cells (iPSCs) to better understand their biology.
  • They successfully derived MKs from iPSCs taken from individuals of diverse backgrounds and confirmed that these cells expressed known markers important for platelet function, although expression levels varied by individual.
  • Findings revealed that certain genes and proteins linked to platelet function were associated with higher MK marker expression, with differences noted based on sex and race, suggesting that individual-specific factors influence MK differentiation from iPSCs.

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference.

Article Synopsis
  • Genome-wide studies have found common genetic variants linked to platelet traits, but their biological significance is unclear due to their locations in non-coding regions.
  • In a study involving 290 participants, whole-genome and RNA sequencing were used to analyze induced pluripotent stem cell-derived megakaryocytes and blood platelets, leading to the identification of numerous cis-expression quantitative trait loci (eQTLs) specific to both cell types.
  • A significant majority of the eQTLs found were unique to megakaryocytes and platelets when compared to other tissues, suggesting a specialized regulatory mechanism affecting gene expression linked to platelet biology.

Breakthroughs in artificial intelligence (AI) hold enormous potential, as AI can automate complex tasks and even exceed human performance. In their study, McKinney et al. showed the high potential of AI for breast cancer screening.

Article Synopsis
  • Long noncoding RNAs (lncRNAs) make up most of the transcripts in mammalian genomes, but their functions are still not well understood.
  • The FANTOM6 project systematically knocked down 285 lncRNAs in human dermal fibroblasts and analyzed changes in cell growth, shape, and gene expression using CAGE techniques.
  • This study provides a comprehensive lncRNA knockdown data set (over 1000 CAGE sequencing libraries) and reveals important findings about their roles and impact on various cellular pathways.

Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study.

Article Synopsis
  • The study assessed how beginning R users perceived the quality of graphics created using two plotting systems: base R and ggplot2.
  • Both methods were used by students in a data science course to create similar plots, after which they evaluated each other's work based on clarity and labeling.
  • Results showed that while graphics from both systems were rated similarly, ggplot2 plots were generally considered clearer, particularly for more complex multivariate relationships.
Article Synopsis
  • An amendment to the original paper has been published.
  • A link to access this amendment can be found at the top of the paper.
  • Readers are encouraged to check the link for updated information.
Article Synopsis
  • Gene co-expression networks help identify the relationships between genes, which is vital for predicting their functions and understanding diseases.
  • Technical and biological artifacts in gene expression data can interfere with common methods used to reconstruct these networks, leading to inaccurate results.
  • By applying principal component correction to gene expression data before analyzing networks, we can significantly reduce false discoveries, as demonstrated using data from the GTEx project across various tissues.
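The principal component correction described in the synopsis above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a samples-by-genes expression matrix, removes the top principal components with a plain SVD, and uses Pearson correlation as a stand-in for the network reconstruction step. The function names `remove_top_pcs` and `coexpression`, and the choice of how many components to remove, are hypothetical.

```python
import numpy as np

def remove_top_pcs(expr, n_pcs):
    """Return residual expression after removing the top n_pcs principal components.

    expr is a (samples x genes) array; each gene is mean-centered first.
    """
    centered = expr - expr.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s_kept = s.copy()
    s_kept[:n_pcs] = 0.0          # zero out the leading components
    return (u * s_kept) @ vt      # reconstruct from the remaining components

def coexpression(expr, n_pcs):
    """Gene-by-gene Pearson correlation computed on PC-corrected residuals."""
    residuals = remove_top_pcs(expr, n_pcs)
    return np.corrcoef(residuals, rowvar=False)
```

With `n_pcs=0` the function returns the uncorrected correlation matrix, which makes it easy to compare the two on the same data. How many components to remove is a modeling decision that this sketch leaves to the caller.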

Modern scientific studies from many diverse areas of research abound with multiple hypothesis testing concerns. The false discovery rate (FDR) is one of the most commonly used approaches for measuring and controlling error rates when performing multiple tests. Adaptive FDRs rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested.
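To make the role of the null proportion concrete, here is a minimal sketch of one common adaptive procedure. It is an assumption for illustration, not necessarily the method of this paper: the proportion of nulls is estimated with a Storey-type estimator (with the +1 finite-sample adjustment) and plugged into the Benjamini-Hochberg step-up rule. The function names and the tuning parameter `lam` are hypothetical choices.

```python
import numpy as np

def storey_pi0(pvalues, lam=0.5):
    """Estimate the proportion of true nulls as (#{p > lam} + 1) / (m * (1 - lam)), capped at 1."""
    p = np.asarray(pvalues, dtype=float)
    estimate = (np.sum(p > lam) + 1) / (p.size * (1.0 - lam))
    return min(1.0, float(estimate))

def adaptive_bh(pvalues, alpha=0.05, lam=0.5):
    """Benjamini-Hochberg step-up with thresholds scaled by the estimated null proportion.

    Returns (reject, pi0): a boolean mask in the input order and the estimate used.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    pi0 = storey_pi0(p, lam)
    order = np.argsort(p)
    # The usual BH thresholds alpha * k / m are relaxed by dividing by pi0 <= 1.
    thresholds = alpha * np.arange(1, m + 1) / (m * pi0)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest sorted index meeting its threshold
        reject[order[:k + 1]] = True
    return reject, pi0
```

When the estimate is 1 the procedure reduces to ordinary Benjamini-Hochberg; smaller estimates loosen the thresholds and allow more rejections at the same nominal level.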

Article Synopsis
  • Most researchers unintentionally lead readers to interpret observational studies as causal by explaining significant correlations.
  • A randomized controlled experiment in a 2013 online course tested whether providing explanations influenced readers' perceptions of an inferential analysis on smoking and cancer, resulting in a 15.2% increase in causal interpretations.
  • The findings suggest that explanations in scientific studies could mislead audiences, indicating a need for more careful qualification of such explanations in future research.

Genome-wide association studies have identified 108 schizophrenia risk loci, but biological mechanisms for individual loci are largely unknown. Using developmental, genetic and illness-based RNA sequencing expression analysis in human brain, we characterized the human brain transcriptome around these loci and found enrichment for developmentally regulated genes with novel examples of shifting isoform usage across pre- and postnatal life. We found widespread expression quantitative trait loci (eQTLs), including many with transcript specificity and previously unannotated sequence that were independently replicated.

Article Synopsis
  • Publicly available genomic data is crucial for studying human variation and diseases, but often lacks proper labeling and phenotype information, limiting its usefulness.
  • The researchers developed an in silico phenotyping method that utilizes well-annotated data to predict missing phenotypes from genomic measurements, focusing on 70,000 RNA-seq samples processed in the recount2 project.
  • Their approach helps to analyze public genomic data more effectively, allowing for the exploration of biological traits and experimental conditions, with the methods and predictions now accessible through the phenopredict and recount R packages.

Within the statistics community, a number of guiding principles for sharing data have emerged; however, these principles are not always made clear to collaborators generating the data. To bridge this divide, we have established a set of guidelines for sharing data. In these, we highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of giving the statistician all essential experimental information and a record of any pre-processing steps carried out.
