Publications by Jerome Kelleher | LitMetric

Publications by authors named "Jerome Kelleher"

A general and efficient representation of ancestral recombination graphs.

Yan Wong, Anastasia Ignatieva, Jere Koskela, Gregor Gorjanc, Anthony W Wohns, Jerome Kelleher

Genetics

September 2024

As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG.
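To illustrate how genealogical trees can vary along the genome, the following minimal Python sketch stores ancestry as parent-child edges annotated with genome intervals and reads off the local tree at a given position. The `Edge` type, the toy node numbering, and the `local_tree` helper are hypothetical names chosen for illustration; they are not taken from the paper or from any library.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    left: float   # inclusive start of the genome interval
    right: float  # exclusive end of the genome interval
    parent: int
    child: int


# Hypothetical toy data: samples 0, 1, 2 on a genome of length 10.
# Left of position 5, samples 0 and 1 share ancestor 3.
# Right of position 5, samples 1 and 2 share ancestor 4.
# Node 5 is the root in both local trees.
EDGES = [
    Edge(0, 5, 3, 0),
    Edge(0, 5, 3, 1),
    Edge(0, 5, 5, 3),
    Edge(0, 5, 5, 2),
    Edge(5, 10, 4, 1),
    Edge(5, 10, 4, 2),
    Edge(5, 10, 5, 4),
    Edge(5, 10, 5, 0),
]


def local_tree(edges, position):
    """Return a child -> parent map for the tree covering `position`."""
    return {
        e.child: e.parent
        for e in edges
        if e.left <= position < e.right
    }


left_tree = local_tree(EDGES, 2.0)
right_tree = local_tree(EDGES, 7.0)
```

In this toy example sample 1 has parent 3 at position 2.0 and parent 4 at position 7.0, so adjacent regions of the same sequence follow different paths of inheritance.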

Analysis-ready VCF at Biobank scale using Zarr.

Eric Czech, Timothy R Millar, Will Tyler, Tom White, Ben Jeffery, Jerome Kelleher

bioRxiv

November 2024

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF.
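The contrast between row-wise and field-wise access can be sketched in plain Python. This is only a conceptual illustration that uses in-memory lists; it does not use Zarr or any VCF library, and the record contents, the `to_columns` helper, and the `sample_genotypes` helper are hypothetical.

```python
# Row-wise layout: one record per variant, similar in spirit to a VCF line.
# GT holds a made-up allele count per sample (three samples here).
ROWS = [
    {"POS": 101, "QUAL": 50.0, "GT": [0, 1, 1]},
    {"POS": 205, "QUAL": 12.5, "GT": [0, 0, 2]},
    {"POS": 309, "QUAL": 99.0, "GT": [1, 1, 0]},
]


def to_columns(rows):
    """Transpose row-wise records into one list per field."""
    fields = rows[0].keys() if rows else []
    return {name: [row[name] for row in rows] for name in fields}


def sample_genotypes(columns, sample_index):
    """Read one sample's genotypes across all variants from the GT column."""
    return [calls[sample_index] for calls in columns["GT"]]


COLUMNS = to_columns(ROWS)

# With the row-wise layout, reading one field touches every record.
positions_from_rows = [row["POS"] for row in ROWS]
# With the columnar layout, the same field is a single lookup.
positions_from_columns = COLUMNS["POS"]
```

Once the data is held per field, a single field or a single sample can be read without decoding the other fields of each variant, which is the access pattern the abstract describes as inefficient under row-wise encoding.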

tstrait: a quantitative trait simulator for ancestral recombination graphs.

Daiki Tagami, Gertjan Bisschop, Jerome Kelleher

Bioinformatics

June 2024

Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs.
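As a rough idea of what simulating a quantitative trait involves, the sketch below uses a simple additive model: each individual's genetic value is the sum of effect sizes weighted by allele counts, and Gaussian environmental noise is scaled to a target heritability. This is an assumption made for illustration, not tstrait's API or necessarily its model; the genotype matrix, effect sizes, and function names are all hypothetical, and the genotypes are given directly rather than derived from an ARG.

```python
import random
import statistics


def genetic_values(genotypes, effects):
    """Sum effect sizes weighted by allele counts for each individual."""
    return [sum(g * b for g, b in zip(row, effects)) for row in genotypes]


def simulate_phenotypes(genotypes, effects, h2, seed=1):
    """Add Gaussian noise so that Var(G) / (Var(G) + Var(E)) equals h2."""
    if not 0 < h2 <= 1:
        raise ValueError("h2 must be in (0, 1]")
    rng = random.Random(seed)
    values = genetic_values(genotypes, effects)
    var_g = statistics.pvariance(values)
    # Environmental variance implied by the target heritability.
    var_e = var_g * (1 - h2) / h2
    sd_e = var_e ** 0.5
    return [v + rng.gauss(0.0, sd_e) for v in values]


# Hypothetical data: four individuals, three causal sites (allele counts 0-2).
GENOTYPES = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [0, 0, 0],
]
EFFECTS = [0.5, -0.2, 0.1]

values = genetic_values(GENOTYPES, EFFECTS)
phenotypes = simulate_phenotypes(GENOTYPES, EFFECTS, h2=0.5)
```

Setting `h2=1` removes the environmental noise, so the phenotypes equal the genetic values; lower values of `h2` add proportionally more noise.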

Estimating evolutionary and demographic parameters via ARG-derived IBD.

Zhendong Huang, Jerome Kelleher, Yao-Ban Chan, David J Balding

bioRxiv

March 2024

Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data.

tstrait: a quantitative trait simulator for ancestral recombination graphs.

Daiki Tagami, Gertjan Bisschop, Jerome Kelleher

bioRxiv

March 2024

Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs.

link-ancestors: fast simulation of local ancestry with tree sequence software.

Georgia Tsambos, Jerome Kelleher, Peter Ralph, Stephen Leslie, Damjan Vukcevic

Bioinform Adv

November 2023

Summary: It is challenging to simulate realistic tracts of genetic ancestry on a scale suitable for simulation-based inference. We present an algorithm that enables this information to be extracted efficiently from tree sequences produced by simulations run with msprime and SLiM.

Availability And Implementation: A C-based implementation of the link-ancestors algorithm is in tskit (https://tskit.

A general and efficient representation of ancestral recombination graphs.

Yan Wong, Anastasia Ignatieva, Jere Koskela, Gregor Gorjanc, Anthony W Wohns, Jerome Kelleher

bioRxiv

April 2024

As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG.

Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations.

M Elise Lauterbur, Maria Izabel A Cavassim, Ariella L Gladstein, Graham Gower, Nathaniel S Pope, Jerome Kelleher

Elife

June 2023

Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge.

On the genes, genealogies, and geographies of Quebec.

Luke Anderson-Trocmé, Dominic Nelson, Shadi Zabad, Alex Diaz-Papkovich, Ivan Kryukov, Jerome Kelleher

Science

May 2023

Population genetic models only provide coarse representations of real-world ancestry. We used a pedigree compiled from 4 million parish records and genotype data from 2276 French and 20,451 French Canadian individuals to finely model and trace French Canadian ancestry through space and time. The loss of ancestral French population structure and the appearance of spatial and regional structure highlights a wide range of population expansion models.

Demes: a standard format for demographic models.

Graham Gower, Aaron P Ragsdale, Gertjan Bisschop, Ryan N Gutenkunst, Matthew Hartfield, Jerome Kelleher

Genetics

November 2022

Understanding the demographic history of populations is a key goal in population genetics, and with improving methods and data, ever more complex models are being proposed and tested. Demographic models of current interest typically consist of a set of discrete populations, their sizes and growth rates, and continuous and pulse migrations between those populations over a number of epochs, which can require dozens of parameters to fully describe. There is currently no standard format to define such models, significantly hampering progress in the field.

Bayesian inference of ancestral recombination graphs.

Ali Mahmoudi, Jere Koskela, Jerome Kelleher, Yao-Ban Chan, David Balding

PLoS Comput Biol

March 2022

We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters.

A unified genealogy of modern and ancient genomes.

Anthony Wilder Wohns, Yan Wong, Ben Jeffery, Ali Akbari, Swapan Mallick, Jerome Kelleher

Science

February 2022

The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans.

GA4GH: International policies and standards for data sharing across genomic research and healthcare.

Heidi L Rehm, Angela J H Page, Lindsay Smith, Jeremy B Adams, Gil Alterovitz, Jerome Kelleher

Cell Genom

November 2021

The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution.

Efficient ancestry and mutation simulation with msprime 1.0.

Franz Baumdicker, Gertjan Bisschop, Daniel Goldstein, Graham Gower, Aaron P Ragsdale, Jerome Kelleher

Genetics

March 2022

Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.

Lessons Learned from Bugs in Models of Human History.

Aaron P Ragsdale, Dominic Nelson, Simon Gravel, Jerome Kelleher

Am J Hum Genet

October 2020

Simulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation.

A community-maintained standard library of population genetic models.

Jeffrey R Adrion, Christopher B Cole, Noah Dukler, Jared G Galloway, Ariella L Gladstein, Jerome Kelleher

Elife

June 2020

The explosion in population genomic data demands ever more complex modes of analysis, and increasingly, these analyses depend on sophisticated simulations. Recent advances in population genetic simulation have made it possible to simulate large and complex models, but specifying such models for a particular simulation engine remains a difficult and error-prone task. Computational genetics researchers currently re-implement simulation models independently, leading to inconsistency and duplication of effort.

Accounting for long-range correlations in genome-wide simulations of large cohorts.

Dominic Nelson, Jerome Kelleher, Aaron P Ragsdale, Claudia Moreau, Gil McVean

PLoS Genet

May 2020

Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short.

Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes.

Peter Ralph, Kevin Thornton, Jerome Kelleher

Genetics

July 2020

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence.

Coalescent Simulation with msprime.

Jerome Kelleher, Konrad Lohse

Methods Mol Biol

January 2021

Coalescent simulation is a fundamental tool in modern population genetics. The msprime library provides unprecedented scalability in terms of both the simulations that can be performed and the efficiency with which the results can be processed. We show how coalescent models for population structure and demography can be constructed using a simple Python API, as well as how we can process the results of such simulations to efficiently calculate statistics of interest.

Publisher Correction: Inferring whole-genome histories in large population datasets.

Jerome Kelleher, Yan Wong, Anthony W Wohns, Chaimaa Fadil, Patrick K Albers

Nat Genet

November 2019

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Inferring whole-genome histories in large population datasets.

Jerome Kelleher, Yan Wong, Anthony W Wohns, Chaimaa Fadil, Patrick K Albers

Nat Genet

September 2019

Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources.

Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes.

Benjamin C Haller, Jared Galloway, Jerome Kelleher, Philipp W Messer, Peter L Ralph

Mol Ecol Resour

March 2019

There is an increasing demand for evolutionary models to incorporate relatively realistic dynamics, ranging from selection at many genomic sites to complex demography, population structure, and ecological interactions. Such models can generally be implemented as individual-based forward simulations, but the large computational overhead of these models often makes simulation of whole chromosome sequences in large populations infeasible. This situation presents an important obstacle to the field that requires conceptual advances to overcome.

Efficient pedigree recording for fast population genetics simulation.

Jerome Kelleher, Kevin R Thornton, Jaime Ashander, Peter L Ralph

PLoS Comput Biol

November 2018

In this paper we describe how to efficiently record the entire genetic history of a population in forwards-time, individual-based population genetics simulations with arbitrary breeding models, population structure and demography. This approach dramatically reduces the computational burden of tracking individual genomes by allowing us to simulate only those loci that may affect reproduction (those having non-neutral variants). The genetic history of the population is recorded as a succinct tree sequence as introduced in the software package msprime, on which neutral mutations can be quickly placed afterwards.

htsget: a protocol for securely streaming genomic data.

Jerome Kelleher, Mike Lin, C H Albach, Ewan Birney, Robert Davies

Bioinformatics

January 2019

Summary: Standardized interfaces for efficiently accessing high-throughput sequencing data are a fundamental requirement for large-scale genomic data sharing. We have developed htsget, a protocol for secure, efficient and reliable access to sequencing read and variation data. We demonstrate four independent client and server implementations, and the results of a comprehensive interoperability demonstration.

Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes.

Jerome Kelleher, Alison M Etheridge, Gilean McVean

PLoS Comput Biol

May 2016

A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees.
