Publications by Sofia H | LitMetric

Publications by authors named "Sofia H"

A draft human pangenome reference.

Wen-Wei Liao Mobin Asri Jana Ebler Daniel Doerr Marina Haukness

Nature

May 2023

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels.

FAVOR: functional annotation of variants online resource and annotator for variation across the human genome.

Hufeng Zhou Theodore Arapoglou Xihao Li Zilin Li Xiuwen Zheng

Nucleic Acids Res

January 2023

Large biobank-scale whole genome sequencing (WGS) studies are rapidly identifying a multitude of coding and non-coding variants. They provide an unprecedented resource for illuminating the genetic basis of human diseases. Variant functional annotations play a critical role in WGS analysis, result interpretation, and prioritization of disease- or trait-associated causal variants.

The Human Pangenome Project: a global resource to map genomic diversity.

Ting Wang Lucinda Antonacci-Fulton Kerstin Howe Heather A Lawson Julian K Lucas

Nature

April 2022

The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation.

GA4GH: International policies and standards for data sharing across genomic research and healthcare.

Heidi L Rehm Angela J H Page Lindsay Smith Jeremy B Adams Gil Alterovitz

Cell Genom

November 2021

The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution.

The Data Use Ontology to streamline responsible access to human biomedical datasets.

Jonathan Lawson Moran N Cabili Giselle Kerry Tiffany Boughtwood Adrian Thorogood

Cell Genom

November 2021

Human biomedical datasets that are critical for research and clinical studies to benefit human health also often contain sensitive or potentially identifying information of individual participants. Thus, care must be taken when they are processed and made available to comply with ethical and regulatory frameworks and informed consent data conditions. To enable and streamline data access for these biomedical datasets, the Global Alliance for Genomics and Health (GA4GH) Data Use and Researcher Identities (DURI) work stream developed and approved the Data Use Ontology (DUO) standard.

Ultrafast homomorphic encryption models enable secure outsourcing of genotype imputation.

Miran Kim Arif Ozgun Harmanci Jean-Philippe Bossuat Sergiu Carpov Jung Hee Cheon

Cell Syst

November 2021

Genotype imputation is a fundamental step in genomic data analysis, where missing variant genotypes are predicted using the existing genotypes of nearby "tag" variants. Although researchers can outsource genotype imputation, privacy concerns may prohibit genetic data sharing with an untrusted imputation service. Here, we developed secure genotype imputation using efficient homomorphic encryption (HE) techniques.

SCOR: A secure international informatics infrastructure to investigate COVID-19.

J L Raisaro Francesco Marino Juan Troncoso-Pastoriza Raphaelle Beau-Lejdstrom Riccardo Bellazzi

J Am Med Inform Assoc

November 2020

Global pandemics call for large and diverse healthcare data to study various risk factors, treatment options, and disease progression patterns. Despite the enormous efforts of many large data consortium initiatives, the scientific community still lacks a secure and privacy-preserving infrastructure to support auditable data sharing and facilitate automated and legally compliant federated analysis on an international scale. Existing health informatics systems do not incorporate the latest progress in modern security and federated machine learning algorithms, which are poised to offer solutions.

iDASH secure genome analysis competition 2018: blockchain genomic data access logging, homomorphic encryption on GWAS, and DNA segment searching.

Tsung-Ting Kuo Xiaoqian Jiang Haixu Tang XiaoFeng Wang Tyler Bath

BMC Med Genomics

July 2020

Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines.

Kyle Ellrott Matthew H Bailey Gordon Saksena Kyle R Covington Cyriac Kandoth

Cell Syst

March 2018

The Cancer Genome Atlas (TCGA) cancer genomics dataset includes over 10,000 tumor-normal exome pairs across 33 different cancer types, in total >400 TB of raw data files requiring analysis. Here we describe the Multi-Center Mutation Calling in Multiple Cancers project, our effort to generate a comprehensive encyclopedia of somatic mutation calls for the TCGA data to enable robust cross-tumor-type analyses. Our approach accounts for variance and batch effects introduced by the rapid advancement of DNA extraction, hybridization-capture, sequencing, and analysis methods over time.

Simplifying research access to genomics and health data with Library Cards.

Moran N Cabili Knox Carey Stephanie O M Dyke Anthony J Brookes Marc Fiume

Sci Data

March 2018

The volume of genomics and health data is growing rapidly, driven by sequencing for both research and clinical use. However, under current practices, the data is fragmented into many distinct datasets, and researchers must go through a separate application process for each dataset. This is time-consuming both for the researchers and the data stewards, and it reduces the velocity of research and new discoveries that could improve human health.

A community effort to protect genomic data sharing, collaboration and outsourcing.

Shuang Wang Xiaoqian Jiang Haixu Tang Xiaofeng Wang Diyue Bu

NPJ Genom Med

October 2017

The human genome can reveal sensitive information and is potentially re-identifiable, which raises privacy and security concerns about sharing such data on wide scales. In 2016, we organized the third Critical Assessment of Data Privacy and Protection competition as a community effort to bring together biomedical informaticists, computer privacy and security researchers, and scholars in ethical, legal, and social implications (ELSI) to assess the latest advances on privacy-preserving techniques for protecting human genomic data. Teams were asked to develop novel protection methods for emerging genome privacy challenges in three scenarios: Track (1) data sharing through the Beacon service of the Global Alliance for Genomics and Health.

Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks.

Jean Louis Raisaro Florian Tramèr Zhanglong Ji Diyue Bu Yongan Zhao

J Am Med Inform Assoc

July 2017

The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project as a means of testing the willingness of data holders to share genetic data in the simplest technical context: a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or "beacon") is responsible for assuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertains and in accordance with the GA4GH policy and standards. While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH adjudged the risk of re-identification based on the binary yes/no allele-presence query responses as acceptable.

Protecting genomic data analytics in the cloud: state of the art and opportunities.

Haixu Tang Xiaoqian Jiang Xiaofeng Wang Shuang Wang Heidi Sofia

BMC Med Genomics

October 2016

The outsourcing of genomic data into public cloud computing settings raises concerns over privacy and security. Significant advancements in secure computation methods have emerged over the past several years, but such techniques need to be rigorously evaluated for their ability to support the analysis of human genomic data in an efficient and cost-effective manner. With respect to public cloud environments, there are concerns about the inadvertent exposure of human genomic data to unauthorized users.

Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma.

N Engl J Med

January 2016

Background: Papillary renal-cell carcinoma, which accounts for 15 to 20% of renal-cell carcinomas, is a heterogeneous disease that consists of various types of renal cancer, including tumors with indolent, multifocal presentation and solitary tumors with an aggressive, highly lethal phenotype. Little is known about the genetic basis of sporadic papillary renal-cell carcinoma, and no effective forms of therapy for advanced disease exist.

Methods: We performed comprehensive molecular characterization of 161 primary papillary renal-cell carcinomas, using whole-exome sequencing, copy-number analysis, messenger RNA and microRNA sequencing, DNA-methylation analysis, and proteomic analysis.

Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas.

N Engl J Med

June 2015

Background: Diffuse low-grade and intermediate-grade gliomas (which together make up the lower-grade gliomas, World Health Organization grades II and III) have highly variable clinical behavior that is not adequately predicted on the basis of histologic class. Some are indolent; others quickly progress to glioblastoma. The uncertainty is compounded by interobserver variability in histologic diagnosis.

Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia.

N Engl J Med

May 2013

Background: Many mutations that contribute to the pathogenesis of acute myeloid leukemia (AML) are undefined. The relationships between patterns of mutations and epigenetic phenotypes are not yet clear.

Methods: We analyzed the genomes of 200 clinically annotated adult cases of de novo AML, using either whole-genome sequencing (50 cases) or whole-exome sequencing (150 cases), along with RNA and microRNA sequencing and DNA-methylation analysis.

The third pillar of bacterial signal transduction: classification of the extracytoplasmic function (ECF) sigma factor protein family.

Anna Staroń Heidi J Sofia Sascha Dietrich Luke E Ulrich Heiko Liesegang

Mol Microbiol

November 2009

The ability of a bacterial cell to monitor and adaptively respond to its environment is crucial for survival. After one- and two-component systems, extracytoplasmic function (ECF) sigma factors, the largest group of alternative sigma factors, represent the third fundamental mechanism of bacterial signal transduction, with about six such regulators on average per bacterial genome. Together with their cognate anti-sigma factors, they represent a highly modular design that primarily facilitates transmembrane signal transduction.

Backbone 1H, 13C, and 15N NMR assignments for the Cyanothece 51142 protein cce_0567: a protein associated with nitrogen fixation in the DUF683 family.

Garry W Buchko Heidi J Sofia

Biomol NMR Assign

June 2008

Article Synopsis

Cyanothece 51142 has a 78-residue protein called cce_0567, which is linked to nitrogen fixation and belongs to the DUF683 protein family.
The study describes the resonance assignments for the main chain and 13C(beta) side chain of this protein.
The protein is characterized as a homo-tetramer with an approximate molecular weight of 40 kDa.

BioGraphE: high-performance bionetwork analysis using the Biological Graph Environment.

George Chin Daniel G Chavarria Grant C Nakamura Heidi J Sofia

BMC Bioinformatics

May 2008

Background: Graphs and networks are common analysis representations for biological systems. Many traditional graph algorithms such as k-clique, k-coloring, and subgraph matching have great potential as analysis techniques for newly available data in biology. Yet, as the amount of genomic and bionetwork information rapidly grows, scientists need advanced new computational strategies and tools for dealing with the complexities of the bionetwork analysis and the volume of the data.

A conserved structural module regulates transcriptional responses to diverse stress signals in bacteria.

Elizabeth A Campbell Roger Greenwell Jennifer R Anthony Sheng Wang Lionel Lim

Mol Cell

September 2007

A transcriptional response to singlet oxygen in Rhodobacter sphaeroides is controlled by the group IV sigma factor sigma(E) and its cognate anti-sigma ChrR. Crystal structures of the sigma(E)/ChrR complex reveal a modular, two-domain architecture for ChrR. The ChrR N-terminal anti-sigma domain (ASD) binds a Zn(2+) ion, contacts sigma(E), and is sufficient to inhibit sigma(E)-dependent transcription.

Phylogeny of the bacterial superfamily of Crp-Fnr transcription regulators: exploiting the metabolic spectrum by controlling alternative gene programs.

Heinz Körner Heidi J Sofia Walter G Zumft

FEMS Microbiol Rev

December 2003

The Crp-Fnr regulators, named after the first two identified members, are DNA-binding proteins which predominantly function as positive transcription factors, though roles of repressors are also important. Among over 1200 proteins with an N-terminally located nucleotide-binding domain similar to the cyclic adenosine monophosphate (cAMP) receptor protein, the distinctive additional trait of the Crp-Fnr superfamily is a C-terminally located helix-turn-helix motif for DNA binding. From a curated database of 369 family members exhibiting both features, we provide a protein tree of Crp-Fnr proteins according to their phylogenetic relationships.

beta-Amyloid peptide-induced apoptosis regulated by a novel protein containing a G protein activation module.

E M Kajkowski C F Lo X Ning S Walker H J Sofia

J Biol Chem

June 2001

Degeneration of neurons in Alzheimer's disease is mediated by beta-amyloid peptide by diverse mechanisms, which include a putative apoptotic component stimulated by unidentified signaling events. This report describes a novel beta-amyloid peptide-binding protein (denoted BBP) containing a G protein-coupling module. BBP is one member of a family of three proteins containing this conserved structure.

Radical SAM, a novel protein superfamily linking unresolved steps in familiar biosynthetic pathways with radical mechanisms: functional characterization using new analysis and information visualization methods.

H J Sofia G Chen B G Hetzler J F Reyes-Spindola N E Miller

Nucleic Acids Res

March 2001

A novel protein superfamily with over 600 members was discovered by iterative profile searches and analyzed with powerful bioinformatics and information visualization methods. Evidence exists that these proteins generate a radical species by reductive cleavage of S-adenosylmethionine (SAM) through an unusual Fe-S center. The superfamily (named here Radical SAM) provides evidence that radical-based catalysis is important in a number of previously well-studied but unresolved biochemical pathways and reflects an ancient conserved mechanistic approach to difficult chemistries.

The complete DNA sequence and analysis of the large virulence plasmid of Escherichia coli O157:H7.

V Burland Y Shao N T Perna G Plunkett H J Sofia

Nucleic Acids Res

September 1998

The complete DNA sequence of pO157, the large virulence plasmid of EHEC strain O157:H7 EDL 933, is presented. The 92 kb F-like plasmid is composed of segments of putative virulence genes in a framework of replication and maintenance regions, with seven insertion sequence elements, located mostly at the boundaries of the virulence segments. One hundred open reading frames (ORFs) were identified, of which 19 were previously sequenced potential virulence genes.

Analysis of the Escherichia coli genome VI: DNA sequence of the region from 92.8 through 100 minutes.

V Burland G Plunkett H J Sofia D L Daniels F R Blattner

Nucleic Acids Res

June 1995

The 338.5 kb of the Escherichia coli genome described here together with previously described segments bring the total of contiguous finished sequence of this genome to > 1 Mb. Of 319 open reading frames (ORFs) found in this 338.
