In the intricate landscape of healthcare analytics, effective feature selection is a prerequisite for generating robust predictive models, especially given the common challenges of sample sizes and potential biases. Zoish uniquely addresses these issues by employing Shapley additive values, an idea rooted in cooperative game theory, to enable both transparent and automated feature selection. Unlike existing tools, Zoish is versatile, designed to seamlessly integrate with an array of machine learning libraries including scikit-learn, XGBoost, CatBoost, and imbalanced-learn.
We have developed a machine learning (ML) approach using Gaussian process (GP)-based spatial covariance (SCV) to track the impact of spatial-temporal mutational events driving host-pathogen balance in biology. We show how SCV can be applied to understanding the response of evolving covariant relationships linking the variant pattern of virus spread to pathology for the entire SARS-CoV-2 genome on a daily basis. We show that GP-based SCV relationships in conjunction with genome-wide co-occurrence analysis provide an early warning anomaly detection (EWAD) system for the emergence of variants of concern (VOCs).
Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This results in computational resource and privacy-risk barriers to access to cutting-edge imputation techniques.
J Clin Endocrinol Metab
November 2022
Context: Aberrant biosynthesis and secretion of the insulin precursor proinsulin occurs in both type I and type II diabetes. Inflammatory cytokines are implicated in pancreatic islet stress and dysfunction in both forms of diabetes, but the mechanisms remain unclear.
Objective: We sought to determine the effect of the diabetes-associated cytokines on proinsulin folding, trafficking, secretion, and β-cell function.
The β-cell protein synthetic machinery is dedicated to the production of mature insulin, which requires the proper folding and trafficking of its precursor, proinsulin. The complete network of proteins that mediate proinsulin folding and advancement through the secretory pathway, however, remains poorly defined. Here we used affinity purification and mass spectrometry to identify, for the first time, the proinsulin biosynthetic interaction network in human islets.
To understand the impact of epigenetics on human misfolding disease, we apply Gaussian-process regression (GPR) based machine learning (ML) (GPR-ML) through variation spatial profiling (VSP). VSP generates population-based matrices describing the spatial covariance (SCV) relationships that link genetic diversity to fitness of the individual in response to histone deacetylases inhibitors (HDACi). Niemann-Pick C1 (NPC1) is a Mendelian disorder caused by >300 variants in the NPC1 gene that disrupt cholesterol homeostasis leading to the rapid onset and progression of neurodegenerative disease.
To date there has not been a study directly comparing relative Igκ rearrangement frequencies obtained from genomic DNA (gDNA) and cDNA and since each approach has potential biases, this is an important issue to clarify. Here we used deep sequencing to compare the unbiased gDNA and RNA Igκ repertoire from the same pre-B cell pool. We find that ~20% of Vκ genes have rearrangement frequencies ≥2-fold up or down in RNA vs.
Inherited and somatic rare diseases result from >200,000 genetic variants leading to loss- or gain-of-toxic function, often caused by protein misfolding. Many of these misfolded variants fail to properly interact with other proteins. Understanding the link between factors mediating the transcription, translation, and protein folding of these disease-associated variants remains a major challenge in cell biology.
The advent of precision medicine for genetic diseases has been hampered by the large number of variants that cause familial and somatic disease, a complexity that is further confounded by the impact of genetic modifiers. To begin to understand differences in onset, progression and therapeutic response that exist among disease-causing variants, we present the proteomic variant approach (ProVarA), a proteomic method that integrates mass spectrometry with genomic tools to dissect the etiology of disease. To illustrate its value, we examined the impact of variation in cystic fibrosis (CF), where 2025 disease-associated mutations in the CF transmembrane conductance regulator (CFTR) gene have been annotated and where individual genotypes exhibit phenotypic heterogeneity and response to therapeutic intervention.
CCCTC-binding factor (CTCF) is largely responsible for the 3D architecture of the genome, in concert with the action of cohesin, through the creation of long-range chromatin loops. Cohesin is hypothesized to be the main driver of these long-range chromatin interactions by the process of loop extrusion. Here, we performed ChIP-seq for CTCF and cohesin in two stages each of T and B cell differentiation and examined the binding pattern in all six antigen receptor (AgR) loci in these lymphocyte progenitors and in mature T and B cells, ES cells, and fibroblasts.
Proc Natl Acad Sci U S A
July 2016
Ying Yang 1 (YY1) is a ubiquitously expressed transcription factor shown to be essential for pro-B-cell development. However, the role of YY1 in other B-cell populations has never been investigated. Recent bioinformatics analysis data have implicated YY1 in the germinal center (GC) B-cell transcriptional program.
SRC kinase is activated in castration resistant prostate cancer (CRPC), phosphorylates the androgen receptor (AR), and causes its ligand-independent activation as a transcription factor. However, activating SRC mutations are exceedingly rare in human tumors, and mechanisms of ectopic SRC activation therefore remain largely unknown. Performing a functional genomics screen, we found that downregulation of SRC inhibitory kinase CSK is sufficient to overcome growth arrest induced by depriving human prostate cancer cells of androgen.
Background: Molecular signatures for predicting breast cancer prognosis could greatly improve care through personalization of treatment. Computational analyses of genome-wide expression datasets have identified such signatures, but these signatures leave much to be desired in terms of accuracy, reproducibility, and biological interpretability. Methods that take advantage of structured prior knowledge (eg, protein interaction networks) show promise in helping to define better signatures, but most knowledge remains unstructured.
Motivation: Omics Pipe (http://sulab.scripps.edu/omicspipe) is a computational framework that automates multi-omics data analysis pipelines on high performance compute clusters and in the cloud.
Benzo[a]pyrene (B[a]P) is an environmental contaminant mainly studied for its toxic/carcinogenic effects. For a comprehensive and pathway orientated mechanistic understanding of the effects directly triggered by a toxic (5 μM) or a subtoxic (50 nM) concentration of B[a]P or indirectly by its metabolites, we conducted time series experiments for up to 24 h to study the effects in murine hepatocytes. These cells rapidly take up and actively metabolize B[a]P, which was followed by quantitative analysis of the concentration of intracellular B[a]P and seven representative degradation products.
Structured gene annotations are a foundation upon which many bioinformatics and statistical analyses are built. However, the structured annotations available in public databases are a sparse representation of biological knowledge as a whole. The rate of biomedical data generation is such that centralized biocuration efforts struggle to keep up.
The primary antigen receptor repertoire is sculpted by the process of V(D)J recombination, which must strike a balance between diversification and favoring gene segments with specialized functions. The precise determinants of how often gene segments are chosen to complete variable region coding exons remain elusive. We quantified Vβ use in the preselection Tcrb repertoire and report relative contributions of 13 distinct features that may shape their recombination efficiencies, including transcription, chromatin environment, spatial proximity to their DβJβ targets, and predicted quality of recombination signal sequences (RSSs).
A diverse Ab repertoire is formed through the rearrangement of V, D, and J segments at the IgH (Igh) loci. The C57BL/6 murine Igh locus has >100 functional VH gene segments that can recombine to a rearranged DJH. Although the nonrandom usage of VH genes is well documented, it is not clear what elements determine recombination frequency.
Background: The Gene Ontology and its associated annotations are critical tools for interpreting lists of genes. Here, we introduce a method for evaluating the Gene Ontology annotations and structure based on the impact they have on gene set enrichment analysis, along with an example implementation. This task-based approach yields quantitative assessments grounded in experimental data and anchored tightly to the primary use of the annotations.
We report a high quality and system-wide proteome catalogue covering 71% (3,542 proteins) of the predicted genes of fission yeast, Schizosaccharomyces pombe, presenting the largest protein dataset to date for this important model organism. We obtained this high proteome and peptide (11.4 peptides/protein) coverage by a combination of extensive sample fractionation, high resolution Orbitrap mass spectrometry, and combined database searching using the iProphet software as part of the Trans-Proteomics Pipeline.
Background: A variety of topic-focused wikis are used in the biomedical sciences to enable the mass-collaborative synthesis and distribution of diverse bodies of knowledge. To address complex problems such as defining the relationships between genes and disease, it is important to bring the knowledge from many different domains together. Here we show how advances in wiki technology and natural language processing can be used to automatically assemble 'meta-wikis' that present integrated views over the data collaboratively created in multiple source wikis.
Wikipedia is increasingly used as a platform for collaborative data curation, but its current technical implementation has significant limitations that hinder its use in biocuration applications. Specifically, while editors can easily link between two articles in Wikipedia to indicate a relationship, there is no way to indicate the nature of that relationship in a way that is computationally accessible to the system or to external developers. For example, in addition to noting a relationship between a gene and a disease, it would be useful to differentiate the cases where genetic mutation or altered expression causes the disease.
The study of expression quantitative trait loci (eQTL) is a powerful way of detecting transcriptional regulators at a genomic scale and for elucidating how natural genetic variation impacts gene expression. Power and genetic resolution are heavily affected by the study population: whereas recombinant inbred (RI) strains yield greater statistical power with low genetic resolution, using diverse inbred or outbred strains improves genetic resolution at the cost of lower power. In order to overcome the limitations of both individual approaches, we combine data from RI strains with genetically more diverse strains and analyze hippocampus eQTL data obtained from mouse RI strains (BXD) and from a panel of diverse inbred strains (Mouse Diversity Panel, MDP).
Analysis of expression quantitative trait loci (eQTL) provides a means for detecting transcriptional regulatory relationships at a genome-wide scale. Here we explain the eQTL analysis pipeline, we introduce publicly available tools for the statistical analysis, and we discuss issues that might complicate the eQTL mapping process. The detection and interpretation of eQTL requires careful consideration of a range of potentially confounding effects.
Helicobacter pylori produces a heat shock protein A (HspA) that is unique to this bacterium. While the first 91 residues (domain A) of the protein are similar to GroES, the last 26 (domain B) are unique to HspA. Domain B contains eight histidines and four cysteines and was suggested to bind nickel.