Graph databases are becoming increasingly popular across scientific disciplines, being highly suitable for storing and connecting complex heterogeneous data. In systems biology, they are used as a backend solution for biological data repositories, ontologies, networks, pathways, and knowledge graph databases. In this review, we analyse all publications using or mentioning graph databases retrieved from PubMed and PubMed Central full-text search, focusing on the top 16 available graph databases. Publications are categorized according to their domain and application, focusing on pathway and network biology and relevant ontologies and tools.
Mass spectrometry-based proteomics allows the quantification of thousands of proteins, protein variants, and their modifications, in many biological samples. These are derived from the measurement of peptide relative quantities, and it is not always possible to distinguish proteins with similar sequences due to the absence of protein-specific peptides. In such cases, peptide signals are reported in protein groups that can correspond to several genes.
Proteins are the primary targets of almost all small molecule drugs. However, even the most selectively designed drugs can potentially target several unknown proteins. Identification of potential drug targets can facilitate design of new drugs and repurposing of existing ones.
Motivation: Despite lifestyle factors (LSFs) being increasingly acknowledged in shaping individual health trajectories, particularly in chronic diseases, they have still not been systematically described in the biomedical literature. This is in part because no named entity recognition (NER) system exists that can comprehensively detect all types of LSFs in text. The task is challenging due to their inherent diversity, lack of a comprehensive LSF classification for dictionary-based NER, and lack of a corpus for deep learning-based NER.
Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus.
Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes.
Motivation: Understanding biological processes relies heavily on curated knowledge of physical interactions between proteins. Yet, a notable gap remains between the information stored in databases of curated knowledge and the plethora of interactions documented in the scientific literature.
Results: To bridge this gap, we introduce ComplexTome, a manually annotated corpus designed to facilitate the development of text-mining methods for the extraction of complex formation relationships among biomedical entities targeting the downstream semantics of the physical interaction subnetwork of the STRING database.
In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level.
Bioinformatics
September 2024
Motivation: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly.
TIN-X (Target Importance and Novelty eXplorer) is an interactive visualization tool for illuminating associations between diseases and potential drug targets and is publicly available at newdrugtargets.org. TIN-X uses natural language processing to identify disease and protein mentions within PubMed content using previously published tools for named entity recognition (NER) of gene/protein and disease names.
Imputation techniques provide means to replace missing measurements with a value and are used in almost all downstream analysis of mass spectrometry (MS) based proteomics data using label-free quantification (LFQ). Here we demonstrate how collaborative filtering, denoising autoencoders, and variational autoencoders can impute missing values in the context of LFQ at different levels. We applied our method, proteomics imputation modeling mass spectrometry (PIMMS), to an alcohol-related liver disease (ALD) cohort with blood plasma proteomics data available for 358 individuals.
The rising prevalence of liver diseases related to obesity and excessive use of alcohol is fuelling an increasing demand for accurate biomarkers aimed at community screening, diagnosis of steatohepatitis and significant fibrosis, monitoring, prognostication and prediction of treatment efficacy. Breakthroughs in omics methodologies and the power of bioinformatics have created an excellent opportunity to apply technological advances to clinical needs, for instance in the development of precision biomarkers for personalised medicine. Via omics technologies, biological processes from the genes to circulating protein, as well as the microbiome, including bacteria, viruses and fungi, can be investigated on an axis.
The Knowledge Management Center (KMC) for the Illuminating the Druggable Genome (IDG) project aims to aggregate, update, and articulate protein-centric data knowledge for the entire human proteome, with emphasis on the understudied proteins from the three IDG protein families. KMC collates and analyzes data from over 70 resources to compile the Target Central Resource Database (TCRD), which is the web-based informatics platform (Pharos). These data include experimental, computational, and text-mined information on protein structures, compound interactions, and disease and phenotype associations.
Motivation: Protein networks are commonly used for understanding how proteins interact. However, they are typically biased by data availability, favoring well-studied proteins with more interactions. To uncover functions of understudied proteins, we must use data that are not affected by this literature bias, such as single-cell RNA-seq and proteomics.
Pancreatic cancer is one of the deadliest cancer types with poor treatment options. Better detection of early symptoms and relevant disease correlations could improve pancreatic cancer prognosis. In this retrospective study, we used symptom and disease codes (ICD-10) from the Danish National Patient Registry (NPR) encompassing 6.
The prevailing concept is that gestational alloimmune liver disease (GALD) is caused by maternal antibodies targeting a currently unknown antigen on the liver of the fetus. This leads to deposition of complement on the fetal hepatocytes, their death, and extensive liver injury. In many cases, the newborn dies.
Background: Although the genome of Saccharomyces cerevisiae (S. cerevisiae) was the first one of a eukaryote organism that was fully sequenced (in 1996), a complete understanding of the potential of encoded biomolecular mechanisms has not yet been achieved. Here, we wish to quantify how far the goal of a full list of S.
The purpose of this study was to identify and validate new putative lead drug targets in drug-resistant mesial temporal lobe epilepsy (mTLE) starting from differentially expressed genes (DEGs) previously identified in mTLE in humans by transcriptome analysis. We identified consensus DEGs among two independent mTLE transcriptome datasets and assigned them status as "lead target" if they (1) were involved in neuronal excitability, (2) were new in mTLE, and (3) were druggable. For this, we created a consensus DEG network in STRING and annotated it with information from the DISEASES database and the Target Central Resource Database (TCRD).
Motivation: The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora.
Multiple myeloma (MM) is a neoplasia of B plasma cells that often induces bone pain. However, the mechanisms underlying myeloma-induced bone pain (MIBP) are mostly unknown. Using a syngeneic MM mouse model, we show that periosteal nerve sprouting of calcitonin gene-related peptide (CGRP) and growth associated protein 43 (GAP43) fibers occurs concurrent to the onset of nociception and its blockade provides transient pain relief.
Arena3D is an interactive web tool that visualizes multi-layered networks in 3D space. In this update, Arena3D supports directed networks as well as up to nine different types of connections between pairs of nodes with the use of Bézier curves. It comes with different color schemes (light/gray/dark mode), custom channel coloring, four node clustering algorithms which one can run on-the-fly, visualization in VR mode and predefined layer layouts (zig-zag, star and cube).
Background: Although Escherichia coli (E. coli) is the most studied prokaryote organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E.
The Illuminating the Druggable Genome (IDG) project aims to improve our understanding of understudied proteins and our ability to study them in the context of disease biology by perturbing them with small molecules, biologics, or other therapeutic modalities. Two main products from the IDG effort are the Target Central Resource Database (TCRD) (http://juniper.health.
Hypothesis-free high-throughput profiling allows relative quantification of thousands of proteins or transcripts across samples and thereby identification of differentially expressed genes. It is used in many biological contexts, among others to characterize differences between cell lines and tissues and to identify drug mode of action or drivers of drug resistance. Changes in gene expression can also be due to confounding factors that were not accounted for in the experimental plan, such as change in cell proliferation.