Front Med (Lausanne)
October 2024
Introduction: Data-driven medicine is essential for enhancing the accessibility and quality of the healthcare system. The availability of data plays a crucial role in achieving this goal.
Methods: We propose implementing a robust data infrastructure of FAIRification and data fusion for clinical, genomic, and imaging data.
Introduction: Epitopes are specific structures in antigens that are recognized by the immune system. They are widely used in immunology-related applications, such as vaccine development, drug design, and the diagnosis, treatment, and prevention of disease. The SARS-CoV-2 virus has represented the main point of interest within the viral and genomic surveillance community in the last four years.
Background: Since the beginning of the COVID-19 pandemic, >1 million studies have been collected within the COVID-19 Open Research Dataset, a corpus of manuscripts created to accelerate research against the disease. Their related abstracts hold a wealth of information that remains largely unexplored and difficult to search due to its unstructured nature. Keyword-based search is the standard approach, which allows users to retrieve the documents of a corpus that contain (all or some of) the words in a target list.
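The keyword-based search described above can be illustrated with a minimal sketch. It assumes a corpus held in memory as a mapping from document identifiers to abstract text; the function names `tokenize` and `keyword_search` and the `match_all` flag are hypothetical and are not taken from the article.

```python
import re
from typing import Dict, Iterable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase a text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(
    corpus: Dict[str, str],
    keywords: Iterable[str],
    match_all: bool = True,
) -> List[str]:
    """Return the ids of documents containing all (or some) of the keywords.

    Assumption: each keyword is a single word; matching is exact on
    lowercased tokens, with no stemming or synonym handling.
    """
    targets = {k.lower() for k in keywords}
    if not targets:
        return []  # an empty target list matches nothing
    hits = []
    for doc_id, text in corpus.items():
        found = targets & tokenize(text)
        if (match_all and found == targets) or (not match_all and found):
            hits.append(doc_id)
    return hits
```

With `match_all=True` a document is returned only if it contains every word in the target list; with `match_all=False` one matching word is enough. Because matching is purely lexical, a document that expresses the same concept in different words is missed, which is the limitation the abstract points to.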
Recombination is a key molecular mechanism for the evolution and adaptation of viruses. The first recombinant SARS-CoV-2 genomes were recognized in 2021; as of today, more than ninety SARS-CoV-2 lineages are designated as recombinant. In the wake of the COVID-19 pandemic, several methods for detecting recombination in SARS-CoV-2 have been proposed; however, none could faithfully confirm manual analyses by experts in the field.
With the progression of the COVID-19 pandemic, large datasets of SARS-CoV-2 genome sequences were collected to closely monitor the evolution of the virus and identify the novel variants/strains. By analyzing genome sequencing data, health authorities can 'hunt' novel emerging variants of SARS-CoV-2 as early as possible, and then monitor their evolution and spread. We designed VariantHunter, a highly flexible and user-friendly tool for systematically monitoring the evolution of SARS-CoV-2 at global and regional levels.
Background: Literature about SARS-CoV-2 widely discusses the effects of variations that have spread in the past 3 years. Such information is dispersed in the texts of several research articles, hindering the possibility of practically integrating it with related datasets (e.g.
Background: SARS-CoV-2 viremia has been found to be a potential prognostic factor in patients hospitalized for COVID-19.
Objective: We aimed to assess the association between SARS-CoV-2 viremia and mortality in COVID-19 hospitalized patients during different epidemic periods.
Methods: A prospective COVID-19 registry was queried to extract all COVID-19 patients with an available SARS-CoV-2 viremia measurement performed at hospital admission between March 2020 and January 2022.
Background: Genome Wide Association Studies (GWAS) are based on the observation of genome-wide sets of genetic variants - typically single-nucleotide polymorphisms (SNPs) - in different individuals that are associated with phenotypic traits. Research efforts have so far been directed to improving GWAS techniques rather than to making the results of GWAS interoperable with other genomic signals; this is currently hindered by the use of heterogeneous formats and uncoordinated experiment descriptions.
Results: To practically facilitate integrative use, we propose to include GWAS datasets within the META-BASE repository, exploiting an integration pipeline previously studied for other genomic datasets that includes several heterogeneous data types in the same format, queryable from the same systems.
BMC Bioinformatics
November 2022
Background: Genomics and virology are unquestionably important, but complex, domains being investigated by a large number of scientists. The need to facilitate and support work within these domains requires sharing of databases, although it is often difficult to do so because of the different ways in which data is represented across the databases. To foster semantic interoperability, models are needed that provide a deep understanding and interpretation of the concepts in a domain, so that the data can be consistently interpreted among researchers.
Background: Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics.
The inflation of SARS-CoV-2 lineages with a high number of accumulated mutations (such as the recent case of Omicron) has raised concerns about the evolutionary capacity of this virus. Here, we propose a computational study to examine non-synonymous mutations gathered within genomes of SARS-CoV-2 from the beginning of the pandemic until February 2022. We provide both qualitative and quantitative descriptions of this corpus, focusing on statistically significant co-occurring and mutually exclusive mutations within single genomes.
Since the beginning of 2020, the COVID-19 pandemic has posed unprecedented challenges to viral data analysis and connected host disease diagnostic methods. We propose VirusLab, a flexible system for analysing SARS-CoV-2 viral sequences and relating them to metadata or clinical information about the host. VirusLab capitalizes on two existing resources: ViruSurf, a database of public SARS-CoV-2 sequences supporting metadata-driven search, and VirusViz, a tool for visual analysis of search results.
The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.
Since the outbreak of the COVID-19 pandemic, many research organizations have studied the genome of the SARS-CoV-2 virus; a body of public resources has been published for monitoring its evolution. While we experience an unprecedented richness of information in this domain, we have also ascertained the presence of several information quality issues. We hereby propose CoV2K, an abstract model for explaining SARS-CoV-2-related concepts and interactions, focusing on viral mutations, their co-occurrence within variants, and their effects.
Motivation: The ongoing evolution of SARS-CoV-2 and the rapid emergence of variants of concern at distinct geographic locations have relevant implications for the implementation of strategies for controlling the COVID-19 pandemic. Combining the growing body of data and the evidence on potential functional implications of SARS-CoV-2 mutations can suggest highly effective methods for the prioritization of novel variants of potential concern, e.g.
The COVID-19 pandemic has heavily affected many aspects of our lives. At this time, genomic research is concerned with exploiting available datasets and knowledge to fuel discovery on this novel disease. Studies that can precisely characterize the gene expression profiles of human hosts infected by SARS-CoV-2 are of significant relevance.
Since its emergence in late 2019, the diffusion of SARS-CoV-2 has been associated with the evolution of its viral genome. The co-occurrence of specific amino acid changes, collectively named 'virus variant', requires scrutiny (as variants may hugely impact the agent's transmission, pathogenesis, or antigenicity); variant evolution is studied using phylogenetics. Yet, never has this problem been tackled by digging into data with ad hoc analysis techniques.
EpiSurf is a Web application for selecting viral populations of interest and then analyzing how their amino acid changes are distributed along epitopes. Viral sequences are searched within ViruSurf, which stores curated metadata and amino acid changes imported from the most widely used deposition sources for viral databases (GenBank, COVID-19 Genomics UK (COG-UK) and Global initiative on sharing all influenza data (GISAID)). Epitopes are searched within the open source Immune Epitope Database or directly proposed by users by indicating their start and stop positions in the context of a given viral protein.
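As a rough illustration of relating amino acid changes to a user-defined epitope, the sketch below filters changes by protein and by an inclusive start/stop range, then counts how often each position is hit across a set of sequences. This is not EpiSurf's implementation: the class names, function names, and data layout are assumptions, and any epitope coordinates passed to it are supplied by the caller.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AminoAcidChange:
    protein: str
    position: int      # 1-based position within the protein
    reference: str     # residue in the reference sequence
    alternative: str   # residue observed in the sample


@dataclass(frozen=True)
class Epitope:
    protein: str
    start: int         # 1-based, inclusive
    stop: int          # 1-based, inclusive


def changes_in_epitope(
    changes: Iterable[AminoAcidChange], epitope: Epitope
) -> List[AminoAcidChange]:
    """Keep only changes on the epitope's protein and inside its range."""
    return [
        c for c in changes
        if c.protein == epitope.protein
        and epitope.start <= c.position <= epitope.stop
    ]


def position_histogram(
    sequences: Iterable[Iterable[AminoAcidChange]], epitope: Epitope
) -> Counter:
    """Count, per position, the changes across sequences that fall in the epitope."""
    counts: Counter = Counter()
    for changes in sequences:
        for change in changes_in_epitope(changes, epitope):
            counts[change.position] += 1
    return counts
```

Each element of `sequences` is the list of amino acid changes called for one viral sequence; the resulting histogram shows which epitope positions are most frequently altered in the selected population.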
Variant visualization plays an important role in supporting viral evolution analysis, which has been extremely valuable during the COVID-19 pandemic. VirusViz is a web-based application for comparing variants of selected viral populations and their sub-populations; it is primarily focused on SARS-CoV-2 variants, although the tool also supports other viral species (SARS-CoV, MERS-CoV, Dengue, Ebola). As input, VirusViz imports results of queries extracting variants and metadata from the large database ViruSurf, which integrates information about most SARS-CoV-2 sequences publicly deposited worldwide.
With the outbreak of the COVID-19 disease, the research community is making unprecedented efforts to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV-2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of the COVID-19 pandemic, by emphasizing specific SARS-CoV-2 data and services.
ViruSurf, available at http://gmql.eu/virusurf/, is a large public database of viral sequences and integrated and curated metadata from heterogeneous sources (RefSeq, GenBank, COG-UK and NMDC); it also exposes computed nucleotide and amino acid variants, called from original sequences. A GISAID-specific ViruSurf database, available at http://gmql.
IEEE/ACM Trans Comput Biol Bioinform
February 2022
The integration of genomic metadata is, at the same time, an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research; combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies.
Thousands of new experimental datasets are becoming available every day; in many cases, they are produced within the scope of large cooperative efforts, involving a variety of laboratories spread all over the world, and typically open for public use. Although the potential collective amount of available information is huge, the effective combination of such public sources is hindered by data heterogeneity, as the datasets exhibit a wide variety of notations and formats, concerning both experimental values and metadata. Thus, data integration is becoming a fundamental activity, to be performed prior to data analysis and biological knowledge discovery, consisting of subsequent steps of data extraction, normalization, matching and enrichment; once applied to heterogeneous data sources, it builds multiple perspectives over the genome, leading to the identification of meaningful relationships that could not be perceived by using incompatible data formats.