Publications by authors named "Idoia Ochoa"

Whole-tissue transcriptomic analyses have helped characterize molecular subtypes of hepatocellular carcinoma (HCC). Metabolic subtypes of human HCC have been defined, yet whether these metabolic classes are clinically relevant or translate into actionable cancer vulnerabilities remains an open question. Publicly available gene sets or gene signatures have been used to infer functional changes through gene set enrichment methods.

While myelodysplastic syndromes with del(5q) (del(5q) MDS) comprise a well-defined hematological subgroup, the molecular basis underlying their origin remains unknown. Using single-cell RNA-seq (scRNA-seq) on CD34+ progenitors from del(5q) MDS patients, we have identified cells harboring the deletion, characterizing the transcriptional impact of this genetic insult on disease pathogenesis and treatment response. Interestingly, both del(5q) and non-del(5q) cells present similar transcriptional lesions, indicating that all cells, and not only those harboring the deletion, may contribute to aberrant hematopoietic differentiation.

Motivation: The identification of minimal genetic interventions that modulate metabolic processes constitutes one of the most relevant applications of genome-scale metabolic models (GEMs). The concept of Minimal Cut Sets (MCSs) and its extension at the gene level, genetic Minimal Cut Sets (gMCSs), have attracted increasing interest in the field of Systems Biology to address this task. Different computational tools have been developed to calculate MCSs and gMCSs using both commercial and open-source software.

Grouping gene expression into gene set activity scores (GSAS) provides better biological insights than studying individual genes. However, existing gene set projection methods cannot return representative, robust, and interpretable GSAS. We developed NetActivity, a machine learning framework that generates GSAS based on a sparsely-connected autoencoder, where each neuron in the inner layer represents a gene set.
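
The idea of a sparsely connected layer in which each inner neuron stands for one gene set can be illustrated with a small sketch. This is not the NetActivity implementation: the function names (`build_mask`, `gene_set_scores`), the toy gene names, the fixed weights, and the `tanh` activation are all assumptions chosen for illustration. The only point shown is that a binary membership mask keeps genes outside a set from contributing to that set's score.

```python
import math

def build_mask(gene_names, gene_sets):
    """Binary mask with one row per gene and one column per gene set.

    An entry is 1.0 when the gene belongs to the set, else 0.0.
    """
    return [
        [1.0 if gene in members else 0.0 for members in gene_sets.values()]
        for gene in gene_names
    ]

def gene_set_scores(sample, weights, mask):
    """Compute one activity score per gene set for a single sample.

    Weights are multiplied by the mask, so genes outside a set
    cannot influence that set's score.
    """
    n_sets = len(mask[0])
    scores = []
    for j in range(n_sets):
        total = sum(x * weights[i][j] * mask[i][j] for i, x in enumerate(sample))
        scores.append(math.tanh(total))  # hypothetical choice of activation
    return scores

# Hypothetical toy data: four genes, two non-overlapping gene sets.
genes = ["G1", "G2", "G3", "G4"]
gene_sets = {"set_a": ["G1", "G2"], "set_b": ["G3", "G4"]}
mask = build_mask(genes, gene_sets)

# Illustrative fixed weights; a real model would learn these during training.
weights = [[0.5, 0.3], [-0.2, 0.8], [0.7, 0.1], [0.4, -0.6]]
sample = [1.0, 2.0, 0.5, -1.0]
scores = gene_set_scores(sample, weights, mask)
```

Because each score depends only on the genes in its set, a score can be read as the activity of that set, which is the interpretability property the abstract refers to.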

Article Synopsis
  • Tobacco significantly increases the risk of lung cancer, but some heavy smokers either develop it early or remain illness-free for many years, indicating a variability in susceptibility to cancer.
  • Researchers analyzed the genetic profiles of heavy smokers who either developed lung adenocarcinoma at a young age or did not develop it at an old age using Whole Exome Sequencing and Machine Learning to identify genetic variants linked to these extreme phenotypes.
  • The study validated multiple genetic variants and found that the gene HLA-A had the most variants associated with lower lung cancer risk, achieving a notable prediction accuracy with machine learning models, suggesting potential pathways for further research into lung cancer prevention.
Article Synopsis
  • Drug-target interaction (DTI) prediction is crucial for drug repurposing but is currently hindered by expensive computational methods and poor generalization to new datasets.
  • The paper introduces GeNNius, a Graph Neural Network-based method that not only enhances accuracy and efficiency in DTI prediction but also uncovers new drug-target interactions.
  • GeNNius maintains biological relevance in its data representation and is available for public use on GitHub, promising improvements in DTI prediction capabilities.
Purpose: To assess the suitability of machine learning (ML) techniques for predicting the development of fibrosis and atrophy in patients with neovascular age-related macular degeneration (nAMD) receiving anti-VEGF treatment over a 36-month period.

Methods: An extensive analysis was conducted on the use of ML to predict fibrosis and atrophy development in nAMD patients at 36 months from the start of anti-VEGF treatment, using only data from the first 12 months. We use data collected according to real-world practice, which includes clinical and genetic factors.

Motivation: Single-nucleotide variants (SNVs) are the most common type of genetic variation in the human genome. Accurate and efficient detection of SNVs from next-generation sequencing (NGS) data is essential for various applications in genomics and personalized medicine. However, SNV calling methods usually suffer from high computational complexity and limited accuracy.

Emerging ultra-low coverage single-cell DNA sequencing (scDNA-seq) technologies have enabled high resolution evolutionary studies of copy number aberrations (CNAs) within tumors. While these sequencing technologies are well suited for identifying CNAs due to the uniformity of sequencing coverage, the sparsity of coverage poses challenges for the study of single-nucleotide variants (SNVs). In order to maximize the utility of increasingly available ultra-low coverage scDNA-seq data and obtain a comprehensive understanding of tumor evolution, it is important to also analyze the evolution of SNVs from the same set of tumor cells.

Motivation: The use of high precision for representing quality scores in nanopore sequencing data makes these scores hard to compress and, thus, responsible for most of the information stored in losslessly compressed FASTQ files. This motivates the investigation of the effect of quality score information loss on downstream analysis from nanopore sequencing FASTQ files.

Results: We polished assemblies for a mock microbial community and a human genome, and we called variants on a human genome.
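
One simple way to discard quality score information is uniform binning, sketched below. This is a generic illustration and not necessarily the scheme evaluated in the paper: the function name `quantize_quality`, the bin width of 8, and the Phred+33 offset are assumptions made for the example.

```python
def quantize_quality(quality, step=8, offset=33):
    """Map each Phred score to the lower edge of its bin of width `step`.

    `quality` is a FASTQ quality string, assumed here to be Phred+33 encoded.
    Fewer distinct symbols generally make the string easier to compress,
    at the cost of losing resolution in the scores.
    """
    binned = []
    for symbol in quality:
        score = ord(symbol) - offset          # decode the Phred score
        lowered = (score // step) * step      # snap to the bin's lower edge
        binned.append(chr(lowered + offset))  # re-encode as a FASTQ symbol
    return "".join(binned)

# Hypothetical quality string with Phred scores 0, 10, 20, 30, 40.
original = "!+5?I"
lossy = quantize_quality(original)
```

The read length is unchanged, so the output remains a valid quality line for the same read; only the alphabet of scores shrinks.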

Single-cell RNA-Sequencing has the potential to provide deep biological insights by revealing complex regulatory interactions across diverse cell phenotypes at single-cell resolution. However, current single-cell gene regulatory network inference methods produce a single regulatory network per input dataset, limiting their capability to uncover complex regulatory relationships across related cell phenotypes. We present SimiC, a single-cell gene regulatory inference framework that overcomes this limitation by jointly inferring distinct, but related, gene regulatory dynamics per phenotype.

Motivation: An important step in the transcriptomic analysis of individual cells involves manually determining the cellular identities. To ease this labor-intensive annotation of cell-types, there has been a growing interest in automated cell annotation, which can be achieved by training classification algorithms on previously annotated datasets. Existing pipelines employ dataset integration methods to remove potential batch effects between source (annotated) and target (unannotated) datasets.

Motivation: Mass spectrometry (MS) data, used for proteomics and metabolomics analyses, have seen considerable growth in recent years. To reduce the associated storage costs, dedicated compression algorithms for MS data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy that exists across scans contained in a single file.

Motivation: Nanopore sequencing technologies are rapidly gaining popularity, in part, due to the massive amounts of genomic data they produce in short periods of time (up to 8.5 TB of data in <72 h). To reduce the costs of transmission and storage, efficient compression methods for this type of data are needed.

Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data.

The amount of sequencing data is growing at a fast pace due to rapid advances in sequencing technologies. Quality scores, which indicate the reliability of each of the called nucleotides, take up a significant portion of the sequencing data. In addition, quality scores are more challenging to compress than nucleotides, and they are often noisy.

Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, significantly increasing it in some cases. We propose GPress, a framework for querying GFF files in a compressed form.

Motivation: The amount of genomic data generated globally is seeing explosive growth, leading to increasing needs for processing, storage and transmission resources, which motivates the development of efficient compression tools for these data. Work so far has focused mainly on the compression of data generated by short-read technologies. However, nanopore sequencing technologies are rapidly gaining popularity due to the advantages offered by the large increase in the average size of the produced reads, the reduction in their cost and the portability of the sequencing technology.

Motivation: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can be potentially eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases.

Motivation: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data.
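
To make the notion of a binarization scheme concrete, the sketch below implements truncated unary binarization, a scheme commonly paired with context-adaptive binary arithmetic coding. It is a generic illustration and is not taken from the GABAC code base: the function names `tu_encode` and `tu_decode` and the convention of ones terminated by a zero are assumptions for the example.

```python
def tu_encode(value, c_max):
    """Truncated unary: `value` ones followed by a zero.

    The terminating zero is omitted when value == c_max, because the
    decoder already knows no larger value is possible.
    """
    if not 0 <= value <= c_max:
        raise ValueError("value must lie in [0, c_max]")
    bits = [1] * value
    if value < c_max:
        bits.append(0)
    return bits

def tu_decode(bits, c_max):
    """Count leading ones until a zero is seen or c_max is reached."""
    value = 0
    for bit in bits:
        if bit == 0:
            break
        value += 1
        if value == c_max:
            break
    return value

# Round trip over every symbol of a hypothetical alphabet {0, ..., 4}.
round_trip = [tu_decode(tu_encode(v, 4), 4) for v in range(5)]
```

Binarization turns each symbol into a short bit sequence; a binary arithmetic coder can then code each bit with its own adaptive probability model.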

Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance.

Background: Mass spectrometry (MS) is a widely used technique in biology research and has become key in proteomics and metabolomics analyses. As a result, the amount of MS data has increased significantly in recent years. For example, the MS repository MassIVE contains more than 123 TB of data.

Motivation: High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to fully exploit the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack one or more crucial properties, such as support for variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression.

Motivation: The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files.

Motivation: DNA methylation is one of the most important epigenetic mechanisms in cells and plays a significant role in controlling gene expression. Abnormal methylation patterns have been associated with cancer, imprinting disorders and repeat-instability diseases. As inexpensive bisulfite sequencing approaches have led to significant efforts in acquiring methylation data, problems of data storage and management have become increasingly important.
