It is increasingly recognized that an important step towards improving overall health is to accurately measure biomarkers of health from the molecular activities prevalent in the oral cavity. We present a general methodology for computationally quantifying the activity of microbial functional pathways using metatranscriptomic data. We describe their implementation as a collection of eight oral pathway scores using a large salivary sample dataset (n = 9350), and we evaluate score associations with oropharyngeal disease phenotypes within an unseen independent cohort (n = 14,129).
Objective: Oral squamous cell carcinoma (OSCC) and oropharyngeal squamous cell carcinoma (OPSCC) can go undetected, resulting in late diagnosis and poor outcomes. We describe the development and validation of CancerDetect for Oral & Throat cancer™ (CDOT), to detect markers of OSCC and/or OPSCC within a high-risk population.
Material And Methods: We collected saliva samples from 1,175 individuals who were 50 years or older, or adults with a tobacco use history.
The authors report here the development of a high-throughput, automated, inexpensive and clinically validated saliva metatranscriptome test that requires less than 100 μl of saliva. RNA is preserved at the time of sample collection, allowing for ambient-temperature transportation and storage for up to 28 days. Critically, the RNA preservative is also able to inactivate pathogenic microorganisms, rendering the samples noninfectious and allowing for safe and easy shipping.
Despite advances in cancer treatment, the 5-year mortality rate for oral cancers (OC) is 40%, mainly due to the lack of early diagnostics. To advance early diagnostics for high-risk and average-risk populations, we developed and evaluated machine-learning (ML) classifiers using metatranscriptomic data from saliva samples (n = 433) collected from oral premalignant disorders (OPMD), OC patients (n = 71) and normal controls (n = 171). Our diagnostic classifiers yielded a receiver operating characteristics (ROC) area under the curve (AUC) up to 0.
IEEE/ACM Trans Comput Biol Bioinform
January 2022
Bulk samples from the same patient are heterogeneous in nature, comprising different subpopulations (subclones) of cancer cells. Cells in a tumor subclone are characterized by a unique mutational genotype profile. Resolving tumor heterogeneity by estimating the genotypes, cellular proportions and the number of subclones present in the tumor can help in understanding cancer progression and treatment.
Tumors are heterogeneous in the sense that they consist of multiple subpopulations of cells, referred to as subclones, each of which is characterized by a distinct profile of genomic variations such as somatic mutations. Inferring the underlying clonal landscape has become an important topic in that it can help in understanding cancer development and progression, and thereby help in improving treatment. We describe a novel state-space model, based on the feature allocation framework and an efficient sequential Monte Carlo (SMC) algorithm, using the somatic mutation data obtained from tumor samples to estimate the number of subclones, as well as their characterization.
BMC Bioinformatics
January 2019
Background: Tumor samples are heterogeneous. They consist of varying cell populations or subclones, and each subclone is characterized by a distinct single nucleotide variant (SNV) profile. This explains the source of genetic heterogeneity observed in tumor sequencing data.
Tumor samples obtained from a single cancer patient, spatially or temporally, often consist of varying cell populations, each harboring distinct mutations that uniquely characterize its genome. Thus, observing more than two haplotypes in any given sample of a tumor, where a haplotype is defined as a scaffold of single nucleotide variants (SNVs) on the same homologous genome, is evidence of heterogeneity, because humans are diploid and we would therefore observe at most two haplotypes if all cells in a tumor sample were genetically homogeneous. We characterize tumor heterogeneity by latent haplotypes and present a state-space formulation of the feature allocation model for estimating the haplotypes and their proportions in the tumor samples.
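The diploid argument in this abstract can be illustrated with a toy check. This is only a sketch under strong assumptions, not the state-space feature allocation model the authors describe: it assumes error-free reads that each cover the same SNV sites, and the names `count_haplotypes`, `is_heterogeneous` and `min_support` are hypothetical.

```python
from collections import Counter


def count_haplotypes(reads, min_support=2):
    """Count distinct SNV scaffolds (haplotypes) seen across reads.

    Each read is a sequence of alleles at the same ordered SNV sites.
    Haplotypes seen fewer than `min_support` times are dropped as a crude
    guard against sequencing errors (an assumption for illustration only).
    """
    counts = Counter(tuple(read) for read in reads)
    return {hap: n for hap, n in counts.items() if n >= min_support}


def is_heterogeneous(reads, min_support=2):
    """A diploid, genetically homogeneous sample shows at most two haplotypes."""
    return len(count_haplotypes(reads, min_support)) > 2


# Toy data: alleles at three SNV sites, written as strings.
homogeneous = ["ACG", "ACG", "TCA", "TCA"]
mixed = ["ACG", "ACG", "TCA", "TCA", "TGA", "TGA"]

print(is_heterogeneous(homogeneous))
print(is_heterogeneous(mixed))
```

Real sequencing reads rarely span all sites and do contain errors, which is why a probabilistic model such as the one described in the abstract is needed in practice.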
BMC Bioinformatics
December 2017
Background: Samples of molecular sequence data of a locus obtained from random individuals in a population are often related by an unknown genealogy. More importantly, population genetics parameters are of significant interest, for instance the scaled population mutation rate Θ = 4Nμ for diploids or Θ = 2Nμ for haploids (where N is the effective population size and μ is the mutation rate per site per generation), which explains some of the evolutionary history and past qualities of the population from which the samples are obtained.
Results: In this paper, we present the evolution of sequence data in a Bayesian framework and the approximation of the posterior distributions of the unknown parameters of the model, which include Θ, via sequential Monte Carlo (SMC) samplers for static models.
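As a small worked example of the definition Θ = 4Nμ (diploid) or Θ = 2Nμ (haploid) given in the Background, the following sketch simply evaluates the formula for known N and μ. It does not implement the paper's SMC sampler, which estimates Θ from sequence data; the function name `scaled_mutation_rate` and the numeric inputs are illustrative assumptions.

```python
def scaled_mutation_rate(effective_size, mutation_rate, ploidy="diploid"):
    """Return Theta = 4*N*mu for diploids or 2*N*mu for haploids.

    effective_size: effective population size N
    mutation_rate:  mutation rate mu per site per generation
    """
    factors = {"diploid": 4, "haploid": 2}
    if ploidy not in factors:
        raise ValueError(f"ploidy must be one of {sorted(factors)}")
    return factors[ploidy] * effective_size * mutation_rate


# Illustrative values only (not taken from the paper).
theta_diploid = scaled_mutation_rate(10_000, 1e-8, "diploid")
theta_haploid = scaled_mutation_rate(10_000, 1e-8, "haploid")
print(theta_diploid, theta_haploid)
```

For the same N and μ, the diploid value is twice the haploid value, reflecting the doubled number of gene copies in the population.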
High-throughput gene expression data are often obtained from pure or complex (heterogeneous) biological samples. In the latter case, data obtained are a mixture of different cell types and the heterogeneity imposes some difficulties in the analysis of such data. In order to make conclusions on gene expression data obtained from heterogeneous samples, methods such as microdissection and flow cytometry have been employed to physically separate the constituting cell types.
EURASIP J Bioinform Syst Biol
December 2016
Background: Gene expression time series data are usually in the form of high-dimensional arrays. Unfortunately, the data may sometimes contain missing values: either the expression values of some genes at some time points, or the entire set of expression values at a single time point or at several consecutive time points. This significantly affects the performance of many algorithms for gene expression analysis that take the complete matrix of gene expression measurements as input.