Statistical representation models for mutation information within genomic data.

BMC Bioinformatics

Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey.

Published: June 2019

Background: As DNA sequencing technologies improve and become cheaper, genomic data can be utilized for the diagnosis of many diseases, such as cancer. Raw human genome data is very large for computational systems to handle. Therefore, there is a need for a compact and accurate representation of the valuable information in DNA. Complex genetic disorders often result from multiple gene mutations, and these mutations do not contribute equally to the development of a disease. Inspired by the field of information retrieval, we propose using the term frequency (tf) and BM25 term weighting measures with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease and the fewer mutations it has in other patients, the more discriminative that gene is.
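
The weighting idea can be illustrated with a minimal sketch. It treats each patient as a "document" and each gene as a "term" whose frequency is its mutation count. The formulas below are the standard information retrieval forms of idf, rf, and BM25 saturation, and they are assumptions here: the abstract does not state the exact variants, the parameter values (k1, b), or how rf is handled for more than two classes. All function names and the toy patient data are hypothetical.

```python
import math

def bm25_tf(count, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25-saturated term frequency (standard Okapi form; k1 and b are assumed defaults)."""
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return count * (k1 + 1.0) / (count + norm)

def idf(n_patients, n_with_mutation):
    """Inverse document frequency: genes mutated in few patients score higher."""
    return math.log(n_patients / n_with_mutation)

def rf(pos_with_mutation, neg_with_mutation):
    """Relevance frequency: favors genes mutated mostly in the target class."""
    return math.log2(2.0 + pos_with_mutation / max(1, neg_with_mutation))

def gene_stats(patients, gene, target):
    """Count patients with a mutated gene inside and outside the target class."""
    pos = sum(1 for muts, label in patients if label == target and gene in muts)
    neg = sum(1 for muts, label in patients if label != target and gene in muts)
    return pos, neg

def weight_patient(muts, patients, target, use_bm25=True):
    """Weight each mutated gene of one patient with tf-rf or BM25-tf-rf (one-vs-rest, assumed)."""
    avg_len = sum(sum(m.values()) for m, _ in patients) / len(patients)
    doc_len = sum(muts.values())
    weights = {}
    for gene, count in muts.items():
        pos, neg = gene_stats(patients, gene, target)
        tf_value = bm25_tf(count, doc_len, avg_len) if use_bm25 else float(count)
        weights[gene] = tf_value * rf(pos, neg)
    return weights

# Toy data (hypothetical): gene -> mutation count, plus a class label per patient.
patients = [
    ({"TP53": 3, "KRAS": 1}, "A"),
    ({"TP53": 2}, "A"),
    ({"KRAS": 1, "BRCA1": 2}, "B"),
    ({"BRCA1": 1}, "B"),
]

weights = weight_patient(patients[0][0], patients, target="A")
```

In this toy data the gene mutated only in class "A" patients receives a larger weight than the gene mutated in both classes, which matches the stated assumption about discriminative genes.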

Results: We evaluated the proposed representations on the task of cancer type classification. We applied various machine learning techniques using the tf-idf and tf-rf schemes and their BM25 versions. Our results show that the BM25-tf-rf representation leads to improved classification accuracy and f-score values compared to the other representations. The highest accuracy (76.44%) and f-score (76.95%) are achieved with the BM25-tf-rf based data representation.
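
For readers unfamiliar with the two reported metrics, the following is a small sketch of how accuracy and an f-score can be computed from predicted cancer types. The macro-averaged F1 used here is an assumption, since the abstract does not say which averaging was applied, and the labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of patients whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for four patients.
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```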

Conclusions: In our experiments, the BM25-tf-rf scheme combined with the proposed neural network model is the best performing classification system for our case study of cancer type classification. This system is further utilized for causal gene analysis. Several of the genes that contribute most to the system's decisions are reported in the literature as target or causal genes.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6567431
DOI: http://dx.doi.org/10.1186/s12859-019-2868-4


