Predicting target genes of non-coding regulatory variants with IRT.

Bioinformatics

Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA.

Published: August 2020

Summary: Interpreting genetic variants of unknown significance (VUS) is essential in clinical applications of genome sequencing for diagnosis and personalized care. Non-coding variants remain particularly difficult to interpret, despite making up a large majority of the trait associations identified in genome-wide association studies (GWAS). Predicting the regulatory effects of non-coding variants on candidate genes is a key step in evaluating their clinical significance. Here, we develop a machine-learning algorithm, Inference of Connected expression quantitative trait loci (eQTLs) (IRT), to predict the regulatory targets of non-coding variants identified in studies of eQTLs. We assemble datasets using eQTL results from the Genotype-Tissue Expression (GTEx) project and learn to separate positive and negative variant-gene pairs based on annotations characterizing the variant, the gene and the intermediate sequence. IRT achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.799 under random cross-validation and 0.700 under a more stringent position-based cross-validation. Further evaluation on rare variants and experimentally validated regulatory variants shows that IRT identifies the true target genes significantly more often than negative controls. In gene-ranking experiments, IRT achieves a top-1 accuracy of 50% and a top-3 accuracy of 90%. Salient features, including GC-content, histone modifications and Hi-C interactions, are further analyzed and visualized to illustrate their influence on predictions. IRT can be applied to any VUS of interest and each nearby candidate gene to output a score reflecting the likelihood of a regulatory effect on that gene's expression level. These scores can be used to prioritize variants and genes to assist in patient diagnosis and GWAS follow-up studies.
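The two evaluation ideas mentioned in the summary, position-based cross-validation and top-k accuracy in gene ranking, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 1 Mb bin size, the round-robin assignment of bins to folds, and the toy gene scores are all hypothetical.

```python
def position_based_folds(variants, n_folds=5, bin_size=1_000_000):
    """Assign each (chromosome, position) variant to a cross-validation fold.

    Assumption: variants are grouped into fixed-size genomic bins and every
    variant in a bin goes to the same fold, so nearby variants never end up
    on both sides of a train/test split. Random cross-validation would
    instead assign each variant independently of its position.
    """
    bins = sorted({(chrom, pos // bin_size) for chrom, pos in variants})
    # Round-robin assignment of bins to folds (hypothetical choice).
    fold_of_bin = {b: i % n_folds for i, b in enumerate(bins)}
    return [fold_of_bin[(chrom, pos // bin_size)] for chrom, pos in variants]


def top_k_accuracy(scores_by_variant, true_gene_by_variant, k):
    """Fraction of variants whose true target gene is among the k
    highest-scoring candidate genes."""
    hits = 0
    for variant, gene_scores in scores_by_variant.items():
        # Rank candidate genes from highest to lowest predicted score.
        ranked = sorted(gene_scores, key=gene_scores.get, reverse=True)
        if true_gene_by_variant[variant] in ranked[:k]:
            hits += 1
    return hits / len(scores_by_variant)


if __name__ == "__main__":
    # Toy data: variant positions and made-up scores for candidate genes.
    variants = [("chr1", 100_000), ("chr1", 900_000),
                ("chr1", 1_500_000), ("chr2", 50_000)]
    print(position_based_folds(variants, n_folds=2))

    scores = {
        "var1": {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.1},
        "var2": {"GENE_A": 0.2, "GENE_B": 0.7, "GENE_C": 0.6},
    }
    truth = {"var1": "GENE_A", "var2": "GENE_C"}
    print(top_k_accuracy(scores, truth, k=1))
    print(top_k_accuracy(scores, truth, k=3))
```

Keeping whole genomic bins together is one plausible reason a position-based split is more stringent than a random one: a model cannot score well simply by recognizing the neighborhood of a variant it has already seen during training.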

Availability and Implementation: Code and data used in this work are available at https://github.com/miaecle/eQTL_Trees.

Supplementary Information: Supplementary data are available at Bioinformatics online.


Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7575052 (PMC)
http://dx.doi.org/10.1093/bioinformatics/btaa254 (DOI)


Similar Publications

Low-density lipoprotein cholesterol (LDL-C) is a well-established risk factor for cardiovascular disease, and it plays a causal role in the development of atherosclerosis. Genome-wide association studies (GWASs) have successfully identified hundreds of genetic variants associated with LDL-C. Most of these risk loci fall in non-coding regions of the genome, and it is unclear how these non-coding variants affect circulating lipid levels.


Interplay between genetics and epigenetics in lung fibrosis.

Int J Biochem Cell Biol

January 2025

Centre for Respiratory Research, Translational Medical Sciences, School of Medicine, University of Nottingham, UK; Nottingham NIHR Biomedical Research Centre, Nottingham, UK; Biodiscovery Institute, University Park, University of Nottingham, UK.

Lung fibrosis, including idiopathic pulmonary fibrosis (IPF), is a complex and devastating disease characterised by the progressive scarring of lung tissue leading to compromised respiratory function. Aberrantly activated fibroblasts deposit extracellular matrix components into the surrounding lung tissue, impairing lung function and capacity for gas exchange. Both genetic and epigenetic factors have been found to play a role in the pathogenesis of lung fibrosis, with emerging evidence highlighting the interplay between these two regulatory mechanisms.


Background: Several studies have suggested a genetic association between IL10RA variants and susceptibility to Behçet's disease (BD). However, the precise mechanism of the association is still unknown. The purpose of this study was to investigate the mechanism underlying the genetic associations between IL10RA polymorphisms and the risk of BD.


Background: This study aimed to elucidate the genetic and molecular mechanisms underlying psoriasis by employing an integrative multi-omics approach, using summary-data-based Mendelian randomization (SMR) to infer causal relationships among DNA methylation, gene expression, and protein levels in relation to psoriasis risk.

Methods: We conducted SMR analyses integrating genome-wide association study (GWAS) summary statistics with methylation quantitative trait loci (mQTL), expression quantitative trait loci (eQTL), and protein quantitative trait loci (pQTL) data. Publicly available datasets were utilized, including psoriasis GWAS data from the European Molecular Biology Laboratory-European Bioinformatics Institute and the UK Biobank.


Background: Emerging evidence suggests that non-coding somatic single nucleotide variants (SNVs) in cis-regulatory elements (CREs) contribute to cancer by disrupting gene expression networks. However, the role of non-coding SNVs in cancer, particularly neuroblastoma, remains largely unclear.

Methods: The effect of SNVs on CRE activity was evaluated by luciferase assays.
