Positive-unlabeled learning for disease gene identification.

Bioinformatics

Bioinformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore.

Published: October 2012

Background: Identifying disease genes in the human genome is an important but challenging task in biomedical research. Machine learning methods can be applied to discover new disease genes based on the known ones. Existing methods typically use the known disease genes as the positive training set P and the unknown genes as the negative training set N (no confirmed set of non-disease genes exists), then build classifiers to identify new disease genes among the unknown genes. However, such classifiers are built from a noisy negative set, because N itself can contain undiscovered disease genes. As a result, the classifiers do not perform as well as they could.

Results: Instead of treating the unknown genes as negative examples in N, we treat them as an unlabeled set U. We design a novel positive-unlabeled (PU) learning algorithm, PUDI (PU learning for disease gene identification), to build a classifier using P and U. We first partition U into four sets: the reliable negative set RN, the likely positive set LP, the likely negative set LN and the weak negative set WN. Weighted support vector machines are then used to build a multi-level classifier from these four sets and the positive training set P to identify disease genes. Our experimental results demonstrate that the proposed PUDI algorithm significantly outperforms existing methods.
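The two steps described above (partitioning U, then training a weighted classifier) can be sketched in a few lines of Python. This is only an illustrative sketch, not the PUDI implementation: the abstract does not say how U is partitioned or how the sets are weighted, so the centroid-distance quartile rule, the `LEVELS` labels and weights, and all function names below are assumptions made for the example. The weighted SVM is approximated by a plain subgradient descent on a per-sample weighted hinge loss.

```python
import math
import random

# Assumed label and weight per partition (hypothetical values, not from the paper):
# LP is treated as weakly positive; WN, LN, RN as negatives of growing confidence.
LEVELS = {"LP": (+1, 0.5), "WN": (-1, 0.3), "LN": (-1, 0.6), "RN": (-1, 1.0)}


def centroid(rows):
    """Mean feature vector of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def partition_unlabeled(P, U):
    """Split U into LP, WN, LN, RN by distance to the centroid of P.

    Assumed rule: rank U by distance and cut into quartiles, nearest first.
    RN receives any remainder when len(U) is not divisible by 4.
    """
    c = centroid(P)
    ranked = sorted(U, key=lambda g: distance(g, c))
    q = len(ranked) // 4
    return {
        "LP": ranked[:q],
        "WN": ranked[q:2 * q],
        "LN": ranked[2 * q:3 * q],
        "RN": ranked[3 * q:],
    }


def build_training_set(P, parts):
    """Return (features, label, weight) triples for P and the four partitions."""
    samples = [(x, +1, 1.0) for x in P]
    for name, (label, weight) in LEVELS.items():
        samples += [(x, label, weight) for x in parts[name]]
    return samples


def train_weighted_svm(samples, dim, epochs=200, lr=0.01, lam=0.01, seed=0):
    """Linear SVM trained by subgradient descent on a weighted hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * dim, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y, c in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * lam) for wi in w]  # L2 regularisation step
            if margin < 1:  # hinge loss is active: step scaled by sample weight c
                w = [wi + lr * c * y * xi for wi, xi in zip(w, x)]
                b += lr * c * y
    return w, b


def decision(w, b, x):
    """Signed score; positive means 'predicted disease gene'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

In use, one would call `partition_unlabeled(P, U)`, pass the result to `build_training_set`, and train with `train_weighted_svm`. The point of the per-sample weight is that a mistake on a reliable negative costs more than a mistake on a weak negative, so hidden disease genes in U distort the decision boundary less than they would if all of U were treated as equally trustworthy negatives.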

Conclusion: The proposed PUDI algorithm identifies disease genes more accurately by treating the unknown data as an unlabeled set U rather than as a negative set N. Given that many machine learning problems in biomedical research involve positive and unlabeled data rather than true negative data, methods for these problems could likely be improved further by adopting PU learning, as we have done here for disease gene identification.

Availability And Implementation: The executable program and data are available at http://www1.i2r.a-star.edu.sg/~xlli/PUDI/PUDI.html.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3467748
DOI: http://dx.doi.org/10.1093/bioinformatics/bts504
