Background: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. Labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches, which use unlabeled data in addition to small amounts of labeled data, can help reduce labeling costs. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem because the data distribution is highly imbalanced: splice sites are far fewer than non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers.

Results: Our experiments on five highly imbalanced splice site datasets, with positive-to-negative ratios of 1-to-99, showed that ensemble-based semi-supervised approaches are a good choice, even when labeled data make up less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines.

Conclusions: In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
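
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way an ensemble of self-training classifiers with dynamic balancing could look. Everything in it is an assumption made for illustration: the logistic-regression base learner, the numeric feature vectors, the function names (`fit_balanced_ensemble`, `self_train`), and the balancing rule (each ensemble member trains on all positives plus an equal-sized random sample of negatives, and each iteration pseudo-labels the same number of confident positives and negatives). It is not the authors' implementation.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, epochs=300):
    """Gradient-descent logistic regression (stand-in base learner, an assumption)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Probability of the positive (splice site) class for each row of X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def fit_balanced_ensemble(X, y, n_members, rng):
    """Each member sees all positives plus an equal-sized random sample of negatives."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    members = []
    for _ in range(n_members):
        sampled = rng.choice(neg, size=min(len(pos), len(neg)), replace=False)
        idx = np.concatenate([pos, sampled])
        members.append(fit_logreg(X[idx], y[idx]))
    return members

def ensemble_proba(members, X):
    """Average the members' positive-class probabilities."""
    return np.mean([predict_proba(w, X) for w in members], axis=0)

def self_train(X_lab, y_lab, X_unl, n_members=5, per_class=5, iterations=10, seed=0):
    """Self-training loop that adds equally many pseudo-positives and pseudo-negatives."""
    rng = np.random.default_rng(seed)
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(iterations):
        if len(X_unl) < 2 * per_class:
            break  # not enough unlabeled instances left to pick from
        members = fit_balanced_ensemble(X_lab, y_lab, n_members, rng)
        order = np.argsort(ensemble_proba(members, X_unl))
        conf_neg = order[:per_class]    # lowest scores: most confident negatives
        conf_pos = order[-per_class:]   # highest scores: most confident positives
        picked = np.concatenate([conf_pos, conf_neg])
        pseudo = np.concatenate([np.ones(per_class), np.zeros(per_class)])
        X_lab = np.vstack([X_lab, X_unl[picked]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unl = np.delete(X_unl, picked, axis=0)
    # Final ensemble trained on the original labels plus all pseudo-labels.
    return fit_balanced_ensemble(X_lab, y_lab, n_members, rng), X_lab, y_lab
```

Under these assumptions, a supervised baseline would correspond to calling `fit_balanced_ensemble` on the labeled data alone, while `self_train` additionally draws on the unlabeled pool. Pseudo-labels can be wrong, so a real evaluation would compare both on held-out data with a metric suited to imbalanced classes.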


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4565116
DOI: http://dx.doi.org/10.1186/1752-0509-9-S5-S1


