Evaluation of BIC and cross validation for model selection on sequence segmentations.

Int J Data Min Bioinform

HIIT, University of Helsinki and Helsinki University of Technology, P.O. Box 68, FI-00014 University of Helsinki, Finland.

Published: April 2011

Segmentation is a general data mining technique for summarising and analysing sequential data. Segmentation can be applied, e.g., when studying large-scale genomic structures such as isochores. Choosing the number of segments remains a challenging question. We present extensive experimental studies on two model selection techniques, the Bayesian Information Criterion (BIC) and cross validation (CV). We successfully identify segments with different means or variances, and demonstrate the effect of linear trends and outliers, both of which occur frequently in real data. Results are given for real DNA sequences with respect to changes in their codon, G+C, and bigram frequencies, and for copy-number variation from CGH data.
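As an illustration of the model selection problem described above, the following is a minimal sketch of choosing the number of segments with BIC. It is not the authors' implementation. It assumes a piecewise-constant mean with Gaussian noise of common variance, finds the least-squares segmentation for each k by dynamic programming, and scores it with n log(RSS/n) + p log n, where the parameter count p = 2k (k means, k - 1 boundaries, one variance) is an assumption of this sketch. The function names and the toy series are hypothetical, and cross validation is not shown.

```python
import math


def segment_costs(x):
    """Return cost(i, j): squared error of x[i:j] around its own mean."""
    n = len(x)
    s1 = [0.0] * (n + 1)  # prefix sums of values
    s2 = [0.0] * (n + 1)  # prefix sums of squared values
    for i, v in enumerate(x):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        s = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - s * s / (j - i))

    return cost


def best_segmentations(x, max_k):
    """For each k in 1..max_k, return (rss, interior boundaries)."""
    n = len(x)
    cost = segment_costs(x)
    inf = float("inf")
    # err[k][j]: minimal squared error for splitting x[:j] into k segments
    err = [[inf] * (n + 1) for _ in range(max_k + 1)]
    back = [[0] * (n + 1) for _ in range(max_k + 1)]
    err[0][0] = 0.0
    for k in range(1, max_k + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = err[k - 1][i] + cost(i, j)
                if c < err[k][j]:
                    err[k][j] = c
                    back[k][j] = i
    results = {}
    for k in range(1, max_k + 1):
        bounds, j = [], n
        for kk in range(k, 0, -1):
            j = back[kk][j]
            bounds.append(j)
        results[k] = (err[k][n], sorted(bounds)[1:])  # drop the leading 0
    return results


def bic(rss, n, k):
    """Gaussian BIC up to an additive constant; p = 2k is an assumption."""
    sigma2 = max(rss / n, 1e-12)  # guard against log(0) on perfect fits
    return n * math.log(sigma2) + 2 * k * math.log(n)


# Toy series: three segments with means 0, 3, 0 and a small alternating
# deterministic perturbation standing in for noise.
x = [base + (0.1 if i % 2 == 0 else -0.1)
     for base in (0.0, 3.0, 0.0) for i in range(20)]

results = best_segmentations(x, max_k=6)
scores = {k: bic(rss, len(x), k) for k, (rss, _) in results.items()}
best_k = min(scores, key=scores.get)
print(best_k, results[best_k][1])
```

On this toy series the BIC score is minimised at three segments with boundaries at positions 20 and 40. Real sequences with trends or outliers, as the abstract notes, can behave quite differently.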

Source
http://dx.doi.org/10.1504/ijdmb.2010.037547

Similar Publications

Despite advances in diagnostic techniques, accurate classification of lung cancer subtypes remains crucial for treatment planning. Traditional methods like genomic studies face limitations such as high cost and complexity. This study investigates whether integrating atomic force microscopy (AFM) measurements with conventional clinical and histopathological data can improve lung cancer subtype classification.

Background: The first trimester of pregnancy is critical for fetal development, making early antenatal care visits essential for timely check-ups and managing potential complications. However, delayed antenatal care initiation remains a public health challenge in sub-Saharan Africa, including Kenya. Therefore, this study aimed to assess and provide up-to-date information on time to first antenatal care visit and its predictors among women in Kenya, using data from the most recent 2022 Kenya Demographic and Health Survey (KDHS).

Background: Accurate fasting plasma glucose (FPG) trend prediction is important for management and treatment of patients with type 2 diabetes mellitus (T2DM), a globally prevalent chronic disease. (Generalised) linear mixed-effects (LME) models and machine learning (ML) are commonly used to analyse longitudinal data; however, the former is insufficient for dealing with complex, nonlinear data, whereas with the latter, random effects are ignored. The aim of this study was to develop LME, back propagation neural network (BPNN), and mixed-effects NN models that combine the 2 to predict FPG levels.

Using dense genomic markers opens up new opportunities and challenges for breeding programs. The need to penalize marker-specific regression coefficients becomes particularly important when dense markers are available. Therefore, fitting the marker effects to observations using a regularization technique, such as Bayesian LASSO (BL) regression, is of great interest.
