An Empirical Study of Training Data Selection Methods for Ranking-Oriented Cross-Project Defect Prediction.

Sensors (Basel)

School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China.

Published: November 2021

AI Article Synopsis

  • Ranking-oriented cross-project defect prediction (ROCPDP) aims to predict defects in new software projects by ranking modules based on their predicted defect count, but suffers from differences in data distributions between source and target projects.
  • The study analyzed nine training data selection methods to see how they affect ROCPDP model performance, revealing no significant differences in performance metrics between the methods.
  • While ROCPDP models using filtered cross-project data underperformed compared to ROWPDP models with ample within-project data, they outperformed ROWPDP models that had limited within-project data, suggesting that using data from other projects can be beneficial when within-project data is scarce.

Article Abstract

Ranking-oriented cross-project defect prediction (ROCPDP), which ranks the software modules of a new target industrial project according to their predicted defect number or density, has been suggested in the literature. A major concern in ROCPDP is the distribution difference between the source project (i.e., cross-project) data and the target project (i.e., within-project) data, which evidently degrades prediction performance. To investigate the impact of training data selection methods on the performance of ROCPDP models, we examined the practical effects of nine training data selection methods, including a global filter that does not filter out any cross-project data. Additionally, the prediction performance of ROCPDP models trained on cross-project data filtered by these selection methods was compared with that of ranking-oriented within-project defect prediction (ROWPDP) models trained on sufficient and on limited within-project data. Eleven available defect datasets from industrial projects were considered and evaluated using two ranking performance measures, FPA and Norm(Popt). The results showed no statistically significant differences among the nine training data selection methods in terms of either FPA or Norm(Popt). The performance of ROCPDP models trained on filtered cross-project data was not comparable with that of ROWPDP models trained on sufficient historical within-project data. However, ROCPDP models trained on filtered cross-project data achieved better performance than ROWPDP models trained on limited historical within-project data. We therefore recommend that software quality teams exploit other projects' datasets to perform ROCPDP when there is no or limited within-project data.
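The abstract evaluates models with the fault-percentile-average (FPA), a ranking measure: modules are sorted by predicted defect count, and FPA averages, over every cutoff m, the proportion of actual defects captured by the top-m predicted modules. A minimal sketch of this common formulation follows; the function name and the descending-order convention are illustrative choices, not taken from the paper.

```python
def fpa(predicted, actual):
    """Fault-percentile-average for a ranked list of modules.

    predicted: model's defect-count predictions, one per module.
    actual:    true defect counts, one per module.
    Returns the mean, over m = 1..K, of the fraction of all actual
    defects found in the m modules ranked highest by the model.
    """
    K = len(actual)
    N = sum(actual)  # total number of actual defects
    # Order the true defect counts by predicted score, highest first.
    ranked = [a for _, a in sorted(zip(predicted, actual),
                                   key=lambda pair: -pair[0])]
    total, cum = 0.0, 0
    for m in range(K):
        cum += ranked[m]        # defects captured in the top (m+1) modules
        total += cum / N        # proportion of all defects captured
    return total / K


# A perfect ranking (most defective module predicted highest) maximizes FPA.
print(fpa([0.9, 0.5, 0.1], [3, 1, 0]))  # (3/4 + 4/4 + 4/4) / 3 ≈ 0.917
```

A completely inverted ranking of the same data yields (0 + 1/4 + 4/4) / 3 ≈ 0.417, so higher FPA means defect-dense modules sit nearer the top of the inspection queue.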


Source

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8625928
DOI: http://dx.doi.org/10.3390/s21227535


