The impact of training data selection on the software defect prediction performance and data complexity
Benyamin Langgu Sinaga, Sabrina Ahmad, Zuraida Abal Abas, Antasena Wahyu Anggarajati
Abstract
Directly learning a defect prediction model from cross-project datasets results in a model with poor performance. Hence, training data selection becomes a feasible solution to this problem. Limited comparative studies investigating the effect of training data selection on the prediction performance have presented contradictory results. Those studies also did not analyze why a training data selection method underperforms. This study aims to investigate the impact of training data selection on the defect prediction model and data complexity measures. The method is based on an empirical comparison between prediction performance and data complexity measure before and after selection. This study compared 13 training data selection methods on 61 projects using six classification algorithms and measured the data complexity using six complexity measures focusing on overlap class, noise level, and class imbalanced ratio. Experimental results indicate that the best method for each dataset varies depending on the dataset and classifiers. The training data selection most affects noise rate and class imbalance. We concluded that critically selecting the training data method could improve the performance of the prediction model. We recommend dealing with noise and unbalanced classes when designing training data methods.
Keywords
Comparative study; Cross-project defect prediction; Data complexity measure; Software defect prediction; Training data selection
DOI:
https://doi.org/10.11591/eei.v11i5.3698
Refbacks
There are currently no refbacks.
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License .
<div class="statcounter"><a title="hit counter" href="http://statcounter.com/free-hit-counter/" target="_blank"><img class="statcounter" src="http://c.statcounter.com/10241695/0/5a758c6a/0/" alt="hit counter"></a></div>
Bulletin of EEI Stats
Bulletin of Electrical Engineering and Informatics (BEEI) ISSN: 2089-3191, e-ISSN: 2302-9285 This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU) .