Towards a software defect proneness model: feature selection

Yakovyna, Vitaliy; Яковина, Віталій Степанович; Яковина, Виталий Степанович; Symets, Ivan; Симець, Іван Ігорович; Симец, Иван Игоревич

dc.contributor.author	Yakovyna, Vitaliy
dc.contributor.author	Яковина, Віталій Степанович
dc.contributor.author	Яковина, Виталий Степанович
dc.contributor.author	Symets, Ivan
dc.contributor.author	Симець, Іван Ігорович
dc.contributor.author	Симец, Иван Игоревич
dc.date.accessioned	2021-12-27T18:45:17Z
dc.date.available	2021-12-27T18:45:17Z
dc.date.issued	2021-12-21
dc.identifier.citation	Yakovyna, V., Symets, I. (2021). Towards a software defect proneness model: feature selection. Applied Aspects of Information Technology, Vol. 4, N 4, p. 354–365.	en
dc.identifier.citation	Yakovyna, V. Towards a software defect proneness model: feature selection / V. Yakovyna, I. Symets // Applied Aspects of Information Technology = Прикладні аспекти інформ. технологій. – Odesa, 2021. – Vol. 4, N 4. – P. 354–365.	en
dc.identifier.issn	2617-4316
dc.identifier.issn	2663-7723
dc.identifier.uri	http://dspace.opu.ua/jspui/handle/123456789/12038
dc.description.abstract	Дана стаття націлена на удосконалення статичних моделей надійності ПЗ за рахунок використання методів машинного навчання для вибору метрик коду ПЗ, що найсильніше впливають на його надійність. У дослідженні було використано злитий датасет з репозиторію PROMISE Software Engineering, який містив дані про тестування програмних модулів п’яти програм (KC1, KC2, PC1, CM1, JM1) та двадцять одну метрику коду. Для підготовленої вибірки було здійснено вибір найважливіших ознак, які впливають на якість програмного коду за допомогою наступних методів вибору ознак: Boruta, Step-wise selection, Exhaustive Feature Selection, Random Forest Importance, LightGBM Importance, Genetic Algorithms, Principal Component Analysis, Xverse python. На основі голосування за результатами роботи методів вибору ознак побудовано статичну (детерміністичну) модель надійності програмного забезпечення, яка встановлює взаємозв’язок між ймовірністю появи дефекту в програмному модулі та метриками його коду. Показано, що в цю модель входять такі метрики коду як кількість гілок програми, кількість рядків коду та цикломатична складність за МакКейбом, загальна кількість операторів та операндів, інтелект, обсяг та кількість зусиль за Холстедом. Здійснено порівняння ефективності роботи різних методів вибору ознак, зокрема проведено дослідження впливу методу вибору ознак на точність класифікації із використанням наступних класифікаторів: Random Forest, Support Vector Machine, k-Nearest Neighbor, Decision Tree classifier, AdaBoost classifier, Gradient Boosting for classification. 
Показано, що використання будь-якого методу вибору ознак підвищує точність класифікації принаймні на десять процентів порівняно з початковим датасетом, що підтверджує важливість цієї процедури для прогнозування дефектів програмного забезпечення на основі метричних датасетів, які містять значну кількість сильно корелюючих метрик коду ПЗ. Встановлено, що найкращу для більшості класифікаторів точність прогнозу вдалось отримати з використанням набору ознак, отриманого із запропонованої статичної моделі надійності ПЗ. Крім того, показано, що можливим також є використання окремих методів, таких як Autoencoder, Exhaustive Feature Selection та Principal Component Analysis з незначною втратою точності класифікації та прогнозування.	en
dc.description.abstract	This article aims to improve static software reliability models by using machine learning methods to select the code metrics that most strongly affect software reliability. The study used a merged dataset from the PROMISE Software Engineering repository, which contained test data for the software modules of five programs (KC1, KC2, PC1, CM1, JM1) and twenty-one code metrics. For the prepared sample, the features that most affect the quality of software code were selected using the following feature selection methods: Boruta, Stepwise selection, Exhaustive Feature Selection, Random Forest Importance, LightGBM Importance, Genetic Algorithms, Principal Component Analysis, Xverse python. Based on voting over the results of these feature selection methods, a static (deterministic) software reliability model was built that relates the probability of a defect in a software module to the metrics of its code. This model is shown to include the following code metrics: the branch count of a program, McCabe’s lines of code and cyclomatic complexity, and Halstead’s total number of operators and operands, intelligence, volume, and effort. The effectiveness of the different feature selection methods was compared; in particular, the effect of the feature selection method on classification accuracy was studied using the following classifiers: Random Forest, Support Vector Machine, k-Nearest Neighbors, Decision Tree classifier, AdaBoost classifier, Gradient Boosting for classification. 
It has been shown that using any of the feature selection methods increases classification accuracy by at least ten percent compared to the original dataset, which confirms the importance of this procedure for predicting software defects from metric datasets that contain a significant number of highly correlated code metrics. The best prediction accuracy for most classifiers was reached using the set of features obtained from the proposed static software reliability model. In addition, it has been shown that individual methods such as Autoencoder, Exhaustive Feature Selection and Principal Component Analysis can also be used with an insignificant loss of classification and prediction accuracy.	en
dc.language.iso	en	en
dc.publisher	Odessa National Polytechnic University	en
dc.subject	надійність програмного забезпечення	en
dc.subject	машинне навчання	en
dc.subject	дефект	en
dc.subject	вибір ознак	en
dc.subject	прогнозування дефектів програмного забезпечення	en
dc.subject	Software reliability	en
dc.subject	machine learning algorithms	en
dc.subject	defect	en
dc.subject	feature selection	en
dc.subject	software defect prediction	en
dc.title	Towards a software defect proneness model: feature selection	en
dc.title.alternative	Побудова моделі дефектності програм: вибір метрик	en
dc.type	Article	en
opu.citation.journal	Applied Aspects of Information Technology	en
opu.citation.volume	4	en
opu.citation.firstpage	354	en
opu.citation.lastpage	365	en
opu.citation.issue	4	en
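
The abstract above describes building the feature set by voting over the outputs of several feature selection methods. A minimal sketch of such a vote is given below. The abstract does not state the voting threshold or how ties are handled, so the `min_votes` parameter, the `vote_features` helper, the metric names, and the per-method selections are all hypothetical placeholders, not the paper's procedure or results.

```python
from collections import Counter


def vote_features(selections, min_votes):
    """Return features chosen by at least `min_votes` selection methods.

    `selections` maps a method name to the features that method selected.
    The threshold rule is an assumption made for illustration only.
    """
    votes = Counter()
    for chosen in selections.values():
        # A method votes at most once per feature, even if it lists it twice.
        votes.update(set(chosen))
    # Order by descending vote count, then by name, for a deterministic result.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, count in ranked if count >= min_votes]


# Hypothetical selector outputs (placeholders, not taken from the paper).
selections = {
    "boruta": ["branch_count", "loc", "cyclomatic_complexity"],
    "random_forest_importance": ["loc", "halstead_volume", "branch_count"],
    "stepwise": ["loc", "halstead_effort"],
}

selected = vote_features(selections, min_votes=2)
print(selected)
```

With these placeholder inputs, only the metrics picked by at least two of the three methods are kept. Raising `min_votes` makes the resulting feature set stricter, and lowering it to 1 returns the union of all selections.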