Semantic analysis and classification of malware for UNIX-like operating systems with the use of machine learning methods

Mishchenko, Maksym; Міщенко, Максим Валерійович; Мищенко, Максим Валерьевич; Dorosh, Mariia; Дорош, Марія Сергіївна; Дорош, Мария Сергеевна


dc.contributor.author	Mishchenko, Maksym
dc.contributor.author	Міщенко, Максим Валерійович
dc.contributor.author	Мищенко, Максим Валерьевич
dc.contributor.author	Dorosh, Mariia
dc.contributor.author	Дорош, Марія Сергіївна
dc.contributor.author	Дорош, Мария Сергеевна
dc.date.accessioned	2023-05-03T21:31:37Z
dc.date.available	2023-05-03T21:31:37Z
dc.date.issued	2022-12-28
dc.identifier.citation	Mishchenko, M., Dorosh, M. (2022). Semantic analysis and classification of malware for UNIX-like operating systems with the use of machine learning methods. Applied Aspects of Information Technology, Vol. 5, N 4, p. 371–386.	en
dc.identifier.citation	Mishchenko, M. Semantic analysis and classification of malware for UNIX-like operating systems with the use of machine learning methods / M. Mishchenko, M. Dorosh // Applied Aspects of Information Technology = Прикладні аспекти інформ. технологій. – Odesa, 2022. – Vol. 5, N 4. – P. 371–386.	en
dc.identifier.issn	2617-4316
dc.identifier.issn	2663-7723
dc.identifier.uri	http://dspace.opu.ua/jspui/handle/123456789/13471
dc.description.abstract	The paper focuses on malware classification based on semantic analysis of the opcodes in disassembled binary sections, using n-grams, the TF-IDF indicator, and machine learning algorithms. The purpose of the research is to improve and extend the variety of methods for identifying malware developed for UNIX-like operating systems. The task of the research is to create an algorithm that can identify the types of threats in malicious binary files using n-grams, the TF-IDF indicator, and machine learning algorithms. Malware classification can be based on either static or dynamic signatures. Static signatures can be represented as byte-code sequences, binary-assembled instructions, or imported libraries. Dynamic signatures can be represented as the sequence of actions made by malware. We use a static signature strategy for semantic analysis and classification of malware. In this paper, we work with binary ELF files, the most common executable file type for UNIX-like operating systems. For this research we gathered 2999 malware ELF files, using data from the VirusShare and VirusTotal sites, and 959 non-malware program files from the /usr/bin directory of a Linux operating system. Each malware file represents one of 3 malware families: Gafgyt, Mirai, and Lightaidra, which are popular and harmful threats to UNIX systems. Each ELF file in the dataset was labelled according to its type. The proposed classification algorithm consists of several preparation steps: disassembly of every ELF binary file in the dataset, followed by semantic processing and vectorization of the assembly instructions in each file section. The Multinomial Naive Bayes model is used to set the classification threshold. Using the classification threshold, we determine the n-gram size and the file section that give the best classification results.
To obtain the best score, multiple machine learning models are used, along with hyperparameter optimization. Mean accuracy and weighted F1 score are used as metrics of the accuracy of the designed algorithm. Based on the experimental results, an SVM model trained with stochastic gradient descent was selected as the best-performing ML model. The developed algorithm was experimentally shown to be effective for classifying malware for UNIX operating systems. The results were analyzed and used to draw conclusions and make suggestions for future work.	en
dc.description.abstract	The article focuses on malware classification based on semantic analysis of the opcodes in disassembled sections of binary executable files, using n-grams, the TF-IDF indicator, and machine learning algorithms. The purpose of the research is to improve and extend the existing methods for identifying malware developed for UNIX-like operating systems. The task of the research is to create an algorithm that can identify the types of threats in malicious binary files for UNIX-like systems using n-grams, the TF-IDF indicator, and machine learning algorithms. Malware classification can be based on static or dynamic signatures. Static signatures can be represented as byte-code sequences, binary instructions, or imported libraries. Dynamic signatures can be represented as the sequence of actions performed by malware. We use a static signature strategy for semantic analysis and classification of malware. In this article, we work with binary ELF files, which are the most common executable file type for UNIX-like operating systems. For this research, a dataset of 2999 malicious ELF file samples was collected using data from the VirusShare and VirusTotal sites, together with 959 non-malicious program files from the /usr/bin directory of a Linux operating system. Each malicious file represents one of 3 malware families: Gafgyt, Mirai, and Lightaidra, which are widespread threats to UNIX-like systems. Each ELF file in the resulting dataset was labelled according to its type. The proposed classification algorithm consists of several preparation steps: disassembly of every ELF binary file in the dataset, followed by semantic processing and vectorization of the instructions in each file section. The Multinomial Naive Bayes model is used to set the classification threshold.
Using the classification threshold, we determine the n-gram size and the file section that give the best classification results. The best classification accuracy was obtained for n-grams of size 4 and the rodata section. To obtain the best accuracy, several machine learning models are used together with hyperparameter optimization. Mean accuracy and weighted F1 score are used as metrics of the accuracy of the developed algorithm. Based on the experimental results, an SVM model trained with stochastic gradient descent was selected as the best ML model. The effectiveness of the developed algorithm for classifying malware for UNIX-like operating systems was confirmed experimentally. The results were analyzed and used to draw conclusions and make suggestions for future work.	en
dc.language.iso	en	en
dc.publisher	Odessa Polytechnic National University	en
dc.subject	Malware detection	en
dc.subject	machine learning	en
dc.subject	semantic analysis	en
dc.subject	multiclass classification	en
dc.subject	text mining	en
dc.subject	operating system	en
dc.subject	Виявлення шкідливого програмного забезпечення	en
dc.subject	машинне навчання	en
dc.subject	семантичний аналіз	en
dc.subject	багатокласова класифікація	en
dc.subject	інтелектуальний аналіз тексту	en
dc.subject	операційна система	en
dc.title	Semantic analysis and classification of malware for UNIX-like operating systems with the use of machine learning methods	en
dc.title.alternative	Семантичний аналіз і класифікація шкідливого програмного забезпечення для UNIX-подібних систем з використанням методів машинного навчання	en
dc.type	Article	en
opu.citation.journal	Applied Aspects of Information Technology	en
opu.citation.volume	5	en
opu.citation.firstpage	371	en
opu.citation.lastpage	386	en
opu.citation.issue	4	en
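
The vectorization step described in the abstract (opcode n-grams weighted by TF-IDF) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the function names `opcode_ngrams` and `tfidf_vectors`, the toy opcode sequences, and the textbook weighting (term frequency times the natural log of N/df) are all assumptions made for illustration. The record does not state which TF-IDF variant, disassembler, or library the paper used. The sketch uses n = 2 only to keep the toy data readable.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return the contiguous n-grams (as tuples) of an opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def tfidf_vectors(documents, n):
    """Compute TF-IDF weights for opcode n-grams.

    documents: list of opcode lists, e.g. one list per disassembled section.
    Returns one dict per document mapping n-gram -> weight.
    Assumed weighting: tf = count / total n-grams, idf = ln(N / df).
    """
    counts_per_doc = [Counter(opcode_ngrams(doc, n)) for doc in documents]

    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())

    n_docs = len(documents)
    vectors = []
    for counts in counts_per_doc:
        total = sum(counts.values())
        vectors.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in counts.items()
        })
    return vectors


# Hypothetical opcode sequences, invented for this example only.
samples = [
    ["push", "mov", "call", "pop", "ret"],
    ["push", "mov", "xor", "pop", "ret"],
    ["push", "mov", "call", "jmp", "ret"],
]
vectors = tfidf_vectors(samples, n=2)
```

In this toy data the bigram `("push", "mov")` occurs in every sample and therefore gets weight zero, while bigrams unique to one sample get the largest weights. Vectors of this kind are what a classifier such as Multinomial Naive Bayes or a linear SVM would then be trained on.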