Classifying Malware
This summer, one major project Gavin and I took on was building models that classify malware. We experimented with numerous models and architectures; below is our process and the results we obtained.
Step 1: Data
The dataset we used for our experiments comes from VirusShare, a repository of malware samples commonly used for training malware classification and recognition models. We decided to find the ~20 most common families (families are types of malware such as zbot, winwebsec, and adload) and build a dataset out of their operation codes (opcodes for short). During data processing, we quickly realized that some families were much more frequent than others, and we decided to keep the dataset unbalanced because that gave us the most data to work with. In the end, we took the 21 most common families, extracted all of their opcodes using Radare2, a disassembler, and built a .csv file from the data. In total, we obtained over 9,000 samples.
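To make the dataset layout concrete, here is a minimal sketch of the final step, turning per-family opcode sequences into a labeled .csv file. The family names are from our dataset, but the opcode sequences and the `write_opcode_csv` helper are made up for illustration; the real extraction was done with Radare2.

```python
import csv
import io

def write_opcode_csv(samples, out):
    """Write (family, opcode-sequence) pairs as CSV rows.

    `samples` maps a family name to a list of opcode sequences,
    one sequence per disassembled sample.
    """
    writer = csv.writer(out)
    writer.writerow(["family", "opcodes"])
    for family, sequences in samples.items():
        for opcodes in sequences:
            # Store each sequence as a single space-joined field.
            writer.writerow([family, " ".join(opcodes)])

# Toy example with made-up opcode sequences for two families.
samples = {
    "zbot": [["push", "mov", "call"], ["mov", "xor", "ret"]],
    "adload": [["jmp", "mov", "pop"]],
}
buf = io.StringIO()
write_opcode_csv(samples, buf)
print(buf.getvalue().splitlines()[1])  # → zbot,push mov call
```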
Step 2: Classic Models
With our data ready, we began experimenting with models. For our experiments, we decided to use classic machine learning models, deep learning models, and ensemble methods. Our classic models comprised a Random Forest, a K-Nearest-Neighbors model, hidden Markov models, a multilayer perceptron, and a support vector machine. We also used AdaBoost and XGBoost and grouped them with the classic models, although they can also be categorized as ensemble models. Among these classic models, we found that XGBoost and Random Forest not only performed the best, with the highest accuracy and balanced accuracy, but were also the most efficient and ran the fastest. To obtain the best version of each model, we ran hyperparameter optimization using sklearn's randomized search algorithm. Here are a few pictures of our results.
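The search described above can be sketched as follows. This is not our exact setup; the data here is synthetic and the parameter grid is illustrative, but it shows the shape of a `RandomizedSearchCV` run over a Random Forest like the one we tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the opcode feature matrix.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=10, n_classes=3, random_state=0)

# Illustrative search space; the real grid would be tuned to the dataset.
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,   # sample only 5 random configurations, not the full grid
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Randomized search samples a fixed number of configurations rather than exhausting the grid, which is what kept tuning tractable across so many model types.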
Step 3: Deep Learning
With the classic models finished, we moved on to the deep learning portion of the experiments. We decided to build a CNN and an LSTM model and find out which performed better. Below are pictures of the architectures of our CNN and LSTM. Our CNN was a 1-D CNN, unlike the standard 2-D CNN, since our data took the form of consecutive opcodes and did not possess any higher dimensionality. Our LSTM was bidirectional, which means it takes both past and future context into account when making each prediction. In the end, while the LSTM did achieve higher accuracy and balanced accuracy than the CNN, its runtime was extremely slow even on Kaggle's GPU. However, both deep learning models performed considerably better than our classic models, which shows the power of neural networks.
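To illustrate the two architecture choices, here is a minimal PyTorch sketch of a 1-D CNN and a bidirectional LSTM over an embedded opcode sequence. The vocabulary size, sequence length, and layer widths are assumptions made up for the example, not our actual hyperparameters; only the 21-family output matches the dataset described above.

```python
import torch
import torch.nn as nn

VOCAB = 64       # hypothetical opcode vocabulary size
SEQ_LEN = 100    # hypothetical fixed sequence length
N_FAMILIES = 21  # one output per malware family

class OpcodeCNN(nn.Module):
    """1-D CNN: convolves along the opcode sequence, one spatial axis."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.conv = nn.Conv1d(32, 64, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(64, N_FAMILIES)

    def forward(self, x):
        h = self.embed(x).transpose(1, 2)   # (batch, channels, seq)
        h = torch.relu(self.conv(h))
        return self.fc(self.pool(h).squeeze(-1))

class OpcodeBiLSTM(nn.Module):
    """Bidirectional LSTM: forward and backward passes both feed the head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.lstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, N_FAMILIES)  # both directions concatenated

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.fc(out[:, -1, :])       # last timestep, both directions

batch = torch.randint(0, VOCAB, (8, SEQ_LEN))
print(OpcodeCNN()(batch).shape, OpcodeBiLSTM()(batch).shape)
```

The `bidirectional=True` flag is what gives the LSTM its forward-and-backward view, and is also part of why it is slower: it runs two recurrent passes per sequence, and recurrence itself parallelizes poorly on GPUs compared to convolutions.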
Step 4: Ensemble methods
For our ensemble methods, we took the classic route of bagging, boosting, and stacking. However, our version of stacking was mostly limited to voting: we gather predictions from multiple models and take the majority prediction as the final answer. We found that all ensemble methods greatly boosted our model metrics, especially voting, and you can see the results of our ensemble methods below.
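The voting scheme described above amounts to a majority vote across models. Here is a minimal sketch; `majority_vote` and the model predictions are hypothetical names made up for illustration, but the logic matches the description.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions by majority vote.

    `predictions` is a list of prediction lists, one per model;
    ties go to whichever label Counter encountered first.
    """
    combined = []
    for labels in zip(*predictions):
        # Pick the label that most models agreed on for this sample.
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined

# Three hypothetical models predicting families for four samples.
model_a = ["zbot", "adload", "winwebsec", "zbot"]
model_b = ["zbot", "zbot", "winwebsec", "adload"]
model_c = ["adload", "adload", "winwebsec", "adload"]
print(majority_vote([model_a, model_b, model_c]))
# → ['zbot', 'adload', 'winwebsec', 'adload']
```

Because the vote only needs each model's final label, this scheme lets fast classic models and slow deep models contribute on equal footing.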