In this project, we work with real malware and cleanware samples[1] in DOS/Windows executable (application/x-ms-dos-executable) format. The distribution of samples is as follows:
Sample Type | Count | Notes |
---|---|---|
Backdoor | 3290 | |
Rootkit | 3290 | |
Trojan | 3290 | |
Virus | 3290 | |
Worm | 3290 | |
Cleanware | 3290 | Clean samples |
Total | 19740 | |
Here, we measure how accurately malware detection models identify the malware and cleanware samples, first on clean inputs and then under adversarial attack. By comparing model accuracy under these two conditions, we can see which ML model is more vulnerable to adversarial attacks, i.e. which suffers the largest drop in accuracy and increase in false positive rate.
This project is built with Python v3.10.12 and Jupyter Notebook v7.1.0, using VS Code as the editor.
Below is the list of libraries used in this project:
Libraries | Version |
---|---|
Pandas | v2.2.0 |
Seaborn | v0.13.2 |
NumPy | v1.26.4 |
Scikit-learn | v1.4.1.post1 |
Adversarial Robustness Toolbox (ART) | v1.17.1 |
Other Python packages and standard-library modules used:
Packages | Version |
---|---|
os | standard library (Python v3.10.12) |
tabulate | v0.9.0 |
warnings | standard library (Python v3.10.12) |
You can install the dependencies using this command: pip install -r requirements.txt
From the notebook folder, run the notebooks in the following order:
- data_preprocessing.ipynb
- model_training.ipynb
- evasion_attack.ipynb
To run all three notebooks together, run run_all.ipynb. This will execute the notebooks in the sequence listed above.
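As a rough illustration, run_all.ipynb could execute the notebooks in order with jupyter nbconvert; the sketch below is an assumption about how this can be done, not a copy of its actual contents.

```python
# Rough sketch (an assumption): execute each notebook in order via
# "jupyter nbconvert --execute".
import subprocess

NOTEBOOKS = [
    "data_preprocessing.ipynb",
    "model_training.ipynb",
    "evasion_attack.ipynb",
]

for nb in NOTEBOOKS:
    # --inplace re-executes the notebook and writes the outputs back into it
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
        check=True,
    )
```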
Alternatively, you can run these notebooks from the browser at http://localhost:8888/. Check the Jupyter configuration with jupyter notebook --generate-config. Refer to the quick guide The five-minute guide to setting up a Jupyter notebook server to set up and start a local Jupyter server.
Note: Since we are dealing with real malware samples, be careful when unpacking the files onto your system. Because a Windows system will not support downloading these files, we used Ubuntu v22.04.3 in a sandboxed environment.
The following steps were performed after dataset selection:
- Conversion of the DOS/Windows executables to bytes, stored in .txt format (see the sketch after this list)
- Feature extraction from the dataset
- Data exploration and validation
- Multivariate analysis
- Splitting the dataset into train, test, and cross-validation (CV) sets
- Building the confusion matrix
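A minimal sketch of the byte-conversion step, assuming each executable is read as raw bytes and written out as space-separated hex values; the folder names and the exact .txt format are assumptions, not the project's actual layout.

```python
# Minimal sketch: read each executable as raw bytes and write a space-separated
# hex representation to a .txt file. Folder names are hypothetical.
from pathlib import Path

SRC_DIR = Path("samples/raw")     # hypothetical folder of executables
DST_DIR = Path("samples/bytes")   # hypothetical output folder for .txt files
DST_DIR.mkdir(parents=True, exist_ok=True)

for exe_path in SRC_DIR.iterdir():
    if not exe_path.is_file():
        continue
    raw = exe_path.read_bytes()
    hex_tokens = " ".join(f"{b:02x}" for b in raw)
    (DST_DIR / f"{exe_path.name}.txt").write_text(hex_tokens)
```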
The next step after data preprocessing is training our models. For this experiment, we selected the following models for analysis:
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
We use the training dataset created in Part 1 to train the models listed above, as sketched below.
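A minimal sketch of this training step, assuming X_train and y_train are the feature matrix and labels produced by data_preprocessing.ipynb; the hyperparameters shown are illustrative defaults, not the project's exact settings.

```python
# Sketch of training the three classifiers on the Part 1 training split.
# X_train / y_train are assumed to come from data_preprocessing.ipynb.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: training accuracy = {model.score(X_train, y_train):.3f}")
```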
Now that our models are trained, we can perform adversarial attacks on them. We use evasion attacks, implemented with the following techniques:
- Basic Iterative Method (BIM) - Used on Logistic Regression
- Zeroth Order Optimization (ZOO) - Used on Decision Tree Classifier and Random Forest Classifier
BIM, an iterative extension of the Fast Gradient Sign Method (FGSM), is generally more effective at crafting adversarial examples than single-step FGSM. Its iterative nature allows it to find perturbations that are smaller but still effective, making the adversarial examples harder to detect or defend against.
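A hedged sketch of mounting BIM on the trained logistic regression model with ART; the variable names (lr_model, X_test, y_test), the (0, 1) clip range, and the epsilon values are assumptions.

```python
# Sketch: BIM evasion attack on the trained logistic regression model via ART.
# lr_model, X_test and y_test are assumed names from the previous steps.
import numpy as np
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import BasicIterativeMethod

art_lr = SklearnClassifier(model=lr_model, clip_values=(0.0, 1.0))

bim = BasicIterativeMethod(estimator=art_lr, eps=0.1, eps_step=0.01, max_iter=100)
X_test_adv_bim = bim.generate(x=np.asarray(X_test, dtype=np.float32))

# Accuracy on adversarial inputs (predict returns class probabilities)
adv_acc = (art_lr.predict(X_test_adv_bim).argmax(axis=1) == np.asarray(y_test)).mean()
print(f"Logistic Regression accuracy under BIM: {adv_acc:.3f}")
```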
The ZOO attack is an adversarial ML method that crafts adversarial examples without direct access to the target model's gradients. Instead of computing the gradient of the loss function, ZOO estimates it by querying the target model multiple times with slightly perturbed inputs and using the responses to infer the gradient direction, then iteratively perturbs the input to generate adversarial examples.
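A similar sketch for ZOO on the tree-based models using ART's query-based ZooAttack, shown here for the random forest; the same pattern applies to the decision tree. Variable names and hyperparameters are assumptions.

```python
# Sketch: ZOO evasion attack on the random forest via ART (no gradient access).
# rf_model, X_test and y_test are assumed names; hyperparameters are illustrative.
import numpy as np
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import ZooAttack

art_rf = SklearnClassifier(model=rf_model, clip_values=(0.0, 1.0))

zoo = ZooAttack(
    classifier=art_rf,
    confidence=0.0,
    targeted=False,
    learning_rate=1e-1,
    max_iter=20,
    binary_search_steps=10,
    initial_const=1e-3,
    nb_parallel=1,
    batch_size=1,
    variable_h=0.2,
    use_resize=False,      # tabular features, not images
    use_importance=False,
)
X_test_adv_zoo = zoo.generate(x=np.asarray(X_test, dtype=np.float32))

adv_acc = (art_rf.predict(X_test_adv_zoo).argmax(axis=1) == np.asarray(y_test)).mean()
print(f"Random Forest accuracy under ZOO: {adv_acc:.3f}")
```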
The table below shows the performance metrics of the models before the adversarial attacks:
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Logistic Regression Model | 0.380095 | 0.448211 | 0.380095 | 0.359778 |
Decision Tree Model | 0.638962 | 0.637223 | 0.638962 | 0.637871 |
Random Forest Model | 0.746691 | 0.749962 | 0.746691 | 0.747005 |
The table below shows the performance metrics of the models after the adversarial attacks:
Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Logistic Regression Model | 0.154579 | 0.137604 | 0.154579 | 0.113087 |
Decision Tree Model | 0.532292 | 0.539864 | 0.532292 | 0.534349 |
Random Forest Model | 0.665961 | 0.673741 | 0.665961 | 0.6677 |
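For reference, a sketch of how these metrics could be computed with scikit-learn. Weighted averaging is an assumption, though it is consistent with the tables above (weighted recall reduces to accuracy); y_test, rf_model, art_rf, and X_test_adv_zoo are assumed names from the sketches above.

```python
# Sketch of the metric computation; weighted averaging is an assumption.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "F1 Score": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }

print(evaluate(y_test, rf_model.predict(X_test)))                       # before attack
print(evaluate(y_test, art_rf.predict(X_test_adv_zoo).argmax(axis=1)))  # after attack
```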
[1] E. de O. Andrade, “MC-dataset-multiclass,” Mar. 2018. [Online]. Available: https://figshare.com/articles/dataset/MC-dataset-multiclass/5995468/1