39 commits
df4cf05
Create README.md
hinnazeejah Aug 16, 2024
59b5892
Update README.md
hinnazeejah Aug 16, 2024
ee277ff
Add files via upload
hinnazeejah Aug 16, 2024
18cb750
Add files via upload
hinnazeejah Aug 16, 2024
a919b3a
Add files via upload
hinnazeejah Aug 16, 2024
72734a4
Add Random Forest Model
hinnazeejah Aug 16, 2024
a8626bd
Add files via upload
hinnazeejah Aug 16, 2024
dd56f7b
Update README.md
hinnazeejah Aug 16, 2024
15867bb
Update README.md
hinnazeejah Aug 16, 2024
0ad02e8
Update README.md
hinnazeejah Aug 16, 2024
92bb482
Update README.md
hinnazeejah Aug 16, 2024
285d297
Update README.md
hinnazeejah Aug 16, 2024
8173dd0
Update README.md
hinnazeejah Aug 16, 2024
977ec60
Update README.md
hinnazeejah Aug 16, 2024
7e9e7a4
Update README.md
hinnazeejah Aug 16, 2024
6497434
Update README.md
hinnazeejah Aug 16, 2024
a5bc058
Update README.md
hinnazeejah Aug 18, 2024
1cad858
Update README.md
hinnazeejah Aug 18, 2024
7df7f56
Update README.md
hinnazeejah Aug 18, 2024
ef69d33
Update README.md
hinnazeejah Aug 18, 2024
8cdeb86
Update README.md
hinnazeejah Aug 18, 2024
c179276
Update README.md
hinnazeejah Aug 18, 2024
6506324
Update README.md
richardp23 Aug 18, 2024
95ba941
Merge pull request #1 from richardp23/patch-1
hinnazeejah Aug 18, 2024
078a8ee
Update README.md
hinnazeejah Aug 18, 2024
b45146e
Update README.md
hinnazeejah Aug 18, 2024
e67b0ca
Update README.md
hinnazeejah Aug 18, 2024
67898ee
Update README.md
hinnazeejah Aug 18, 2024
863d0c1
Update README.md
hinnazeejah Aug 18, 2024
077c505
Update README.md
hinnazeejah Aug 18, 2024
02aa39a
Update README.md
hinnazeejah Aug 18, 2024
027ba7f
Update README.md
hinnazeejah Aug 18, 2024
18a4ae7
Update README.md
hinnazeejah Aug 18, 2024
d106cc2
Update README.md
hinnazeejah Aug 18, 2024
5976dd4
Modified notebook to import libraries, unzip dataset
richardp23 Aug 18, 2024
ba026bc
Merge pull request #2 from richardp23/main
hinnazeejah Aug 18, 2024
21a4aa8
Update README.md
hinnazeejah Aug 19, 2024
6556585
Update README.md
hinnazeejah Aug 19, 2024
c970d5c
Update README.md
hinnazeejah Aug 19, 2024
108 changes: 108 additions & 0 deletions ml_dataset/README.md
# Enhanced Web Attack Detection for TANNER: A Machine Learning Approach (Google Summer of Code 2024)

The dataset used in this project is derived from web traffic data collected by SNARE sensors over several years, with annotations based on regular expressions to identify various types of web-based attacks.

This dataset includes *four* primary types of attacks, plus a class for normal traffic (Index):
- **SQL Injection (SQLi)**: Malicious SQL queries are injected into input fields to manipulate databases
- **Cross-Site Scripting (XSS)**: Attackers inject malicious scripts into web pages viewed by other users
- **Local File Inclusion (LFI)**: Exploits vulnerabilities to access files on the server
- **Remote File Inclusion (RFI)**: Allows attackers to include remote files on a web server
- **Index**: Normal, benign web requests

To make the dataset more robust and to reduce noise and class imbalance, supplementary data from external cybersecurity repositories was integrated. The dataset consists of 67 features extracted from URLs. Preprocessing included label encoding, character-level TF-IDF transformation, and cleaning to remove duplicates, handle missing values, and eliminate outliers. The dataset remains imbalanced, with a higher proportion of normal traffic, but it provides a foundation for a machine learning-based classifier intended to replace the existing regular expression-based detection in TANNER, with the goal of improved accuracy and lower latency in real-time web attack detection.



**Data Preprocessing**

To prepare the dataset for modeling, I first handled missing values: rows missing essential fields such as path and payload were removed, and gaps in less critical columns were imputed with the column average or the most frequent entry. Duplicate records were removed, and outliers were either log-scaled or dropped when they were deemed erroneous. The target labels, representing different types of web traffic and attacks (SQLi, XSS, LFI, RFI, and normal traffic), were converted to integers through label encoding so that the models could consume them.

A significant preprocessing step was applying TF-IDF (Term Frequency-Inverse Document Frequency) at the character level to the path and payload columns. Character-level n-grams capture patterns within URLs and payloads, such as traversal sequences or script tags, that word-level tokens would miss. Normalization and standardization were applied where needed to put features on a consistent scale, particularly features with a wide range of values.
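
As an illustration of the character-level TF-IDF step, here is a minimal sketch using scikit-learn. The sample rows, the n-gram range `(1, 3)`, and the choice of one vectorizer per column are assumptions made for the example; the project's notebook may configure this differently.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the dataset's path and payload columns (hypothetical values).
paths = ["/index.html", "/item.php", "/view.php", "/search"]
payloads = [
    "",
    "id=1' OR '1'='1",
    "file=../../etc/passwd",
    "q=<script>alert(1)</script>",
]

# analyzer="char" builds character n-grams instead of word tokens.
# The n-gram range (1, 3) is an assumption, not a value taken from the project.
path_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
payload_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))

X_path = path_vec.fit_transform(paths)
X_payload = payload_vec.fit_transform(payloads)

# Stack the two sparse matrices side by side into a single feature matrix.
X = hstack([X_path, X_payload]).tocsr()
print(X.shape)
```

Fitting a separate vectorizer per column keeps path n-grams and payload n-grams in distinct feature ranges, so the model can weigh the same character sequence differently depending on where it appears.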



**Modeling Approach**

I tested several machine learning models on the dataset to determine which performed best for web attack detection: Random Forest, XGBoost, a multi-layer dense neural network, LightGBM, and a Decision Tree. Each model was tuned with its own parameters, and performance was measured using accuracy, F1 score, log loss, and a confusion matrix.

Among the models tested, Random Forest performed best, with an accuracy of **94.48%**, an F1 score of **94.18%**, and a log loss of 0.1312. It outperformed the other models in both accuracy and log loss, making it the preferred choice for deployment. Its confusion matrix also showed the best balance in correctly identifying the different types of web traffic and attacks.
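
The four metrics above can be computed with scikit-learn as in the sketch below. It runs on synthetic data, not the project's dataset, and the hyperparameters and the `weighted` F1 averaging are assumptions, since the averaging mode is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class data as a stand-in for the real feature matrix.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hyperparameters here are illustrative, not the tuned values from the project.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)          # hard labels for accuracy / F1
y_proba = model.predict_proba(X_test)   # class probabilities for log loss

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    # "weighted" averaging is an assumption; it accounts for class imbalance.
    "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
    "log_loss": log_loss(y_test, y_proba, labels=model.classes_),
}
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted

print(metrics)
print(cm)
```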

<img width="489" alt="Screenshot 2024-08-19 at 12 29 18 PM" src="https://github.com/user-attachments/assets/ae034a95-a835-4a0e-8515-53b2de9f0dc4">



**Comparison with Baseline (RegEx)**\
<img width="609" alt="Screenshot 2024-08-19 at 6 03 34 PM" src="https://github.com/user-attachments/assets/9ebf82b3-d4a0-4b43-8f72-e46b37fbe5c1">





- Number of Samples (**Entire Dataset**): 149,488
- Number of Samples (**Train Data**): 104,641
- Number of Samples (**Test Data**): 44,847
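
These counts are consistent with a 70/30 train/test split. The ratio is not stated here, so `test_fraction = 0.3` below is an assumption; the sketch rounds the test set up, which is how scikit-learn's `train_test_split` handles a fractional `test_size`.

```python
import math

total = 149_488
test_fraction = 0.3  # assumed; only the resulting counts are reported

n_test = math.ceil(total * test_fraction)  # round the test set up
n_train = total - n_test                   # the remainder goes to training

print(n_train, n_test)  # 104641 44847
```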


Class Distribution Guide:

- Label 0: Index
- Label 1: LFI
- Label 2: RFI
- Label 3: SQLi
- Label 4: Unknown
- Label 5: XSS
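
The numbering above matches what an alphabetical label encoding would produce (scikit-learn's `LabelEncoder`, for example, sorts class names before assigning integers). That the project used exactly this encoder is an assumption; the sketch below reproduces the mapping with plain Python.

```python
# Class names in arbitrary order; sorting them yields the label IDs listed above.
class_names = ["SQLi", "XSS", "LFI", "RFI", "Index", "Unknown"]

label_to_id = {name: i for i, name in enumerate(sorted(class_names))}
id_to_label = {i: name for name, i in label_to_id.items()}

print(label_to_id)
# {'Index': 0, 'LFI': 1, 'RFI': 2, 'SQLi': 3, 'Unknown': 4, 'XSS': 5}
```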

**Entire Dataset**

Class Distribution:\
<img width="192" alt="Screenshot 2024-08-16 at 12 51 11 PM" src="https://github.com/user-attachments/assets/f7a3737e-0a3e-42ad-a701-317c243da1a5">

Statistics:
[full_data_statistics.csv](https://github.com/user-attachments/files/16646880/full_data_statistics.csv)


**Train Set**

Class Distribution:\
<img width="220" alt="train data label dist" src="https://github.com/user-attachments/assets/5534f301-b193-43dd-b691-71d012d6455f">


Statistics:
[train_data_statistics.csv](https://github.com/user-attachments/files/16646855/train_data_statistics.csv)

**Test Set**

Class Distribution:\
<img width="203" alt="test data label dist" src="https://github.com/user-attachments/assets/af8b7a65-972b-4532-8a31-9c46135a86d8">

Statistics:
[test_data_statistics.csv](https://github.com/user-attachments/files/16646870/test_data_statistics.csv)

**Future Work**


The Random Forest model's improvement in detecting web-based attacks highlights its potential for enhancing the TANNER system. There are several ways the model could be integrated into TANNER and extended.

*Microservice*: The Random Forest model can be deployed as a microservice within TANNER's existing architecture. This approach would allow the ML model to analyze incoming web traffic in real-time, alongside or in place of the current regex-based detection. By using a microservice architecture, the model can be updated or replaced independently, ensuring that TANNER remains adaptable to new threats.

*Real-Time Processing & Efficiency*: To ensure that the ML model can handle real-time traffic efficiently, optimizations such as model compression or the use of more lightweight models could be explored. Integrating the model with TANNER’s existing caching mechanisms could also reduce the computational load and improve response times.
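
To illustrate the caching idea, the sketch below memoizes predictions so that repeated requests skip the model. `functools.lru_cache` stands in for TANNER's caching layer, and `classify_uncached` is a hypothetical placeholder for the model call; neither reflects TANNER's actual code.

```python
from functools import lru_cache

# Counts how often the (expensive) model is actually invoked.
calls = {"model": 0}


def classify_uncached(path: str) -> str:
    """Hypothetical stand-in for an expensive model prediction."""
    calls["model"] += 1
    return "LFI" if "../" in path else "Index"


@lru_cache(maxsize=10_000)
def classify(path: str) -> str:
    """Cached wrapper: identical paths are classified only once."""
    return classify_uncached(path)


requests = [
    "/index.html",
    "/index.html",
    "/view?file=../../etc/passwd",
    "/index.html",
]
results = [classify(p) for p in requests]

print(results)
print(calls["model"])  # 2: only the two distinct paths reached the model
```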

*Continuous Learning & Adaptation*: One of the strengths of ML models is their ability to learn from new data. Future work could focus on implementing a pipeline that retrains the Random Forest model periodically on new attack data, so that it stays up to date with the latest threat patterns. A hybrid approach that combines regex-based and ML-based detection could also be explored to leverage the strengths of both methods.
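
One possible shape for the hybrid approach is to let high-precision regex rules decide first and fall back to the model otherwise. The patterns and the `toy_model` function below are hypothetical placeholders, not TANNER's rules or the trained classifier.

```python
import re
from typing import Callable, Optional

# Hypothetical high-precision patterns; these are not TANNER's actual rules.
REGEX_RULES = {
    "LFI": re.compile(r"\.\./"),
    "XSS": re.compile(r"<script\b", re.IGNORECASE),
}


def regex_detect(request: str) -> Optional[str]:
    """Return the first matching attack label, or None if no rule fires."""
    for label, pattern in REGEX_RULES.items():
        if pattern.search(request):
            return label
    return None


def hybrid_detect(request: str, ml_predict: Callable[[str], str]) -> str:
    """Use the regex verdict when a rule matches, otherwise ask the model."""
    label = regex_detect(request)
    return label if label is not None else ml_predict(request)


def toy_model(request: str) -> str:
    """Placeholder for the trained classifier's prediction function."""
    return "SQLi" if "'" in request else "Index"


print(hybrid_detect("/view?file=../../etc/passwd", toy_model))
print(hybrid_detect("/item?id=1' OR '1'='1", toy_model))
print(hybrid_detect("/index.html", toy_model))
```

Putting the regex pass first keeps the cheap, explainable checks on the fast path; the reverse order (model first, regex as a safety net) is an equally valid design.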

*Neural Network Approaches*: Transformer-based neural networks, known for their ability to capture complex patterns in sequential data, could be employed to further improve detection accuracy. These models, particularly those fine-tuned for cybersecurity, could be integrated into TANNER for more sophisticated threat analysis.

*Ensemble Learning*: Combining the Random Forest model with other models, including neural networks, through ensemble learning could lead to even higher detection rates. By leveraging the strengths of multiple algorithms, TANNER could achieve a more robust and adaptable defense against evolving web threats.
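
A minimal sketch of such an ensemble, assuming soft voting over a Random Forest and a small neural network with scikit-learn's `VotingClassifier`. The data is synthetic and every hyperparameter is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic 3-class data as a stand-in for the real feature matrix.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

proba = ensemble.predict_proba(X_test)
accuracy = ensemble.score(X_test, y_test)
print(proba.shape, accuracy)
```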

**External Datasets Used**

- [XSS Payload List - Cross Site Scripting Vulnerability Payload List](https://www.kitploit.com/2018/05/xss-payload-list-cross-site-scripting.html)
- [Deep-XSS](https://github.com/das-lab/deep-xss/blob/master/xssed.csvl)
- [RFI Payloads](https://github.com/infosec-au/fuzzdb/blob/master/attack-payloads/rfi/rfi.txt)
- [XSS-attack-detection](https://github.com/shalayiding/XSS-attack-detection-using-deep-learning/blob/main/XSS_dataset_mixed.csv)
- [Cross site scripting XSS dataset for Deep learning](https://www.kaggle.com/datasets/syedsaqlainhussain/cross-site-scripting-xss-dataset-for-deep-learning)
- [XSS Payload List](https://github.com/payloadbox/xss-payload-list/blob/master/Intruder/xss-payload-list.txt)
- [LFI Payloads](https://raw.githubusercontent.com/emadshanab/LFI-Payload-List/master/LFI%20payloads.txt)
- [SQL Injection Dataset](https://www.kaggle.com/datasets/sajid576/sql-injection-dataset)
- [sql injection dataset](https://www.kaggle.com/datasets/syedsaqlainhussain/sql-injection-dataset)
- [SQL-Injection-Detection](https://github.com/saptajitbanerjee/SQL-Injection-Detection)
- [Malicious And Benign URLs](https://www.kaggle.com/datasets/siddharthkumar25/malicious-and-benign-urls)