ContextCleanse - Email Intelligence

🚀 Project Overview

ContextCleanse is an email spam detection system that combines traditional machine learning with Reinforcement Learning (RL) techniques and adds an LLM assistant with RAG Pipeline on top of it. The system features seven different models, including the XGBoost + RL model, which continuously learns and improves from user feedback. Users can ask about the content of their Google emails or instruct the assistant to extract certain information from their emails.

Key Features

🤖 Assistant: LLM hosted with Ollama featuring RAG pipeline and streaming responses
🧠 Reinforcement Learning: Deep Q-Learning + Policy Gradient with real user feedback
📊 7 ML Models: Complete implementation with actual performance metrics
🔍 Semantic Search: Vector embeddings for intelligent email retrieval
⚡ Real-time Learning: 11+ real user feedback samples processed
📧 Gmail Integration: OAuth2 authentication for real email processing
🎯 High Accuracy: 95.0% F1-Score with XGBoost + RL model
🔄 Auto-Training: LOOCV and 5-fold cross-validation support
💨 State Preservation: Single Page App with zero reload times

🛠️ Prerequisites

Before setting up ContextCleanse, ensure you have:

Git - For version control
Docker & Docker Compose - For containerized deployment
Node.js 18+ - For frontend development
Python 3.11+ - For backend development
Ollama - For hosting LLMs

📁 Project Architecture

ContextCleanse/
├── 🖥️  frontend/                 # Next.js React SPA Frontend
│   ├── app/                      # App Router (Next.js 14)
│   │   ├── api/                  # API Routes
│   │   │   ├── classify-email/   # Email classification endpoint
│   │   │   ├── assistant/        # Assistant API endpoints
│   │   │   │   ├── chat/         # Ollama chat interface
│   │   │   │   ├── embeddings/   # Vector embedding generation
│   │   │   │   ├── models/       # Model management
│   │   │   │   ├── stream/       # Streaming responses
│   │   │   │   └── vector-db/    # In-memory vector database
│   │   │   ├── feedback/         # User feedback collection
│   │   │   ├── emails/           # Gmail API integration
│   │   │   └── reinforcement-learning/ # RL optimization endpoint
│   │   ├── components/           # Reusable UI components
│   │   │   ├── Sidebar.tsx      # State-based navigation
│   │   │   ├── NotificationSidebar.tsx # Unified event notifications
│   │   │   ├── OllamaSetup.tsx   # Ollama configuration
│   │   │   └── OllamaModelManager.tsx # Model management
│   │   ├── contexts/            # React Context providers
│   │   │   ├── NotificationContext.tsx # Global notification state
│   │   │   ├── AppNavigationContext.tsx # State-based routing
│   │   │   ├── PageLoadingContext.tsx # Loading state management
│   │   │   └── BackgroundInitializationContext.tsx # Background startup
│   │   ├── main/                # Single Page Application entry
│   │   ├── dashboard/           # Main dashboard interface
│   │   ├── assistant/           # Assistant with streaming RAG
│   │   ├── training/            # Model training with LOOCV
│   │   └── settings/            # Comprehensive settings
│   └── lib/                     # Utility libraries
│       ├── auth.ts              # NextAuth.js configuration
│       ├── gmail.ts             # Gmail API service
│       └── models.ts            # Model definitions and utilities
│
├── ⚙️  backend/                  # FastAPI Python Backend
│   ├── app/                     # Application core
│   │   ├── api/v1/endpoints/    # API endpoints
│   │   │   ├── spam.py          # Spam detection API
│   │   │   └── feedback.py      # RL feedback processing
│   │   ├── services/            # Business logic services
│   │   │   ├── ml_service.py    # ML model management + RL algorithms
│   │   │   ├── auth_service.py  # Authentication service
│   │   │   └── oauth_service.py # OAuth integration
│   │   ├── models/              # Database models
│   │   ├── schemas/             # Pydantic data schemas
│   │   └── core/                # Core configuration
│   │       ├── config.py        # Application settings
│   │       └── database.py      # Database connection
│   └── data/                    # ML training data
│
├── 📊 data/                     # Datasets and training data
│   ├── spambase/               # UCI Spambase dataset
│   │   ├── spambase.data       # Raw feature data (4,601 emails)
│   │   ├── spambase.names      # Feature descriptions
│   │   └── spambase.DOCUMENTATION # Dataset documentation
│   ├── ml_training/            # Training results and feedback
│   │   ├── training_results.json # Actual model performance metrics
│   │   └── user_feedback.json  # Real user feedback data (11+ samples)
│   └── Final Report.txt        # Comprehensive project report
│
├── 🗄️  database/               # Database initialization
│   └── init/                   # SQL initialization scripts
│
├── 📚 docs/                    # Documentation
│   ├── development.md          # Development setup guide
│   ├── ml-backend-integration.md # ML integration guide
│   └── oauth-setup-guide.md   # OAuth configuration
│
├── 🐳 docker-compose.yml       # Docker orchestration
├── 📋 .env                     # Environment configuration
└── 🧪 tests/                   # Test suites

🤖 Machine Learning Models

1. Traditional ML Models (6 Models)

Model	Algorithm	F1-Score	Use Case
Logistic Regression	Linear classification	88.6%	Fast, interpretable baseline
XGBoost	Gradient boosting trees	92.0%	High-performance ensemble
Naive Bayes	Probabilistic classifier	87.8%	Fast training, good for text
Neural Network (MLP)	Multi-layer perceptron	90.1%	Non-linear pattern detection
SVM	Support vector machines	89.1%	Strong generalization
Random Forest	Ensemble of decision trees	91.3%	Robust, handles overfitting

2. Advanced RL Model (3 Models)

🏆 XGBoost + RL (Default Best Model)

Base Algorithm: XGBoost trained on UCI Spambase dataset (4,601 emails)
RL Enhancement: Deep Q-Learning + Policy Gradient from user feedback
F1-Score: 95.0% (1.6% improvement over base XGBoost)
Learning Method: Continuous adaptation through user feedback on latest emails
Default Usage: Automatically selected for all email classifications
Training Source: UCI Spambase + Real user feedback from Gmail integration

🔬 Reinforcement Learning Implementation

RL Algorithms Used

1. Deep Q-Learning

# Q-Learning Update Rule
Q(s,a) = Q(s,a) + α[r + γ*max(Q(s',a')) - Q(s,a)]

# Implementation Features:
- State: 8-dimensional email feature vector
- Actions: {spam, ham} classification
- Reward: +1 (correct), -1 (incorrect)
- Q-Table: Discretized state-action values

2. Policy Gradient (REINFORCE)

# Policy Network Architecture
Input Layer (8 features) → Hidden Layer (16 neurons) → Output Layer (2 classes)

# Policy Update:
∇J(θ) = E[∇log π(a|s) * A(s,a)]

# Implementation Features:
- Neural network policy with tanh activation
- Advantage estimation using rewards
- Backpropagation through policy network

3. Experience Replay

# Experience Buffer
experience = {
    "state": email_features,
    "action": predicted_class,
    "reward": user_feedback,
    "next_state": updated_features
}

# Mini-batch Learning:
- Buffer size: 1000 experiences
- Batch size: 8 experiences
- Random sampling for stability

RL State Representation

The RL system converts email content into an 8-dimensional state vector:

Feature	Description	Range
`length_norm`	Normalized email length	[0, 1]
`word_density`	Words per character ratio	[0, 1]
`uppercase_ratio`	Uppercase character ratio	[0, 1]
`punctuation_ratio`	Punctuation density	[0, 1]
`url_density`	Number of URLs	[0, 1]
`email_density`	Number of email addresses	[0, 1]
`spam_words`	Presence of spam keywords	{0, 1}
`urgent_words`	Presence of urgent keywords	{0, 1}

📊 Dataset and Training

UCI Spambase Dataset

Source: University of California, Irvine ML Repository
Size: 4,601 email instances
Features: 57 numerical attributes
Classes: Spam (39.4%) vs Ham (60.6%)
Features Include:
- Word frequency percentages
- Character frequency percentages
- Capital letter statistics
- Length statistics

Training Process

Data Loading: Load UCI Spambase dataset from data/spambase/spambase.data
Cross-Validation: LOOCV (4,601 iterations) or 5-fold stratified cross-validation
Model Training: Train all 7 models with real performance tracking
Performance Evaluation: Calculate accuracy, precision, recall, F1-score with CV
RL Enhancement: Apply reinforcement learning with real user feedback (11+ samples)
Model Comparison: Rank models by actual F1-score performance
State Preservation: Background training with progress persistence

🔄 Real-time Learning Workflow

graph TD
    A[User Reviews Email] --> B[Provides Feedback]
    B --> C{Feedback Type}
    C -->|Correct| D[Reward = +1]
    C -->|Incorrect| E[Reward = -1]
    D --> F[Extract State Features]
    E --> F
    F --> G[Q-Learning Update]
    G --> H[Policy Gradient Update]
    H --> I[Experience Replay]
    I --> J[Update XGBoost + RL Model]
    J --> K[Improved Predictions]
    K --> A

🏗️ Technical Stack

Frontend Technologies

Framework: Next.js 14 (App Router) with Single Page Application architecture
Language: TypeScript
Styling: Tailwind CSS with custom scrollbar styling
UI Components: Lucide React icons
Authentication: NextAuth.js with Google OAuth and session management
State Management: React Context API with AppNavigationContext
HTTP Client: Native Fetch API with streaming support
Navigation: State-based routing with zero reload times

Backend Technologies

Framework: FastAPI (Python 3.9+)
ML Libraries:
- XGBoost 1.7+
- Scikit-learn 1.3+
- NumPy 1.24+
- Pandas 2.0+
Database: PostgreSQL with pgvector
Caching: Redis
Authentication: JWT tokens
Logging: Loguru

Infrastructure

Containerization: Docker + Docker Compose
Database: PostgreSQL 16 with vector extension
Caching: Redis 7
Networking: Custom Docker network
Environment: Production-ready configuration

🚀 Getting Started

Prerequisites

Docker & Docker Compose
Node.js 18+ (for local development)
Python 3.9+ (for local development)
Gmail API credentials (for email integration)

Quick Start

# 1. Clone the repository
git clone <repository-url>
cd ContextCleanse

# 2. Upgrade npm (Required)
npm install -g npm@latest
npm --version

# 3. Verify npm version in Docker
./scripts/verify-npm-version.sh

# 4. Set up environment variables
cp .env.example .env
# Edit .env with your configurations

# 5. Start the application
docker-compose up -d

# 6. Access the application
# Frontend: http://localhost:3000
# Backend API: http://localhost:8000
# API Documentation: http://localhost:8000/docs

Gmail Integration Setup

Go to Google Cloud Console
Create a new project or select an existing one
Enable Gmail API
Create OAuth 2.0 credentials
Add credentials to .env file
Configure authorized redirect URIs

📈 Performance Metrics

Model Comparison Results ⭐ XGBoost + RL is the Default Best Model

Model	Accuracy	Precision	Recall	F1-Score	Training Time	Status
🏆 XGBoost + RL	95.0%	95.0%	95.0%	95.0%	4.8s	Default Best
XGBoost	94.8%	93.2%	93.7%	93.4%	4.1s	Base Model
Random Forest	94.5%	94.8%	90.9%	92.8%	5.2s	Strong Alternative
Neural Network (MLP)	92.8%	90.5%	91.5%	91.0%	8.7s	Deep Learning
Logistic Regression	92.8%	91.8%	89.8%	90.8%	2.3s	Fast Baseline
Naive Bayes	77.6%	72.0%	70.8%	71.4%	1.2s	Probabilistic
SVM	70.6%	68.7%	46.6%	55.5%	3.8s	Support Vector

🎯 Why XGBoost + RL is Always the Best Choice:

UCI Spambase Training: Trained on the gold-standard 4,601 email dataset

Continuous Learning: Improves with every user feedback through reinforcement learning

Highest Base Performance: 95.0% F1-Score out of the box

Real-time Adaptation: Learns user preferences and email patterns

Production Ready: Handles both known spam patterns and evolving threats

Reinforcement Learning Metrics

Q-Learning Convergence: ~50 feedback samples
Policy Gradient Improvement: 1.3% F1-score gain
Experience Replay Efficiency: 85% sample reuse
User Adaptation Rate: Real-time (< 3 feedback samples)

🔧 API Endpoints

ML Backend API

# Spam Classification
POST /api/v1/spam/check
Content-Type: application/json
{
    "content": "Email content...",
    "sender": "[email protected]",
    "subject": "Email subject"
}

# Model Training
POST /api/v1/feedback/models/train
{
    "model_name": "xgboost_rl",
    "k_folds": 5,
    "use_rl_enhancement": true
}

# RL Optimization
POST /api/v1/feedback/reinforcement-learning/optimize
{
    "feedback_data": {...},
    "optimization_config": {
        "algorithm": "deep_q_learning",
        "learning_rate": 0.001,
        "exploration_rate": 0.1
    }
}

# Model Comparison
GET /api/v1/feedback/models/compare

Frontend API Routes

# Email Classification
POST /api/classify-email

# User Feedback
POST /api/feedback

# Email Sync
GET /api/emails?limit=100

# RL Optimization
POST /api/reinforcement-learning

🧪 Testing and Validation

Model Validation

Cross-Validation: 5-fold stratified CV
Hold-out Testing: 20% test set
Performance Metrics: Accuracy, Precision, Recall, F1-Score
Statistical Significance: McNemar's test for model comparison

RL Validation

A/B Testing: RL-enhanced vs base models
Online Learning: Continuous performance monitoring
Convergence Analysis: Q-value and policy convergence tracking
User Study: Real user feedback collection and analysis

🤝 Contributing

We welcome contributions! Please see our contributing guidelines:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

🤖 Assistant with RAG Pipeline

Powered by Llama 3.1 8B

The Assistant feature integrates Ollama with Llama 3.1 8B model and a custom RAG (Retrieval-Augmented Generation) pipeline to provide intelligent, context-aware query answering based on your email data.

🚀 Quick Setup

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull Llama 3:8B model (recommended for WSL)
ollama pull llama3:8b

# 3. Start Ollama service
ollama serve

# 4. For WSL users (automated in Docker)
export OLLAMA_HOST=0.0.0.0:11434

🔍 RAG Pipeline Features

Vector Embeddings: 384-dimensional email content vectors
Semantic Search: Find relevant emails by meaning, not just keywords
Real-time Updates: Automatically refreshes as new emails arrive
Context-Aware Responses: Uses 3-5 most relevant emails to inform answers
Source Attribution: Shows which emails contributed to each response

📊 Example Queries

"Show me all emails from IBM this month"
"What are the main topics in my recent newsletters?"
"Find emails about project deadlines"
"Analyze my email patterns for productivity insights"
"Which senders email me most frequently?"

⚙️ Technical Architecture

Local Processing: All data stays on your system (privacy-first)
In-Memory Vector DB: Fast semantic search through email embeddings
Cosine Similarity: Precise content matching for retrieval
Hybrid Search: Combines semantic and metadata filtering
Session Management: Secure, user-specific data isolation

📈 Performance

Response Time: 3-15 seconds end-to-end with streaming
Memory Usage: 4-8GB for llama3:8b model
Embedding Speed: ~10ms per email
Search Speed: ~50ms for 200 emails
Navigation Speed: <100ms with state preservation
Classification Speed: ~150ms per email

For detailed setup instructions, see docs/assistant-rag-setup.md

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

UCI Machine Learning Repository for the Spambase dataset
XGBoost Community for the gradient boosting implementation
OpenAI for reinforcement learning research and methodologies
FastAPI Team for the web framework
Next.js Team for the React-based frontend framework

📞 Support

Documentation: Check the /docs folder for detailed guides
Issues: Report bugs via GitHub Issues
Discussions: Join our community discussions
Email: Contact the development team

Built with ❤️ using Machine Learning, Reinforcement Learning, and Modern Web Technologies

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
backend		backend
configs		configs
data		data
database/init		database/init
docs		docs
frontend		frontend
nginx		nginx
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml

1ncompleteness/CC

Folders and files

Latest commit

History

Repository files navigation