A high-performance distributed KV cache system for LLM inference, optimized for cloud deployment on Google Kubernetes Engine (GKE).
This system accelerates LLM inference by distributing the key-value (KV) cache across multiple worker nodes and using consistent hashing to preserve cache locality. It provides:
- 10-50% faster inference through intelligent KV cache reuse (sketched after this list)
- Horizontal scalability with automatic worker routing
- GPU acceleration support for production workloads
- Production-ready Kubernetes deployment on GKE
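The reuse idea can be pictured with a small sketch. The snippet below is a hypothetical illustration, not the project's actual worker code: a prefix-keyed KV cache finds the longest previously seen token prefix, so only the new suffix needs recomputation.

```python
# Hypothetical sketch of prefix-based KV cache reuse (illustrative only).
import hashlib


class PrefixKVCache:
    """Maps hashes of token prefixes to opaque KV state for that prefix."""

    def __init__(self) -> None:
        self._cache: dict[str, object] = {}

    @staticmethod
    def _key(tokens: list[int]) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def put(self, tokens: list[int], kv_state: object) -> None:
        self._cache[self._key(tokens)] = kv_state

    def longest_prefix(self, tokens: list[int]) -> tuple[list[int], object | None]:
        # Walk from the full sequence down to length 1; the first hit is the
        # longest cached prefix, so inference only recomputes the remainder.
        for end in range(len(tokens), 0, -1):
            kv = self._cache.get(self._key(tokens[:end]))
            if kv is not None:
                return tokens[:end], kv
        return [], None
```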
Components:
- Gateway: FastAPI service that handles external requests and routes them via consistent hashing
- Coordinator: Manages the worker registry and the sequence-to-worker mapping (see the ring sketch after this list)
- Workers: StatefulSet pods running inference with distributed KV cache
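The routing layer can be illustrated with a minimal consistent-hash ring. This is an assumed sketch of the technique, not the coordinator's actual code; `HashRing`, the virtual-node count, and the worker names are made up for illustration. Virtual nodes smooth the load across workers, and adding or removing a worker only remaps the keys adjacent to it on the ring.

```python
# Illustrative consistent-hashing ring for sequence-to-worker mapping.
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, workers: list[str], vnodes: int = 100):
        # Each worker gets many virtual nodes for smoother load balancing.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def route(self, sequence_id: str) -> str:
        # The first virtual node clockwise from the key's hash owns it.
        idx = bisect.bisect(self._keys, _hash(sequence_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["worker-0", "worker-1", "worker-2"])
print(ring.route("session-42"))  # The same ID always maps to the same worker.
```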
For a consistent, isolated development environment with all deployment tools pre-installed:
1. Open in Dev Container
   - Install the "Dev Containers" extension in VS Code
   - Press F1 → "Dev Containers: Reopen in Container"
   - Wait for the container to build (2-5 minutes the first time)
2. Deploy to GKE

   ```bash
   # Authenticate with GCP
   gcloud auth login
   gcloud auth application-default login

   # Run automated deployment
   ./scripts/quickstart_gke.sh
   ```
3. Optional Docker configuration, if image builds fail from inside the dev container:

   ```bash
   sudo chown root:docker /var/run/docker.sock
   ```
The dev container includes: gcloud, terraform, kubectl, docker, uv, and all necessary tools.
If you have gcloud, terraform, and kubectl installed locally:
```bash
# One-command deployment
./scripts/quickstart_gke.sh
```

This interactive script will:
- Configure your GCP project
- Enable required APIs
- Create GKE cluster with Terraform
- Build and push Docker images
- Deploy all services
```bash
# Prerequisites: gcloud, terraform, kubectl installed
# (Or use dev container - see Option 1)
# 1. Configure GCP
export GCP_PROJECT_ID="your-project-id"
gcloud config set project $GCP_PROJECT_ID
# 2. Create infrastructure
cd infra
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings
terraform init
terraform apply
# 3. Build images
cd ..
./scripts/build_images.sh
# 4. Deploy to GKE
./scripts/deploy_gke.sh
```

See infra/DEPLOYMENT.md for detailed instructions.
```bash
# Run locally with Docker Compose (dev container or local)
./scripts/local_dev.sh
```

The dev container provides an isolated environment with all tools pre-installed:
- Cloud Tools: gcloud CLI, kubectl, Terraform
- Python Tools: Python 3.14, uv package manager
- Container Tools: Docker CLI for building images
- VS Code Extensions: Python, Kubernetes, Terraform, Cloud Code
Get Started:
- Install "Dev Containers" extension in VS Code
- Open project and select "Reopen in Container"
- All deployment scripts work out of the box!
See .devcontainer/README.md for full documentation.
The project includes comprehensive test suites:
```bash
cd tests
# Install dependencies
uv sync
# Run cache performance tests
uv run pytest test_cache_performance.py -v -s
# Run stress tests
uv run pytest test_stress.py -v -s
# Run all tests
uv run pytest -v -s
```

Test categories:

- Routing Distribution - Validates consistent hashing and load balancing
- Cache Locality - Verifies KV cache append and reuse behavior
- Generation Flow - End-to-end inference with streaming
- Cache Performance - Measures speedup with cache vs. no cache (a minimal example follows this list)
- Stress Testing - High concurrency, sustained load, burst traffic
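For a flavor of the cache-performance check, here is a hypothetical pytest sketch; the real tests in tests/ may measure differently. The `GATEWAY_URL` environment variable and the /generate payload are assumptions based on the API example later in this README.

```python
# Hypothetical sketch in the spirit of test_cache_performance.py.
import os
import time

import requests

GATEWAY_URL = os.environ.get("GATEWAY_URL", "http://localhost:8000")
PAYLOAD = {"prompt": "The future of AI is", "max_tokens": 20, "model_name": "gpt2"}


def timed_generate() -> float:
    start = time.perf_counter()
    resp = requests.post(f"{GATEWAY_URL}/generate", json=PAYLOAD, timeout=60)
    resp.raise_for_status()
    return time.perf_counter() - start


def test_cached_request_is_not_slower():
    cold = timed_generate()  # First call populates the distributed KV cache.
    warm = timed_generate()  # The same prompt should reuse the cached prefix.
    # Allow 10% jitter; the claimed speedup on cache hits is 10-50%.
    assert warm <= cold * 1.1
```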
```
distributed-kv-cache/
├── infra/ # Terraform infrastructure
│ ├── main.tf # GKE cluster configuration
│ ├── variables.tf # Infrastructure variables
│ └── DEPLOYMENT.md # Detailed deployment guide
├── k8s/ # Kubernetes manifests
│ ├── coordinator.yaml # Coordinator deployment
│ ├── gateway.yaml # Gateway + LoadBalancer + HPA
│ ├── worker.yaml # Worker StatefulSet
│ └── namespace.yaml # ConfigMap and namespace
├── services/ # Microservices
│ ├── coordinator/ # Consistent hashing coordinator
│ ├── gateway/ # API gateway
│ └── worker/ # Inference worker with KV cache
├── scripts/ # Deployment scripts
│ ├── quickstart_gke.sh # One-command deployment
│ ├── build_images.sh # Build and push to GCR
│ ├── deploy_gke.sh # Deploy to GKE
│ └── local_dev.sh # Local development
└── tests/ # Comprehensive test suite
    ├── test_cache_performance.py  # Cache vs no-cache
    ├── test_stress.py             # Load testing
    └── ...
```
Edit infra/terraform.tfvars:
```hcl
project_id = "your-gcp-project-id"
region = "us-central1"
cluster_name = "distributed-kv-cache"
# Gateway autoscaling
gateway_min_nodes = 1
gateway_max_nodes = 5
# Worker configuration
worker_node_count = 3
worker_min_nodes = 2
worker_max_nodes = 10
# GPU settings (set enable_gpu=false for CPU-only)
enable_gpu = true
worker_machine_type = "n1-standard-8"
gpu_type = "nvidia-tesla-t4"
gpu_count = 1
```

Resource sizing:
- Gateway: 2 vCPU, 4GB RAM (autoscales 2-10 pods)
- Coordinator: 1 vCPU, 2GB RAM (1 pod)
- Worker: 4-8 vCPU, 16-24GB RAM, optional GPU (autoscales 2-10 pods)
```bash
# View pod status
kubectl get pods
# Check autoscaling
kubectl get hpa
# View logs
kubectl logs -f deployment/gateway
kubectl logs -f statefulset/worker
# Get cluster stats
GATEWAY_IP=$(kubectl get svc gateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl http://$GATEWAY_IP/stats
```

```bash
# Get gateway IP
GATEWAY_IP=$(kubectl get svc gateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Health check
curl http://$GATEWAY_IP/health
# Generate text
curl -X POST http://$GATEWAY_IP/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
"max_tokens": 20,
"temperature": 0.7,
"model_name": "gpt2"
}'
```
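For programmatic access, a minimal Python client might look like the sketch below. The request body mirrors the curl example above; the response schema is an assumption, so inspect `resp.json()` for your deployment.

```python
# Minimal Python client for the /generate endpoint shown above.
import requests

GATEWAY_IP = "203.0.113.10"  # Placeholder: use the LoadBalancer IP from kubectl.

resp = requests.post(
    f"http://{GATEWAY_IP}/generate",
    json={
        "prompt": "The future of AI is",
        "max_tokens": 20,
        "temperature": 0.7,
        "model_name": "gpt2",
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # Response schema is an assumption; inspect it directly.
```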
Check statistics:

```bash
curl http://$GATEWAY_IP/stats | jq
```

Security:
- Workload Identity for GCP service access
- Private GKE cluster option available
- Firewall rules for internal communication
- SSL/TLS termination at load balancer (configure separately)
```bash
# Delete Kubernetes resources
kubectl delete -f k8s/
# Destroy infrastructure
cd infra
terraform destroy
```

Documentation:
- Technical Report - Academic paper detailing system design, implementation, and evaluation
- Architecture Diagrams - System architecture and design
- Deployment Guide - Complete GKE deployment walkthrough
- Test Suite - Testing documentation
This is a university project for COSC 6376 - Cloud Computing.
Educational use only - University of Houston, Fall 2025