Distributed Inference Key-Value Cache

Minh Nguyen, Zhenghui Gui

A high-performance distributed KV cache system for LLM inference, optimized for cloud deployment on Google Kubernetes Engine (GKE).

Overview

This system accelerates LLM inference by distributing the key-value (KV) cache across multiple worker nodes, using consistent hashing for cache locality (sketched below). It provides:

  • 10-50% faster inference through reuse of previously computed KV cache entries
  • Horizontal scalability with automatic worker routing
  • GPU acceleration support for production workloads
  • Production-ready Kubernetes deployment on GKE
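
The routing idea is simple: each worker is hashed to many points on a ring, and each request key is sent to the first worker point clockwise from the key's hash, so identical keys keep landing on the same worker's cache, and adding or removing a worker only remaps a small slice of keys. A minimal sketch of such a ring (illustrative only; the project's actual implementation lives in services/coordinator/ and may differ):

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, workers, vnodes=100):
        # Each worker gets `vnodes` virtual points on the ring so load
        # stays roughly even as workers join or leave.
        self._ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def get_node(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["worker-0", "worker-1", "worker-2"])
print(ring.get_node("The future of AI is"))  # same prompt -> same worker

Because only the keys between a departed worker and its ring predecessor move, scaling the worker StatefulSet up or down invalidates only a fraction of cached sequences.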

Architecture

[Architecture diagram: system overview]

Components

  • Gateway: FastAPI service that handles external requests and routes them to workers via consistent hashing (see the sketch after this list)
  • Coordinator: Manages worker registry and sequence-to-worker mapping
  • Workers: StatefulSet pods running inference with distributed KV cache
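
To make the flow concrete, here is a hypothetical gateway route that forwards /generate calls to the worker chosen by the ring sketched above. The httpx client, the worker port, and the pod names are assumptions for illustration; the real service in services/gateway/ will differ:

# Hypothetical gateway routing sketch; reuses ConsistentHashRing from above.
import httpx
from fastapi import FastAPI

app = FastAPI()
ring = ConsistentHashRing(["worker-0", "worker-1", "worker-2"])

@app.post("/generate")
async def generate(request: dict):
    # Route on the prompt so repeated prompts hit the same worker's KV cache.
    worker = ring.get_node(request["prompt"])
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(f"http://{worker}:8000/generate", json=request)
    return resp.json()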

Quick Start

Option 1: Using Dev Container (Recommended)

For a consistent, isolated development environment with all deployment tools pre-installed:

  1. Open in Dev Container

    • Install "Dev Containers" extension in VS Code
    • Press F1 → "Dev Containers: Reopen in Container"
    • Wait for container to build (2-5 minutes first time)
  2. Deploy to GKE

    # Authenticate with GCP
    gcloud auth login
    gcloud auth application-default login
    
    # Run automated deployment
    ./scripts/quickstart_gke.sh
  3. Optional: fix the Docker socket permissions if image builds fail inside the dev container

    sudo chown root:docker /var/run/docker.sock

The dev container includes gcloud, terraform, kubectl, docker, uv, and everything else needed for deployment.

Option 2: Automated Deployment (Local Environment)

If you have gcloud, terraform, and kubectl installed locally:

# One command deployment
./scripts/quickstart_gke.sh

This interactive script will:

  1. Configure your GCP project
  2. Enable required APIs
  3. Create GKE cluster with Terraform
  4. Build and push Docker images
  5. Deploy all services

Option 3: Manual Deployment

# Prerequisites: gcloud, terraform, kubectl installed
# (Or use dev container - see Option 1)

# 1. Configure GCP
export GCP_PROJECT_ID="your-project-id"
gcloud config set project $GCP_PROJECT_ID

# 2. Create infrastructure
cd infra
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings
terraform init
terraform apply

# 3. Build images
cd ..
./scripts/build_images.sh

# 4. Deploy to GKE
./scripts/deploy_gke.sh

See infra/DEPLOYMENT.md for detailed instructions.

Option 4: Local Development

# Run locally with Docker Compose (dev container or local)
./scripts/local_dev.sh

Development Environment

Using Dev Container (Recommended)

The dev container provides an isolated environment with all tools pre-installed:

  • Cloud Tools: gcloud CLI, kubectl, Terraform
  • Python Tools: Python 3.14, uv package manager
  • Container Tools: Docker CLI for building images
  • VS Code Extensions: Python, Kubernetes, Terraform, Cloud Code

Get Started:

  1. Install "Dev Containers" extension in VS Code
  2. Open project and select "Reopen in Container"
  3. All deployment scripts work out of the box!

See .devcontainer/README.md for full documentation.

Performance Testing

The project includes comprehensive test suites:

cd tests

# Install dependencies
uv sync

# Run cache performance tests
uv run pytest test_cache_performance.py -v -s

# Run stress tests
uv run pytest test_stress.py -v -s

# Run all tests
uv run pytest -v -s

Test Categories

  1. Routing Distribution - Validates consistent hashing and load balancing
  2. Cache Locality - Verifies KV cache append and reuse behavior
  3. Generation Flow - End-to-end inference with streaming
  4. Cache Performance - Measures speedup with the cache enabled vs disabled (see the sketch after this list)
  5. Stress Testing - High concurrency, sustained load, burst traffic
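
The cache-vs-no-cache comparison boils down to whether each decoding step recomputes attention over the whole sequence or reuses stored key/value tensors. A standalone micro-benchmark in that spirit (illustrative only; it exercises a local GPT-2 via Hugging Face transformers rather than the deployed service, and "gpt2" is taken from the curl example below; the real tests live in tests/test_cache_performance.py):

import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
input_ids = tokenizer("The future of AI is", return_tensors="pt").input_ids

def generate(n_tokens: int, use_cache: bool) -> float:
    ids, past = input_ids, None
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_tokens):
            if use_cache and past is not None:
                # Incremental step: feed only the newest token, reuse KV cache.
                out = model(ids[:, -1:], past_key_values=past, use_cache=True)
            else:
                # No cache: recompute attention over the full sequence.
                out = model(ids, use_cache=use_cache)
            past = out.past_key_values if use_cache else None
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
    return time.perf_counter() - start

print(f"no cache:   {generate(20, use_cache=False):.3f}s")
print(f"with cache: {generate(20, use_cache=True):.3f}s")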

Project Structure

distributed-kv-cache/
├── infra/                  # Terraform infrastructure
│   ├── main.tf                 # GKE cluster configuration
│   ├── variables.tf            # Infrastructure variables
│   └── DEPLOYMENT.md           # Detailed deployment guide
├── k8s/                    # Kubernetes manifests
│   ├── coordinator.yaml        # Coordinator deployment
│   ├── gateway.yaml            # Gateway + LoadBalancer + HPA
│   ├── worker.yaml             # Worker StatefulSet
│   └── namespace.yaml          # ConfigMap and namespace
├── services/               # Microservices
│   ├── coordinator/            # Consistent hashing coordinator
│   ├── gateway/                # API gateway
│   └── worker/                 # Inference worker with KV cache
├── scripts/                # Deployment scripts
│   ├── quickstart_gke.sh       # One-command deployment
│   ├── build_images.sh         # Build and push to GCR
│   ├── deploy_gke.sh           # Deploy to GKE
│   └── local_dev.sh            # Local development
└── tests/                 # Comprehensive test suite
    ├── test_cache_performance.py  # Cache vs no-cache
    ├── test_stress.py             # Load testing
    └── ...

Configuration

Infrastructure (Terraform)

Edit infra/terraform.tfvars:

project_id   = "your-gcp-project-id"
region       = "us-central1"
cluster_name = "distributed-kv-cache"

# Gateway autoscaling
gateway_min_nodes = 1
gateway_max_nodes = 5

# Worker configuration
worker_node_count = 3
worker_min_nodes  = 2
worker_max_nodes  = 10

# GPU settings (set enable_gpu=false for CPU-only)
enable_gpu          = true
worker_machine_type = "n1-standard-8"
gpu_type            = "nvidia-tesla-t4"
gpu_count           = 1

Kubernetes Resources

  • Gateway: 2 vCPU, 4 GB RAM (autoscales 2-10 pods)
  • Coordinator: 1 vCPU, 2 GB RAM (1 pod)
  • Worker: 4-8 vCPU, 16-24 GB RAM, optional GPU (autoscales 2-10 pods)

Monitoring

# View pod status
kubectl get pods

# Check autoscaling
kubectl get hpa

# View logs
kubectl logs -f deployment/gateway
kubectl logs -f statefulset/worker

# Get cluster stats
GATEWAY_IP=$(kubectl get svc gateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl http://$GATEWAY_IP/stats

Testing the Deployment

# Get gateway IP
GATEWAY_IP=$(kubectl get svc gateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Health check
curl http://$GATEWAY_IP/health

# Generate text
curl -X POST http://$GATEWAY_IP/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The future of AI is",
    "max_tokens": 20,
    "temperature": 0.7,
    "model_name": "gpt2"
  }'

# Check statistics
curl http://$GATEWAY_IP/stats | jq
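
The same checks can be scripted. A minimal Python client mirroring the curl call above (the requests package and the placeholder IP are assumptions; the request body matches the curl example):

import requests

GATEWAY_IP = "203.0.113.10"  # replace with the LoadBalancer IP from kubectl

resp = requests.post(
    f"http://{GATEWAY_IP}/generate",
    json={
        "prompt": "The future of AI is",
        "max_tokens": 20,
        "temperature": 0.7,
        "model_name": "gpt2",
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())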

Security

  • Workload Identity for GCP service access
  • Private GKE cluster option available
  • Firewall rules for internal communication
  • SSL/TLS termination at load balancer (configure separately)

Cleanup

# Delete Kubernetes resources
kubectl delete -f k8s/

# Destroy infrastructure
cd infra
terraform destroy

Documentation

  • infra/DEPLOYMENT.md - detailed GKE deployment guide
  • .devcontainer/README.md - dev container setup and usage

Contributing

This is a university project for COSC 6376 - Cloud Computing.

License

Educational use only - University of Houston, Fall 2025
