A comprehensive demo application for learning and practicing SRE, observability, and incident management concepts. This project simulates a record store e-commerce application with integrated observability tools for metrics, logs, and distributed tracing.
The KodeKloud Records Store application demonstrates a complete observability solution built on modern best practices. It serves as a hands-on learning environment for:
- Setting up comprehensive monitoring and observability
- Implementing distributed tracing in applications
- Designing effective alerting strategies
- Practicing incident response using real-world scenarios
- Learning SLO-based monitoring approaches
- Understanding Prometheus metrics best practices
The project runs as a multi-component stack built around a single application codebase, with the following components:
- FastAPI Web Service - Main application serving REST API endpoints
- Celery Background Worker - Asynchronous task processing (same codebase)
- PostgreSQL Database - Data persistence
- RabbitMQ - Message queue for background task distribution
- Prometheus - Metrics collection and storage
- Grafana - Visualization and dashboards
- Jaeger - Distributed tracing
- Loki - Log aggregation
- Fluent Bit - Log collection and forwarding
- AlertManager - Alert handling and notifications
- Blackbox Exporter - Synthetic monitoring
- Pushgateway - Metrics from batch jobs
```mermaid
graph TB
    subgraph "KodeKloud Records Store Application"
        Client[Client] --> API[FastAPI Web Service<br/>Port: 8000]
        API --> DB[(PostgreSQL<br/>Port: 5432)]
        API --> MQ[RabbitMQ<br/>Port: 5672]
        MQ --> Worker[Celery Worker<br/>Background Tasks]
        Worker --> DB
    end

    subgraph "Observability Stack"
        API --> Prometheus[Prometheus<br/>Port: 9090]
        API --> Jaeger[Jaeger<br/>Port: 16686]
        API --> Fluent[Fluent Bit]
        Worker --> Prometheus
        Worker --> Jaeger
        Worker --> Fluent
        Fluent --> Loki[Loki<br/>Port: 3100]
        Prometheus --> Grafana[Grafana<br/>Port: 3000]
        Prometheus --> AlertManager[AlertManager<br/>Port: 9093]
        Loki --> Grafana
        Jaeger --> Grafana
        Blackbox[Blackbox Exporter<br/>Port: 9115] -.-> API
        Pushgateway[Pushgateway<br/>Port: 9091] --> Prometheus
    end

    classDef app fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef obs fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef storage fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px

    class Client,API,Worker app
    class Prometheus,Grafana,Jaeger,Loki,Fluent,AlertManager,Blackbox,Pushgateway obs
    class DB,MQ storage
```
```mermaid
sequenceDiagram
    participant C as Client
    participant A as FastAPI App
    participant D as Database
    participant M as RabbitMQ
    participant W as Celery Worker

    Note over A,W: All requests traced with Jaeger & logged to Loki

    C->>A: GET /products
    A->>D: Query products
    D-->>A: Product list
    A-->>C: JSON response

    C->>A: POST /checkout
    A->>D: Create order
    A->>M: Queue background task
    A-->>C: Order confirmation
    M->>W: Process order task
    W->>D: Update inventory
    W->>D: Process payment
    W-->>M: Task complete

    Note over A,W: Metrics exported to Prometheus
```
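
To make the checkout sequence concrete, the sketch below shows how a FastAPI route might hand work to a Celery task over RabbitMQ. The names and payload shape are assumptions for illustration; the project's actual code lives in `src/api/routes.py` and `src/api/worker.py`.

```python
# Illustrative sketch only -- the real implementation lives in src/api/routes.py and src/api/worker.py.
from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Broker URL assumes the RabbitMQ service name and default credentials from docker-compose.
celery_app = Celery("worker", broker="amqp://guest:guest@rabbitmq:5672//")


class CheckoutRequest(BaseModel):
    product_id: int
    quantity: int


@celery_app.task(name="process_order")  # hypothetical task name
def process_order(order_id: int) -> None:
    """Runs in the Celery worker: update inventory, process payment, mark the order complete."""


@app.post("/checkout")
def checkout(payload: CheckoutRequest):
    order_id = 1  # the real route first persists an Order row in PostgreSQL
    process_order.delay(order_id)  # enqueue the background task via RabbitMQ
    return {"order_id": order_id, "status": "processing"}
```
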
```
kodekloud-records-store-web-app/
├── src/
│   ├── api/
│   │   ├── main.py                 # FastAPI application entry point
│   │   ├── routes.py               # API endpoints (products, orders, checkout)
│   │   ├── models.py               # Database models (Product, Order)
│   │   ├── database.py             # Database connection and session management
│   │   ├── worker.py               # Celery background tasks
│   │   ├── telemetry.py            # OpenTelemetry setup
│   │   └── metrics.py              # Prometheus metrics definitions (BEST PRACTICES)
│   └── requirements.txt            # Python dependencies
├── config/
│   └── monitoring/                 # Observability configuration
│       ├── prometheus.yml          # Prometheus scrape config
│       ├── alertmanager.yml        # Alert routing rules
│       ├── alert_rules.yml         # Prometheus alerting rules
│       ├── sli_rules.yml           # SLI measurement rules
│       └── grafana-provisioning/   # Grafana dashboards & datasources
├── deploy/
│   └── environments/               # Environment configuration
│       ├── setup-local-env.sh      # Environment setup script
│       └── templates/              # Environment variable templates
│           ├── env.dev.template
│           ├── env.staging.template
│           └── env.prod.template
├── scripts/
│   ├── generate_logs.sh            # Generate test log data
│   └── demo_request_correlation.sh # Demo request tracing
├── docker-compose.yaml             # Complete stack definition
├── Dockerfile                      # Application container image
├── test_traffic.sh                 # Generate test traffic
└── black_box_monitor.sh            # Synthetic monitoring
```
- Docker Desktop (recommended) or Docker + Docker Compose
- Git
- curl (for testing)
```bash
git clone <your-repo-url>
cd kodekloud-records-store-web-app
```

```bash
# Run the environment setup script (creates .env.dev with safe defaults)
./deploy/environments/setup-local-env.sh

# Verify the environment file was created
cat .env.dev
```

The setup script creates a `.env.dev` file with these defaults:

- Database: `dev_user` / `dev_password_123`
- Grafana: `admin` / `dev_admin_123`
- Service name: `kodekloud-record-store-api-dev`
```bash
# Start all services (application + observability)
docker-compose --env-file .env.dev up -d

# Check all services are running
docker-compose ps
```

```bash
# Test the API
curl http://localhost:8000/

# Check metrics endpoint
curl http://localhost:8000/metrics

# Check health
curl http://localhost:8000/health
```

| Service | URL | Credentials |
|---|---|---|
| Records Store API | http://localhost:8000 | N/A |
| API Documentation | http://localhost:8000/docs | N/A |
| Grafana Dashboards | http://localhost:3000 | admin / dev_admin_123 |
| Prometheus | http://localhost:9090 | N/A |
| Jaeger Tracing | http://localhost:16686 | N/A |
| Loki Logs | http://localhost:3100 | N/A |
| AlertManager | http://localhost:9093 | N/A |
| RabbitMQ Management | http://localhost:15672 | guest / guest |
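
The `/health` endpoint checked in the quick start is typically a lightweight liveness route. As a sketch under that assumption (not the repository's actual `src/api` code):

```python
# Hypothetical shape of the health endpoint; the project's real route lives in src/api.
from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
def health() -> dict:
    # A liveness check; a readiness variant could also verify PostgreSQL and RabbitMQ connectivity.
    return {"status": "ok"}
```
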
```bash
# Generate test traffic (products, orders, errors)
./test_traffic.sh

# Generate logs for correlation testing
./scripts/generate_logs.sh

# Run synthetic monitoring
./black_box_monitor.sh
```

```bash
# Basic endpoints
curl http://localhost:8000/ # Root
curl http://localhost:8000/health # Health check
curl http://localhost:8000/metrics # Prometheus metrics
# Observability testing endpoints
curl http://localhost:8000/trace-test # Generate test traces
curl http://localhost:8000/error-test # Generate test errors
# Business endpoints
curl http://localhost:8000/products # List products
curl -X POST http://localhost:8000/products \
-H "Content-Type: application/json" \
-d '{"name": "Abbey Road", "price": 25.99}' # Create product
curl http://localhost:8000/orders # List orders
curl -X POST http://localhost:8000/orders \
-H "Content-Type: application/json" \
-d '{"product_id": 1, "quantity": 2}' # Create order
curl -X POST http://localhost:8000/checkout \
-H "Content-Type: application/json" \
-d '{"product_id": 1, "quantity": 1}' # Checkout (triggers background tasks)- Four Golden Signals organization (Traffic, Latency, Errors, Saturation)
- Proper naming conventions with
kodekloud_prefix - Low cardinality design to avoid metric explosion
- Standard histogram buckets for latency measurements
- Business metrics for SLO tracking
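
For orientation, metric definitions following these conventions might look like the sketch below, using `prometheus_client`. The metric names are illustrative; the canonical definitions are in `src/api/metrics.py`.

```python
# Illustrative definitions in the style described above; the canonical ones are in src/api/metrics.py.
from prometheus_client import Counter, Histogram

# Traffic -- keep label values bounded (method, route template, status) to avoid cardinality explosions
REQUESTS_TOTAL = Counter(
    "kodekloud_http_requests_total",  # hypothetical name
    "Total HTTP requests",
    labelnames=["method", "endpoint", "status"],
)

# Latency -- the library's default buckets are a sensible starting point for HTTP latencies
REQUEST_LATENCY = Histogram(
    "kodekloud_http_request_duration_seconds",  # hypothetical name
    "HTTP request latency in seconds",
    labelnames=["method", "endpoint"],
)

# Business metric usable as an SLO indicator
ORDERS_TOTAL = Counter(
    "kodekloud_orders_total",  # hypothetical name
    "Orders created",
)
```
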
- End-to-end request tracking through FastAPI → Database → Background Worker
- Trace correlation with logs and metrics
- Performance bottleneck identification
- Error propagation analysis
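
A minimal OpenTelemetry setup for this kind of tracing could look like the following sketch. It assumes the OTLP gRPC exporter pointed at Jaeger (matching the `OTEL_EXPORTER_OTLP_ENDPOINT` default shown later) and is not a copy of the project's `src/api/telemetry.py`.

```python
# Sketch of OpenTelemetry tracing setup; the project's actual wiring lives in src/api/telemetry.py.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def configure_tracing(app) -> None:
    resource = Resource.create(
        {"service.name": os.getenv("OTEL_SERVICE_NAME", "kodekloud-record-store-api-dev")}
    )
    provider = TracerProvider(resource=resource)
    exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://jaeger:4317"),
        insecure=True,  # plain gRPC inside the Docker network
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    # Auto-instrument FastAPI so every incoming request produces a span
    FastAPIInstrumentor.instrument_app(app)
```
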
- JSON formatted logs with trace context
- Log correlation across services
- Centralized collection with Fluent Bit → Loki
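
One common way to produce JSON logs that carry trace context is a small custom `logging.Formatter` that pulls the active OpenTelemetry trace and span IDs; this is an assumed approach for illustration, not necessarily the project's exact formatter.

```python
# Sketch of a JSON log formatter that injects the active trace ID for log/trace correlation.
import json
import logging

from opentelemetry import trace


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        ctx = trace.get_current_span().get_span_context()
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id of 0 means there is no active span
            "trace_id": format(ctx.trace_id, "032x") if ctx.trace_id else None,
            "span_id": format(ctx.span_id, "016x") if ctx.span_id else None,
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
```
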
- Service Level Indicators (SLIs) for reliability measurement
- Service Level Objectives (SLOs) with error budgets
- Alerting based on SLO violations, not just symptoms
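
As an illustration of SLO-based alerting, a Prometheus rule of roughly this shape could page on fast error-budget burn against a 99.9% availability SLO. The metric name and thresholds are assumptions; the project's real rules live in `config/monitoring/alert_rules.yml` and `config/monitoring/sli_rules.yml`.

```yaml
# Illustrative only -- see config/monitoring/alert_rules.yml and sli_rules.yml for the real rules.
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        # 1h error ratio compared against the 0.1% budget of a 99.9% SLO (14.4x burn rate)
        expr: |
          (
            sum(rate(kodekloud_http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(kodekloud_http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning too fast against the 99.9% availability SLO"
```
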
- Generate some test traffic: `./test_traffic.sh`
- Open Grafana (http://localhost:3000) and explore the dashboards
- Open Jaeger (http://localhost:16686) and trace a request end-to-end
- Check Prometheus (http://localhost:9090) and query some metrics
- Make a few API calls that will trigger errors
- Find the same request in metrics (Prometheus), logs (Loki), and traces (Jaeger)
- Use trace IDs to correlate between the three data sources
- Look at `src/api/metrics.py` to understand best practices
- Add a new business metric (e.g., `kodekloud_products_viewed_total`)
- Update `src/api/routes.py` to increment your metric
- Rebuild and test: see your metric in http://localhost:8000/metrics (a sketch of this change follows below)
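
A minimal sketch of that change, with illustrative names rather than the repository's exact structure:

```python
# Illustrative sketch; follow the existing patterns in src/api/metrics.py and src/api/routes.py.
from fastapi import APIRouter
from prometheus_client import Counter

# 1. Define the new business metric (would live in src/api/metrics.py)
PRODUCTS_VIEWED = Counter(
    "kodekloud_products_viewed_total",
    "Number of times the product catalogue was viewed",
)

router = APIRouter()


# 2. Increment it in the relevant endpoint (would live in src/api/routes.py)
@router.get("/products")
def list_products():
    PRODUCTS_VIEWED.inc()
    return []  # the real route returns products queried from PostgreSQL
```
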
- Intentionally break something (modify code to cause errors)
- Use the observability tools to identify and diagnose the issue
- Practice following traces to find root causes
The `setup-local-env.sh` script creates these variables:

```bash
# Database Configuration
POSTGRES_HOST=db
POSTGRES_DB=kodekloud_records_dev
POSTGRES_USER=dev_user
POSTGRES_PASSWORD=dev_password_123
# Application Settings
DEBUG=true
LOG_LEVEL=DEBUG
ENVIRONMENT=development
# OpenTelemetry
OTEL_SERVICE_NAME=kodekloud-record-store-api-dev
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
# Grafana
GRAFANA_ADMIN_PASSWORD=dev_admin_123
```

- Staging: Use `env.staging.template`
- Production: Use `env.prod.template`

Copy and modify templates as needed:

```bash
cp deploy/environments/templates/env.staging.template .env.staging
# Edit .env.staging with your values
docker-compose --env-file .env.staging up -d
```

**Services not starting:**

```bash
# Check for port conflicts
docker-compose ps
netstat -tulpn | grep -E ':(3000|8000|9090|5432)'
# Check Docker resources
docker system df
docker system prune   # Clean up if needed
```

**No metrics in Grafana:**

```bash
# Verify Prometheus targets
curl http://localhost:9090/api/v1/targets
# Check API metrics endpoint
curl http://localhost:8000/metrics | grep kodekloud_
```

**No logs in Loki:**

```bash
# Check Fluent Bit is running
docker-compose logs fluent-bit
# Test log endpoint
curl http://localhost:3100/ready
```

**No traces in Jaeger:**

```bash
# Check OpenTelemetry export
docker-compose logs jaeger
# Generate test traces
curl http://localhost:8000/trace-test
```

- Check service logs: `docker-compose logs <service-name>`
- Verify environment: `cat .env.dev`
- Test connectivity: use the curl commands above
- Reset everything: `docker-compose down -v && docker-compose --env-file .env.dev up -d`
This project is designed for learning! Feel free to:
- Add new metrics following the patterns in `src/api/metrics.py`
- Create additional API endpoints in `src/api/routes.py`
- Improve dashboards in `config/monitoring/grafana-provisioning/`
- Add new alerting rules in `config/monitoring/alert_rules.yml`
This project is licensed under the MIT License - see the LICENSE file for details.