Database Resilience Documentation

This document describes the database resilience features implemented in the Discogsography platform to handle nightly maintenance windows and other database outages.

Overview

All services now use resilient database connections that automatically handle:

  • Nightly database maintenance windows
  • Temporary network issues
  • Database restarts
  • Connection timeouts
  • Service unavailability

Key Features

1. Circuit Breaker Pattern

Each database connection uses a circuit breaker to prevent cascading failures:

  • Closed State: Normal operation, all requests pass through
  • Open State: After 5 consecutive failures, rejects requests immediately
  • Half-Open State: After 30-60 seconds, allows one test request
# Circuit breaker configuration
failure_threshold: 5  # Number of failures before opening
recovery_timeout: 30-60  # Seconds before trying half-open
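
For illustration, a minimal circuit breaker with these three states could look like the sketch below (class and method names are illustrative, not the actual shared implementation):

import time

class CircuitBreaker:
    """Minimal sketch: CLOSED -> OPEN after N failures -> HALF_OPEN after a timeout."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow a single test request
                return True
            return False  # reject immediately while the circuit is open
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()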

2. Exponential Backoff

Failed connections retry with exponential backoff:

# Backoff configuration
initial_delay: 0.5-1.0    # Initial retry delay (seconds)
max_delay: 30-60         # Maximum retry delay
exponential_base: 2.0    # Delay multiplier
jitter: 25%              # Random jitter to prevent thundering herd
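
As a sketch, the delay for each retry attempt can be computed like this (parameter values as above; not the actual helper):

import random

def backoff_delay(attempt: int, initial_delay: float = 1.0, max_delay: float = 60.0,
                  exponential_base: float = 2.0, jitter: float = 0.25) -> float:
    """Delay (seconds) before retry number `attempt` (0-based), capped and jittered."""
    delay = min(initial_delay * (exponential_base ** attempt), max_delay)
    # Apply up to +/-25% random jitter so clients do not all retry at the same instant
    return delay * (1 + random.uniform(-jitter, jitter))

# attempt 0 -> ~1 s, attempt 1 -> ~2 s, attempt 2 -> ~4 s, ... capped near 60 s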

3. Connection Health Monitoring

PostgreSQL

  • Connection pool with 2-20 connections
  • Health checks every 30 seconds
  • Automatic removal of unhealthy connections
  • Maintains minimum connection pool size
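
A simplified sketch of the pool bounds and health probe, assuming the wrapper builds on a standard psycopg2 pool (the ResilientPostgreSQLPool internals may differ; credentials are taken from the environment variables below):

import psycopg2
from psycopg2 import pool

# Pool with the documented min/max bounds
pg_pool = pool.ThreadedConnectionPool(
    minconn=2,
    maxconn=20,
    host="postgres",
    dbname="discogsography",
    user="postgres",
    password="postgres",
)

def connection_is_healthy(conn) -> bool:
    """Probe run roughly every 30 seconds; unhealthy connections are dropped from the pool."""
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
        return True
    except psycopg2.OperationalError:
        pg_pool.putconn(conn, close=True)  # remove the broken connection from the pool
        return False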

Neo4j

  • Driver-level connection pooling (max 50 connections)
  • Built-in keep-alive mechanism
  • Session-level health checks
  • Automatic reconnection on SessionExpired
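
These settings map onto standard Neo4j Python driver options; a minimal sketch (Bolt URL assumed from the environment variables below, default port):

from neo4j import GraphDatabase

# Driver-level pooling and keep-alive; numbers match the tuning parameters later in this document
driver = GraphDatabase.driver(
    "bolt://neo4j:7687",
    auth=("neo4j", "password"),
    max_connection_lifetime=1800,         # recycle connections after 30 minutes
    max_connection_pool_size=50,          # cap the driver-level pool at 50
    connection_acquisition_timeout=60.0,  # wait up to 60 s for a pooled connection
    keep_alive=True,                      # TCP keep-alive on pooled connections
)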

RabbitMQ

  • Robust connections with automatic recovery
  • Heartbeat monitoring (600 seconds)
  • Channel-level recovery
  • Publisher confirmations for reliability
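
Assuming the services use aio-pika, the robust connection roughly amounts to the following (URL and heartbeat illustrative, credentials from the environment variables below):

import aio_pika

async def open_rabbitmq():
    # connect_robust re-establishes channels and consumers automatically after a broker restart;
    # the heartbeat in the URL matches the 600-second value above.
    connection = await aio_pika.connect_robust(
        "amqp://discogsography:discogsography@rabbitmq/?heartbeat=600"
    )
    channel = await connection.channel(publisher_confirms=True)  # publisher confirmations
    return connection, channel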

4. Message Durability

During database outages:

  1. Messages remain in RabbitMQ (persistent storage)
  2. Failed messages are requeued with nack(requeue=True)
  3. Idempotency prevents duplicates using SHA256 hashes
  4. No data loss - messages wait until databases recover
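
A simplified consumer loop showing this requeue-and-deduplicate behavior (helper names such as write_to_database are hypothetical; the real handlers live in each service and persist their hashes):

import hashlib
import json

# Hashes of already-processed payloads; shown in memory here for brevity
seen_hashes: set[str] = set()

async def on_message(message) -> None:
    payload_hash = hashlib.sha256(message.body).hexdigest()
    if payload_hash in seen_hashes:
        await message.ack()  # duplicate delivery after a retry: safe to drop
        return
    try:
        await write_to_database(json.loads(message.body))  # hypothetical database write
    except Exception:
        await message.nack(requeue=True)  # database unavailable: put the message back and retry later
        return
    seen_hashes.add(payload_hash)
    await message.ack()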

Service-Specific Implementation

All Services (RabbitMQ)

  • Uses ResilientRabbitMQConnection for publishing
  • Buffers messages during connection issues
  • Retries failed publishes with backoff
  • Flushes pending messages on recovery

Graphinator Service (Neo4j)

  • Uses ResilientNeo4jDriver with automatic reconnection
  • Handles ServiceUnavailable and SessionExpired exceptions
  • Requeues messages on connection failures
  • Proactive reconnection replaces the earlier reactive 2-minute reconnection timer
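
The Neo4j failure handling follows roughly this pattern (a sketch assuming driver 5.x sessions; upsert_artist and the message object are illustrative):

import json

from neo4j.exceptions import ServiceUnavailable, SessionExpired

async def handle_artist_message(message) -> None:
    try:
        # `driver` is the resilient Neo4j driver created at startup (see the driver sketch above)
        with driver.session() as session:
            session.execute_write(upsert_artist, json.loads(message.body))
        await message.ack()
    except (ServiceUnavailable, SessionExpired):
        # Neo4j is down or the session died mid-transaction: requeue and let the
        # resilient driver reconnect with backoff before the next delivery.
        await message.nack(requeue=True)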

Tableinator Service (PostgreSQL)

  • Uses ResilientPostgreSQLPool with health monitoring
  • Connection pool with min/max bounds (2-20)
  • Handles InterfaceError and OperationalError
  • Automatic connection recycling
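
The PostgreSQL side looks similar, with broken connections recycled back through the pool (SQL and table name illustrative):

from psycopg2 import InterfaceError, OperationalError

async def handle_release_message(message) -> None:
    conn = pg_pool.getconn()  # pg_pool as in the pool sketch above
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO releases (data) VALUES (%s) ON CONFLICT DO NOTHING",
                (message.body.decode(),),
            )
        conn.commit()
        pg_pool.putconn(conn)
        await message.ack()
    except (InterfaceError, OperationalError):
        pg_pool.putconn(conn, close=True)  # recycle the broken connection
        await message.nack(requeue=True)   # retry once the database is back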

Dashboard Service

  • Uses all three resilient connection types
  • Async implementations for non-blocking operations
  • Graceful degradation when services unavailable

Brainzgraphinator Service (Neo4j)

  • Uses ResilientNeo4jDriver with automatic reconnection
  • Enriches existing Neo4j nodes with MusicBrainz metadata
  • Handles ServiceUnavailable and SessionExpired exceptions
  • Requeues messages on connection failures

Brainztableinator Service (PostgreSQL)

  • Uses ResilientPostgreSQLPool with health monitoring
  • Stores MusicBrainz data in musicbrainz PostgreSQL schema
  • Connection pool with min/max bounds (2-20)
  • Handles InterfaceError and OperationalError

Configuration

Environment Variables

No changes required to existing environment variables. The resilient connections use the same configuration:

# Neo4j
NEO4J_HOST=neo4j
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password

# PostgreSQL
POSTGRES_HOST=postgres
POSTGRES_DATABASE=discogsography
POSTGRES_USERNAME=postgres
POSTGRES_PASSWORD=postgres

# RabbitMQ
RABBITMQ_HOST=rabbitmq
RABBITMQ_USERNAME=discogsography
RABBITMQ_PASSWORD=discogsography

Tuning Parameters

The following parameters can be adjusted in the code if needed:

# Circuit Breaker
failure_threshold = 5  # Failures before circuit opens
recovery_timeout = 30  # Seconds before recovery attempt

# Retry Settings
max_retries = 5  # Maximum connection attempts
initial_delay = 1.0  # Initial retry delay
max_delay = 60.0  # Maximum retry delay

# Connection Pools
postgres_min_connections = 2
postgres_max_connections = 20
postgres_health_check_interval = 30

# Neo4j Settings
neo4j_max_connection_lifetime = 1800  # 30 minutes
neo4j_max_connection_pool_size = 50
neo4j_connection_acquisition_timeout = 60.0

Behavior During Maintenance

When databases undergo nightly maintenance:

  1. Connection Detection: Services detect connection loss within seconds
  2. Circuit Breaker Opens: After 5 failures, prevents cascade
  3. Message Queuing: New messages remain in RabbitMQ
  4. Exponential Backoff: Retry attempts with increasing delays
  5. Recovery: When database returns, connections automatically restore
  6. Message Processing: Queued messages process in order
  7. Idempotency: Duplicate prevention via SHA256 hashes

Monitoring

Health Endpoints

Each service exposes health data including connection status:

  • Extractor: http://localhost:8000/health
  • Graphinator: http://localhost:8001/health
  • Tableinator: http://localhost:8002/health
  • Dashboard: http://localhost:8003/health
  • API: http://localhost:8005/health
  • Explore: http://localhost:8007/health
  • Insights: http://localhost:8009/health
  • Brainztableinator: http://localhost:8010/health
  • Brainzgraphinator: http://localhost:8011/health
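
For a quick scripted check of every endpoint (assuming each /health endpoint returns JSON):

import json
import urllib.request

PORTS = {
    "extractor": 8000, "graphinator": 8001, "tableinator": 8002, "dashboard": 8003,
    "api": 8005, "explore": 8007, "insights": 8009,
    "brainztableinator": 8010, "brainzgraphinator": 8011,
}

for name, port in PORTS.items():
    try:
        with urllib.request.urlopen(f"http://localhost:{port}/health", timeout=5) as resp:
            print(f"{name}: {resp.status} {json.loads(resp.read())}")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")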

Logging

Enhanced logging for connection events:

🔄 Creating new connection (attempt 1/5)
⚠️ Connection attempt 1 failed: Connection refused. Retrying in 1.2 seconds...
🔄 Creating new connection (attempt 2/5)
✅ Connection established successfully
🚨 Circuit breaker OPEN after 5 failures
🔄 Circuit breaker entering HALF_OPEN state
✅ Circuit breaker reset to CLOSED

Metrics

The dashboard service (/metrics endpoint) provides Prometheus metrics for monitoring.

Testing Database Outages

To test the resilience features:

1. Stop a Database

# Stop Neo4j
docker-compose stop neo4j

# Stop PostgreSQL
docker-compose stop postgres

# Stop RabbitMQ
docker-compose stop rabbitmq

2. Observe Service Behavior

Watch the logs to see connection failures and circuit breaker activation:

docker-compose logs -f graphinator
docker-compose logs -f tableinator

3. Restart Database

# Restart the stopped service
docker-compose start neo4j
docker-compose start postgres
docker-compose start rabbitmq

4. Verify Recovery

  • Services should automatically reconnect
  • Queued messages should process
  • No data should be lost

Best Practices

  1. Don't Panic: Services handle outages automatically
  2. Monitor Logs: Watch for extended outage warnings
  3. Check Queues: Monitor RabbitMQ queue depths during outages
  4. Verify Recovery: Ensure message processing resumes after recovery
  5. Test Regularly: Simulate outages in non-production environments

Troubleshooting

Services Not Recovering

If services don't recover after database restart:

  1. Check circuit breaker state in logs
  2. Verify database is fully started and accepting connections
  3. Restart affected service if needed: docker-compose restart [service]

Messages Not Processing

If messages remain queued after recovery:

  1. Check service health endpoints
  2. Verify database connectivity manually
  3. Look for poison messages causing repeated failures
  4. Check the dead letter queues (each consumer has its own DLQ)

Performance Issues

If services are slow after recovery:

  1. Check for message backlog in RabbitMQ
  2. Monitor database connection pool usage
  3. Consider increasing prefetch counts temporarily
  4. Watch for circuit breaker flapping (rapid open/close)