This document describes the database resilience features implemented in the Discogsography platform to handle nightly maintenance windows and other database outages.
All services now use resilient database connections that automatically handle:
- Nightly database maintenance windows
- Temporary network issues
- Database restarts
- Connection timeouts
- Service unavailability
Each database connection uses a circuit breaker to prevent cascading failures:
- Closed State: Normal operation, all requests pass through
- Open State: After 5 consecutive failures, rejects requests immediately
- Half-Open State: After 30-60 seconds, allows one test request
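The three-state machine above can be sketched in a few lines of Python. This is a minimal illustration only; the class and method names are not the platform's actual API:

```python
import time


class CircuitBreaker:
    """Minimal sketch of a closed/open/half-open circuit breaker."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # allow one test request through
            else:
                raise RuntimeError("circuit breaker is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed test request, or too many consecutive failures, opens the circuit
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        else:
            # Any success resets the breaker to normal operation
            self.failures = 0
            self.state = "closed"
            return result
```

A successful call while half-open resets the breaker to closed; a failure reopens it and restarts the recovery timer.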
```yaml
# Circuit breaker configuration
failure_threshold: 5     # Number of failures before opening
recovery_timeout: 30-60  # Seconds before trying half-open
```

Failed connections retry with exponential backoff:

```yaml
# Backoff configuration
initial_delay: 0.5-1.0  # Initial retry delay (seconds)
max_delay: 30-60        # Maximum retry delay (seconds)
exponential_base: 2.0   # Delay multiplier
jitter: 25%             # Random jitter to prevent a thundering herd
```

PostgreSQL connection pooling:

- Connection pool with 2-20 connections
- Health checks every 30 seconds
- Automatic removal of unhealthy connections
- Maintains minimum connection pool size
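The exponential backoff with jitter described above can be sketched as a small generator. The parameter names mirror the configuration but are illustrative, not the actual implementation:

```python
import random


def backoff_delays(initial: float = 1.0, maximum: float = 60.0,
                   base: float = 2.0, jitter: float = 0.25, attempts: int = 5):
    """Yield retry delays that grow exponentially, capped at `maximum`,
    with +/- `jitter` randomization to avoid a thundering herd."""
    delay = initial
    for _ in range(attempts):
        # Spread simultaneous reconnects across a window
        yield delay * random.uniform(1 - jitter, 1 + jitter)
        delay = min(delay * base, maximum)
```

With the defaults this produces roughly 1 s, 2 s, 4 s, 8 s, 16 s between attempts, each nudged by up to 25% in either direction.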
Neo4j driver resilience:

- Driver-level connection pooling (max 50 connections)
- Built-in keep-alive mechanism
- Session-level health checks
- Automatic reconnection on SessionExpired
RabbitMQ connection resilience:

- Robust connections with automatic recovery
- Heartbeat monitoring (600 seconds)
- Channel-level recovery
- Publisher confirmations for reliability
During database outages:
- Messages remain in RabbitMQ (persistent storage)
- Failed messages are requeued with `nack(requeue=True)`
- Idempotency prevents duplicates using SHA256 hashes
- No data loss: messages wait until databases recover
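The SHA256-based duplicate prevention can be illustrated with a minimal sketch; the in-memory `seen` set here stands in for whatever persistent duplicate store the services actually use:

```python
import hashlib

# Stand-in for a persistent store of already-processed message hashes
seen: set[str] = set()


def message_hash(body: bytes) -> str:
    """SHA256 fingerprint of a message body, used for duplicate detection."""
    return hashlib.sha256(body).hexdigest()


def process_once(body: bytes, handler) -> bool:
    """Run `handler` only if this message has not been seen before.

    Returns True if the message was processed, False if it was a duplicate
    (e.g. redelivered after a nack/requeue during an outage)."""
    digest = message_hash(body)
    if digest in seen:
        return False
    handler(body)
    seen.add(digest)
    return True
```

Because redelivered messages hash to the same digest, a message requeued during an outage is processed at most once after recovery.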
- Uses `ResilientRabbitMQConnection` for publishing
- Buffers messages during connection issues
- Retries failed publishes with backoff
- Flushes pending messages on recovery
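The buffer-and-flush behavior can be sketched as follows. The `send` callable here is a hypothetical stand-in for the real publish operation:

```python
from collections import deque


class BufferedPublisher:
    """Sketch of publish buffering: hold messages during an outage,
    drain them in order once the connection recovers."""

    def __init__(self, send):
        self.send = send        # callable that raises ConnectionError when the broker is down
        self.pending = deque()  # messages held during an outage

    def publish(self, message) -> None:
        try:
            self.send(message)
        except ConnectionError:
            self.pending.append(message)  # buffer instead of dropping

    def flush(self) -> None:
        """Drain buffered messages after recovery; a re-raised error
        stops the flush and keeps the unsent messages queued."""
        while self.pending:
            self.send(self.pending[0])
            self.pending.popleft()
```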
- Uses `ResilientNeo4jDriver` with automatic reconnection
- Handles `ServiceUnavailable` and `SessionExpired` exceptions
- Requeues messages on connection failures
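The ack/requeue decision can be illustrated generically. Here `ConnectionError` and `TimeoutError` stand in for the driver's `ServiceUnavailable` and `SessionExpired` exceptions, and the return values name the three delivery outcomes:

```python
def handle_delivery(process, message,
                    transient_errors=(ConnectionError, TimeoutError)) -> str:
    """Process a message and classify the outcome.

    Transient database errors requeue the message (it waits in RabbitMQ
    until the database recovers); any other error routes it to the
    dead letter queue as a poison message."""
    try:
        process(message)
    except transient_errors:
        return "nack-requeue"  # database outage: retry later
    except Exception:
        return "nack-dlq"      # poison message: do not retry forever
    return "ack"
```

Distinguishing transient from permanent failures is what keeps a single bad message from blocking the queue while still preserving everything during an outage.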
- Removed reactive 2-minute reconnection timer (now proactive)
- Uses `ResilientPostgreSQLPool` with health monitoring
- Connection pool with min/max bounds (2-20)
- Handles `InterfaceError` and `OperationalError`
- Automatic connection recycling
- Uses all three resilient connection types
- Async implementations for non-blocking operations
- Graceful degradation when services unavailable
- Uses `ResilientNeo4jDriver` with automatic reconnection
- Enriches existing Neo4j nodes with MusicBrainz metadata
- Handles `ServiceUnavailable` and `SessionExpired` exceptions
- Requeues messages on connection failures
- Uses `ResilientPostgreSQLPool` with health monitoring
- Stores MusicBrainz data in the `musicbrainz` PostgreSQL schema
- Connection pool with min/max bounds (2-20)
- Handles `InterfaceError` and `OperationalError`
No changes required to existing environment variables. The resilient connections use the same configuration:
```bash
# Neo4j
NEO4J_HOST=neo4j
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password

# PostgreSQL
POSTGRES_HOST=postgres
POSTGRES_DATABASE=discogsography
POSTGRES_USERNAME=postgres
POSTGRES_PASSWORD=postgres

# RabbitMQ
RABBITMQ_HOST=rabbitmq
RABBITMQ_USERNAME=discogsography
RABBITMQ_PASSWORD=discogsography
```

The following parameters can be adjusted in the code if needed:
```python
# Circuit breaker
failure_threshold = 5  # Failures before circuit opens
recovery_timeout = 30  # Seconds before recovery attempt

# Retry settings
max_retries = 5        # Maximum connection attempts
initial_delay = 1.0    # Initial retry delay (seconds)
max_delay = 60.0       # Maximum retry delay (seconds)

# Connection pools
postgres_min_connections = 2
postgres_max_connections = 20
postgres_health_check_interval = 30

# Neo4j settings
neo4j_max_connection_lifetime = 1800  # 30 minutes
neo4j_max_connection_pool_size = 50
neo4j_connection_acquisition_timeout = 60.0
```

When databases undergo nightly maintenance:
- Connection Detection: Services detect connection loss within seconds
- Circuit Breaker Opens: After 5 failures, prevents cascade
- Message Queuing: New messages remain in RabbitMQ
- Exponential Backoff: Retry attempts with increasing delays
- Recovery: When database returns, connections automatically restore
- Message Processing: Queued messages process in order
- Idempotency: Duplicate prevention via SHA256 hashes
Each service exposes health data including connection status:
- Extractor: http://localhost:8000/health
- Graphinator: http://localhost:8001/health
- Tableinator: http://localhost:8002/health
- Dashboard: http://localhost:8003/health
- API: http://localhost:8005/health
- Explore: http://localhost:8007/health
- Insights: http://localhost:8009/health
- Brainztableinator: http://localhost:8010/health
- Brainzgraphinator: http://localhost:8011/health
Enhanced logging for connection events:
```text
🔄 Creating new connection (attempt 1/5)
⚠️ Connection attempt 1 failed: Connection refused. Retrying in 1.2 seconds...
🔄 Creating new connection (attempt 2/5)
✅ Connection established successfully
🚨 Circuit breaker OPEN after 5 failures
🔄 Circuit breaker entering HALF_OPEN state
✅ Circuit breaker reset to CLOSED
```
The dashboard service's `/metrics` endpoint provides Prometheus metrics for monitoring.
To test the resilience features:
```bash
# Stop Neo4j
docker-compose stop neo4j

# Stop PostgreSQL
docker-compose stop postgres

# Stop RabbitMQ
docker-compose stop rabbitmq
```

Watch the logs to see connection failures and circuit breaker activation:

```bash
docker-compose logs -f graphinator
docker-compose logs -f tableinator
```

Restart the stopped service:

```bash
docker-compose start neo4j
docker-compose start postgres
docker-compose start rabbitmq
```

- Services should automatically reconnect
- Queued messages should process
- No data should be lost
- Don't Panic: Services handle outages automatically
- Monitor Logs: Watch for extended outage warnings
- Check Queues: Monitor RabbitMQ queue depths during outages
- Verify Recovery: Ensure message processing resumes after recovery
- Test Regularly: Simulate outages in non-production environments
If services don't recover after database restart:
- Check circuit breaker state in logs
- Verify database is fully started and accepting connections
- Restart the affected service if needed: `docker-compose restart [service]`
If messages remain queued after recovery:
- Check service health endpoints
- Verify database connectivity manually
- Look for poison messages causing repeated failures
- Check dead letter queues for poison messages (each consumer has its own DLQ)
If services are slow after recovery:
- Check for message backlog in RabbitMQ
- Monitor database connection pool usage
- Consider increasing prefetch counts temporarily
- Watch for circuit breaker flapping (rapid open/close)