Skip to content

fix: fail-fast on missing or invalid DCR_ENCRYPTION_KEY#60

Merged
luis5tb merged 3 commits intoRHEcosystemAppEng:mainfrom
luis5tb:dcr-enc-key
Mar 31, 2026
Merged

fix: fail-fast on missing or invalid DCR_ENCRYPTION_KEY#60
luis5tb merged 3 commits intoRHEcosystemAppEng:mainfrom
luis5tb:dcr-enc-key

Conversation

@luis5tb
Copy link
Copy Markdown
Collaborator

@luis5tb luis5tb commented Mar 27, 2026

Previously, _encrypt_secret() silently generated an ephemeral Fernet key when DCR_ENCRYPTION_KEY was not configured. Secrets encrypted with an ephemeral key become permanently unrecoverable after a container restart.

  • Raise RuntimeError in _encrypt_secret() when no Fernet cipher is available instead of generating an ephemeral key
  • Raise ValueError in DCRService.init() when DCR_ENCRYPTION_KEY is missing on Cloud Run (K_SERVICE is set)
  • Raise ValueError on invalid DCR_ENCRYPTION_KEY at construction time instead of logging and continuing with _fernet=None
  • Add a stable Fernet key to test conftest so DCR tests have a valid encryption key available

@IlonaShishov
Copy link
Copy Markdown
Collaborator

Code Review

This PR addresses a critical security vulnerability where client secrets could be encrypted with ephemeral Fernet keys. The validation logic is sound, but there's a critical gap in production safety.

🔴 Critical Issue: Lazy Initialization Bypass

Problem: DCRService validation only runs when the first DCR request arrives, not at startup.

Current Flow:

Cloud Run starts → DB init → Health probes pass → Service marked READY ✅
                                                        ↓
                                    (First DCR request arrives)
                                                        ↓
                                    get_dcr_service() called
                                                        ↓
                                    DCRService.__init__() runs
                                                        ↓
                                    💥 ValueError raised (too late!)

Root Cause:

  • get_dcr_service() uses lazy initialization (singleton pattern)
  • Only called from marketplace/router.py:81 when handling DCR requests
  • marketplace/app.py lifespan doesn't initialize the DCR service
  • Cloud Run startup probes pass before validation runs

Impact: Service appears healthy but fails on first DCR use. Missing/invalid DCR_ENCRYPTION_KEY is not caught until runtime.

✅ Required Fix: Eager Validation at Startup

Add DCR initialization to src/lightspeed_agent/marketplace/app.py lifespan:

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    """Application lifespan manager for startup/shutdown events."""
    settings = get_settings()

    # Startup: Initialize database
    try:
        from lightspeed_agent.db import init_database

        logger.info("Initializing database: %s", settings.database_url.split("@")[-1])
        await init_database()
        logger.info("Database initialized successfully")
    except Exception as e:
        logger.error("Failed to initialize database: %s", e)
        raise

    # Startup: Validate DCR configuration (fail-fast on Cloud Run)
    # This ensures DCR_ENCRYPTION_KEY is valid BEFORE the service becomes ready,
    # preventing silent failures when the first DCR request arrives.
    try:
        from lightspeed_agent.dcr import get_dcr_service

        logger.info("Validating DCR service configuration")
        get_dcr_service()  # Triggers DCRService.__init__() validation
        logger.info("DCR service initialized successfully")
    except Exception as e:
        logger.error("Failed to initialize DCR service: %s", e)
        raise

    yield

    # Shutdown: Close database connection
    try:
        from lightspeed_agent.db import close_database

        logger.info("Closing database connection")
        await close_database()
    except Exception as e:
        logger.error("Failed to close database: %s", e)

Why this works:

  • Triggers DCRService.__init__() validation before Cloud Run health probes
  • Container fails to start if DCR_ENCRYPTION_KEY is missing/invalid
  • Cloud Run never marks service as READY with broken config

🟡 Recommended: Add Test Coverage

Add tests for the new error conditions in tests/test_dcr.py:

@pytest.mark.asyncio
async def test_dcr_service_missing_key_on_cloud_run(monkeypatch, db_session):
    """Test that DCRService raises ValueError when DCR_ENCRYPTION_KEY is missing on Cloud Run."""
    monkeypatch.setenv("K_SERVICE", "test-marketplace-handler")
    monkeypatch.delenv("DCR_ENCRYPTION_KEY", raising=False)
    
    with pytest.raises(ValueError, match="DCR_ENCRYPTION_KEY is required in production"):
        DCRService()

@pytest.mark.asyncio
async def test_dcr_service_invalid_encryption_key(monkeypatch, db_session):
    """Test that DCRService raises ValueError for invalid DCR_ENCRYPTION_KEY."""
    monkeypatch.setenv("DCR_ENCRYPTION_KEY", "not-a-valid-fernet-key")
    
    with pytest.raises(ValueError, match="Invalid DCR_ENCRYPTION_KEY"):
        DCRService()

@pytest.mark.asyncio
async def test_encrypt_secret_without_key_raises(service):
    """Test that _encrypt_secret raises RuntimeError when _fernet is None."""
    service._fernet = None
    
    with pytest.raises(RuntimeError, match="Cannot encrypt client secret"):
        service._encrypt_secret("test-secret")

Verification

Cloud Run deployment config: ✅ Verified

  • DCR_ENCRYPTION_KEY sourced from Secret Manager (dcr-encryption-key)
  • No optional: true flag (unlike K8s manifest)

Recommendation

Approve with required fix:

  1. 🔴 REQUIRED: Add eager DCR initialization to lifespan (above)
  2. 🟡 RECOMMENDED: Add test coverage for error paths

Priority: CRITICAL - Without eager initialization, Cloud Run can start with invalid config.


With the eager initialization fix, this PR will provide complete air-tight protection against missing/invalid encryption keys in Cloud Run production deployments.

IlonaShishov
IlonaShishov previously approved these changes Mar 29, 2026
Copy link
Copy Markdown
Collaborator

@IlonaShishov IlonaShishov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM,
please consider if worth upgrading to eager fail-fast strategy suggested by Claude.

luis5tb and others added 2 commits March 30, 2026 08:12
Previously, _encrypt_secret() silently generated an ephemeral Fernet key
when DCR_ENCRYPTION_KEY was not configured. Secrets encrypted with an
ephemeral key become permanently unrecoverable after a container restart.

- Raise RuntimeError in _encrypt_secret() when no Fernet cipher is
  available instead of generating an ephemeral key
- Raise ValueError in DCRService.__init__() when DCR_ENCRYPTION_KEY is
  missing on Cloud Run (K_SERVICE is set)
- Raise ValueError on invalid DCR_ENCRYPTION_KEY at construction time
  instead of logging and continuing with _fernet=None
- Add a stable Fernet key to test conftest so DCR tests have a valid
  encryption key available

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move DCR service initialization from lazy (first-request) to eager
(lifespan startup) so Cloud Run fails to start with missing/invalid
DCR_ENCRYPTION_KEY instead of appearing healthy and failing on first
DCR request. Add test coverage for encryption key error paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@luis5tb
Copy link
Copy Markdown
Collaborator Author

luis5tb commented Mar 30, 2026

LGTM, please consider if worth upgrading to eager fail-fast strategy suggested by Claude.

added! thanks!

@dmartinol
Copy link
Copy Markdown
Collaborator

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@luis5tb
Copy link
Copy Markdown
Collaborator Author

luis5tb commented Mar 31, 2026

/retest

@luis5tb luis5tb merged commit 418f490 into RHEcosystemAppEng:main Mar 31, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants