Skip to content

Conversation

@jerm-dro
Copy link
Contributor

Summary

Fixes #3383

When the Docker daemon becomes unavailable while a stdio transport proxy is running, the transport would enter a "zombie" state where IsRunning() returns true but it cannot communicate with the container. This prevented the automatic restart mechanism from triggering. The fix ensures the transport properly shuts down when re-attachment fails.

Changes

The fix here is actually quite small, but I found the existing recovery code hard to understand. Consequently, I made slight refactors to backfill unit tests to document the actual relationship between the transport and the runner's monitoring behavior.

Fix: Call Stop() after failed re-attachment (pkg/transport/stdio.go)

Added a call to t.Stop(ctx) in processStdout after re-attachment fails. Previously, the function would log "Container stdout closed - exiting read loop" and return without shutting down the transport, leaving it in a zombie state where:

  • processMessages continued running
  • The HTTP proxy continued accepting requests
  • IsRunning() returned true
  • The monitoring loop never detected the failure

Refactor: Extract monitoring loop into monitorTransport (pkg/runner/runner.go)

Extracted the transport monitoring goroutine from Run() into a standalone monitorTransport function with a transportChecker function type alias. This improves testability and reduces code in the main Run() method.

Refactor: Extract transport stopped handling into handleTransportStopped (pkg/runner/runner.go)

Extracted the case <-doneCh handling logic into a testable handleTransportStopped function with a transportStoppedDeps struct for dependency injection. This allows unit testing the restart decision logic without complex integration setup.

Test updates (pkg/transport/stdio_test.go)

  • Added TestTransport_ZombieStateWhenDockerUnavailable to verify the fix
  • Updated TestProcessStdout_EOFWithFailedReattachment, TestConcurrentReattachment, and TestStdinRaceCondition with additional mock expectations for Stop() and StopWorkload() calls
  • Updated test assertions to reflect the new behavior where stdin/stdout are nil after Stop() is called

Testing

  • TestTransport_ZombieStateWhenDockerUnavailable - verifies transport shuts down properly when Docker is unavailable during re-attachment
  • Test_monitorTransport - verifies monitoring loop detects stopped transport, continues polling on errors, and respects context cancellation
  • TestHandleTransportStopped - verifies restart decision logic (restart when workload exists, no restart when removed, restart on check failure)
  • All existing transport and runner tests pass
  • Manual e2e test with real docker outage

Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Jan 21, 2026
@jerm-dro jerm-dro requested a review from Copilot January 21, 2026 19:50
@codecov
Copy link

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 51.92308% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.98%. Comparing base (6a18d9d) to head (bdc5ca7).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/runner/runner.go 54.00% 20 Missing and 3 partials ⚠️
pkg/transport/stdio.go 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3384      +/-   ##
==========================================
+ Coverage   64.79%   64.98%   +0.19%     
==========================================
  Files         375      375              
  Lines       36626    36630       +4     
==========================================
+ Hits        23730    23804      +74     
+ Misses      11024    10947      -77     
- Partials     1872     1879       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical bug where the stdio transport enters a "zombie" state when Docker becomes unavailable during re-attachment. The transport would appear running (IsRunning() returns true) but could not communicate with the container, preventing automatic restart. The fix ensures proper shutdown by calling Stop() when re-attachment fails.

Changes:

  • Added Stop() call in processStdout after failed re-attachment to properly shut down the zombie transport
  • Refactored monitoring and cleanup logic in Runner.Run() into testable helper functions (monitorTransport and handleTransportStopped)
  • Added comprehensive test coverage for the zombie state bug and refactored monitoring/cleanup logic

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
pkg/transport/stdio.go Added Stop() call after failed re-attachment to prevent zombie state
pkg/transport/stdio_test.go Added TestTransport_ZombieStateWhenDockerUnavailable and updated existing tests with new mock expectations for Stop() and StopWorkload() calls
pkg/runner/runner.go Extracted monitoring loop into monitorTransport and cleanup logic into handleTransportStopped for better testability
pkg/runner/runner_test.go Added tests for monitorTransport and handleTransportStopped to verify monitoring and restart decision logic

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}

// transportStoppedDeps holds the dependencies for handleTransportStopped.
// transportStoppedDeps holds the dependencies for handleTransportStopped.
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate comment found. The comment "transportStoppedDeps holds the dependencies for handleTransportStopped." appears on both lines 700 and 701. Remove the duplicate on line 701.

Suggested change
// transportStoppedDeps holds the dependencies for handleTransportStopped.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] stdio: transport enters zombie state when re-attachment fails

2 participants