fix(3383): loss of pod causes stdio transport to stop #3384

jerm-dro · 2026-01-21T19:45:49Z

Summary

When the Docker daemon becomes unavailable while a stdio transport proxy is running, the transport would enter a "zombie" state where IsRunning() returns true but it cannot communicate with the container. This prevented the automatic restart mechanism from triggering. The fix ensures the transport properly shuts down when re-attachment fails.

Changes

The fix here is actually quite small, but I found the existing recovery code hard to understand. Consequently, I made slight refactors to backfill unit tests to document the actual relationship between the transport and the runner's monitoring behavior.

Fix: Call `Stop()` after failed re-attachment (`pkg/transport/stdio.go`)

Added a call to t.Stop(ctx) in processStdout after re-attachment fails. Previously, the function would log "Container stdout closed - exiting read loop" and return without shutting down the transport, leaving it in a zombie state where:

processMessages continued running
The HTTP proxy continued accepting requests
IsRunning() returned true
The monitoring loop never detected the failure

Refactor: Extract monitoring loop into `monitorTransport` (`pkg/runner/runner.go`)

Extracted the transport monitoring goroutine from Run() into a standalone monitorTransport function with a transportChecker function type alias. This improves testability and reduces code in the main Run() method.

Refactor: Extract transport stopped handling into `handleTransportStopped` (`pkg/runner/runner.go`)

Extracted the case <-doneCh handling logic into a testable handleTransportStopped function with a transportStoppedDeps struct for dependency injection. This allows unit testing the restart decision logic without complex integration setup.

Test updates (`pkg/transport/stdio_test.go`)

Added TestTransport_ZombieStateWhenDockerUnavailable to verify the fix
Updated TestProcessStdout_EOFWithFailedReattachment, TestConcurrentReattachment, and TestStdinRaceCondition with additional mock expectations for Stop() and StopWorkload() calls
Updated test assertions to reflect the new behavior where stdin/stdout are nil after Stop() is called

Testing

TestTransport_ZombieStateWhenDockerUnavailable - verifies transport shuts down properly when Docker is unavailable during re-attachment
Test_monitorTransport - verifies monitoring loop detects stopped transport, continues polling on errors, and respects context cancellation
TestHandleTransportStopped - verifies restart decision logic (restart when workload exists, no restart when removed, restart on check failure)
All existing transport and runner tests pass
Manual e2e test with real docker outage

Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>

codecov · 2026-01-21T19:54:24Z

Codecov Report

❌ Patch coverage is 51.92308% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.98%. Comparing base (6a18d9d) to head (bdc5ca7).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/runner/runner.go	54.00%	20 Missing and 3 partials ⚠️
pkg/transport/stdio.go	0.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3384      +/-   ##
==========================================
+ Coverage   64.79%   64.98%   +0.19%     
==========================================
  Files         375      375              
  Lines       36626    36630       +4     
==========================================
+ Hits        23730    23804      +74     
+ Misses      11024    10947      -77     
- Partials     1872     1879       +7

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR fixes a critical bug where the stdio transport enters a "zombie" state when Docker becomes unavailable during re-attachment. The transport would appear running (IsRunning() returns true) but could not communicate with the container, preventing automatic restart. The fix ensures proper shutdown by calling Stop() when re-attachment fails.

Changes:

Added Stop() call in processStdout after failed re-attachment to properly shut down the zombie transport
Refactored monitoring and cleanup logic in Runner.Run() into testable helper functions (monitorTransport and handleTransportStopped)
Added comprehensive test coverage for the zombie state bug and refactored monitoring/cleanup logic

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
pkg/transport/stdio.go	Added `Stop()` call after failed re-attachment to prevent zombie state
pkg/transport/stdio_test.go	Added `TestTransport_ZombieStateWhenDockerUnavailable` and updated existing tests with new mock expectations for `Stop()` and `StopWorkload()` calls
pkg/runner/runner.go	Extracted monitoring loop into `monitorTransport` and cleanup logic into `handleTransportStopped` for better testability
pkg/runner/runner_test.go	Added tests for `monitorTransport` and `handleTransportStopped` to verify monitoring and restart decision logic

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-21T19:55:01Z

pkg/runner/runner.go

+}
+
+// transportStoppedDeps holds the dependencies for handleTransportStopped.
+// transportStoppedDeps holds the dependencies for handleTransportStopped.


Duplicate comment found. The comment "transportStoppedDeps holds the dependencies for handleTransportStopped." appears on both lines 700 and 701. Remove the duplicate on line 701.

Suggested change

// transportStoppedDeps holds the dependencies for handleTransportStopped.

fix(3383): loss of pod causes stdio transport to stop

bdc5ca7

Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>

github-actions bot added the size/M Medium PR: 300-599 lines changed label Jan 21, 2026

jerm-dro requested a review from Copilot January 21, 2026 19:50

Copilot started reviewing on behalf of jerm-dro January 21, 2026 19:50 View session

Copilot AI reviewed Jan 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(3383): loss of pod causes stdio transport to stop #3384

fix(3383): loss of pod causes stdio transport to stop #3384

jerm-dro commented Jan 21, 2026

Uh oh!

codecov bot commented Jan 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(3383): loss of pod causes stdio transport to stop #3384

Are you sure you want to change the base?

fix(3383): loss of pod causes stdio transport to stop #3384

Conversation

jerm-dro commented Jan 21, 2026

Summary

Changes

Fix: Call Stop() after failed re-attachment (pkg/transport/stdio.go)

Refactor: Extract monitoring loop into monitorTransport (pkg/runner/runner.go)

Refactor: Extract transport stopped handling into handleTransportStopped (pkg/runner/runner.go)

Test updates (pkg/transport/stdio_test.go)

Testing

Uh oh!

codecov bot commented Jan 21, 2026

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix: Call `Stop()` after failed re-attachment (`pkg/transport/stdio.go`)

Refactor: Extract monitoring loop into `monitorTransport` (`pkg/runner/runner.go`)

Refactor: Extract transport stopped handling into `handleTransportStopped` (`pkg/runner/runner.go`)

Test updates (`pkg/transport/stdio_test.go`)