-
Notifications
You must be signed in to change notification settings - Fork 170
fix(3383): loss of pod causes stdio transport to stop #3384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3384 +/- ##
==========================================
+ Coverage 64.79% 64.98% +0.19%
==========================================
Files 375 375
Lines 36626 36630 +4
==========================================
+ Hits 23730 23804 +74
+ Misses 11024 10947 -77
- Partials 1872 1879 +7 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical bug where the stdio transport enters a "zombie" state when Docker becomes unavailable during re-attachment. The transport would appear running (IsRunning() returns true) but could not communicate with the container, preventing automatic restart. The fix ensures proper shutdown by calling Stop() when re-attachment fails.
Changes:
- Added
Stop()call inprocessStdoutafter failed re-attachment to properly shut down the zombie transport - Refactored monitoring and cleanup logic in
Runner.Run()into testable helper functions (monitorTransportandhandleTransportStopped) - Added comprehensive test coverage for the zombie state bug and refactored monitoring/cleanup logic
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/transport/stdio.go | Added Stop() call after failed re-attachment to prevent zombie state |
| pkg/transport/stdio_test.go | Added TestTransport_ZombieStateWhenDockerUnavailable and updated existing tests with new mock expectations for Stop() and StopWorkload() calls |
| pkg/runner/runner.go | Extracted monitoring loop into monitorTransport and cleanup logic into handleTransportStopped for better testability |
| pkg/runner/runner_test.go | Added tests for monitorTransport and handleTransportStopped to verify monitoring and restart decision logic |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| // transportStoppedDeps holds the dependencies for handleTransportStopped. | ||
| // transportStoppedDeps holds the dependencies for handleTransportStopped. |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate comment found. The comment "transportStoppedDeps holds the dependencies for handleTransportStopped." appears on both lines 700 and 701. Remove the duplicate on line 701.
| // transportStoppedDeps holds the dependencies for handleTransportStopped. |
Summary
Fixes #3383
When the Docker daemon becomes unavailable while a stdio transport proxy is running, the transport would enter a "zombie" state where
IsRunning()returnstruebut it cannot communicate with the container. This prevented the automatic restart mechanism from triggering. The fix ensures the transport properly shuts down when re-attachment fails.Changes
The fix here is actually quite small, but I found the existing recovery code hard to understand. Consequently, I made slight refactors to backfill unit tests to document the actual relationship between the transport and the runner's monitoring behavior.
Fix: Call
Stop()after failed re-attachment (pkg/transport/stdio.go)Added a call to
t.Stop(ctx)inprocessStdoutafter re-attachment fails. Previously, the function would log "Container stdout closed - exiting read loop" and return without shutting down the transport, leaving it in a zombie state where:processMessagescontinued runningIsRunning()returnedtrueRefactor: Extract monitoring loop into
monitorTransport(pkg/runner/runner.go)Extracted the transport monitoring goroutine from
Run()into a standalonemonitorTransportfunction with atransportCheckerfunction type alias. This improves testability and reduces code in the mainRun()method.Refactor: Extract transport stopped handling into
handleTransportStopped(pkg/runner/runner.go)Extracted the
case <-doneChhandling logic into a testablehandleTransportStoppedfunction with atransportStoppedDepsstruct for dependency injection. This allows unit testing the restart decision logic without complex integration setup.Test updates (
pkg/transport/stdio_test.go)TestTransport_ZombieStateWhenDockerUnavailableto verify the fixTestProcessStdout_EOFWithFailedReattachment,TestConcurrentReattachment, andTestStdinRaceConditionwith additional mock expectations forStop()andStopWorkload()callsStop()is calledTesting
TestTransport_ZombieStateWhenDockerUnavailable- verifies transport shuts down properly when Docker is unavailable during re-attachmentTest_monitorTransport- verifies monitoring loop detects stopped transport, continues polling on errors, and respects context cancellationTestHandleTransportStopped- verifies restart decision logic (restart when workload exists, no restart when removed, restart on check failure)