fix(coordinator): fail pending build/spawn waits when daemon disconnects #1466
Open
Bhanudahiyaa wants to merge 3 commits into dora-rs:main from
Maintainers: this PR is ready for focused review on coordinator failure path correctness. Quick status:
What to review:
Note on the failing CI job:
Closes: #1465
Problem
The coordinator watchdog currently removes dead daemon connections, but pending lifecycle waits can remain unresolved:
- wait_for_build can stay blocked when a build was waiting for a disconnected daemon.
- wait_for_spawn can stay blocked when spawn acknowledgements are pending from a disconnected daemon.
This creates long hangs (up to the client RPC deadline) instead of fast, deterministic failure.
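To make the hang concrete, here is a minimal sketch of a caller waiting on a reply channel with a deadline. It is illustrative only: `wait_for_build_reply`, `WaitOutcome`, and the use of `std::sync::mpsc` are assumptions made for this example, not the coordinator's actual types or RPC layer.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome of waiting for a build reply.
#[derive(Debug, PartialEq)]
enum WaitOutcome {
    Finished,
    Failed(String),
    DeadlineExceeded,
}

/// Hypothetical client-side wait: block until a reply arrives or the deadline passes.
fn wait_for_build_reply(
    rx: &Receiver<Result<(), String>>,
    deadline: Duration,
) -> WaitOutcome {
    match rx.recv_timeout(deadline) {
        Ok(Ok(())) => WaitOutcome::Finished,
        // An explicit error reply ends the wait immediately.
        Ok(Err(reason)) => WaitOutcome::Failed(reason),
        // Nobody replied: the caller only learns about it at the deadline.
        Err(RecvTimeoutError::Timeout) => WaitOutcome::DeadlineExceeded,
        // The sending side was dropped without replying.
        Err(RecvTimeoutError::Disconnected) => {
            WaitOutcome::Failed("reply channel closed".to_string())
        }
    }
}
```

If nothing is ever sent, the caller returns only after the full deadline. If an error reply is sent, the caller returns right away, which is the behavior this PR aims for.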
Why It Matters
This is a daemon ↔ coordinator state-consistency issue.
Once a daemon is known disconnected, lifecycle operations that depend on it are no longer satisfiable and should fail immediately. Fast failure improves operator feedback, retry logic, and
distributed robustness.
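The fail-fast idea can be sketched as follows. This is a simplified stand-in under stated assumptions, not the code in this PR: `CoordinatorState`, `RunningBuild`, `DaemonId`, `BuildId`, and `fail_pending_builds_for` are hypothetical names, and `std::sync::mpsc` senders stand in for whatever reply mechanism the coordinator really uses. Only the `running_builds` and `finished_builds` names come from the description.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

// Hypothetical identifiers; the real coordinator uses its own types.
type DaemonId = String;
type BuildId = u32;
type BuildResult = Result<(), String>;

/// A build that is still waiting on results from some daemons.
struct RunningBuild {
    pending_daemons: Vec<DaemonId>,
    waiters: Vec<Sender<BuildResult>>,
}

#[derive(Default)]
struct CoordinatorState {
    running_builds: HashMap<BuildId, RunningBuild>,
    finished_builds: HashMap<BuildId, BuildResult>,
}

impl CoordinatorState {
    /// Fail every running build that still depends on `daemon`.
    /// Returns the ids of the failed builds in ascending order.
    fn fail_pending_builds_for(&mut self, daemon: &str) -> Vec<BuildId> {
        let mut affected: Vec<BuildId> = self
            .running_builds
            .iter()
            .filter(|(_, b)| b.pending_daemons.iter().any(|d| d.as_str() == daemon))
            .map(|(id, _)| *id)
            .collect();
        affected.sort_unstable();

        for id in &affected {
            if let Some(build) = self.running_builds.remove(id) {
                let err: BuildResult = Err(format!("daemon `{daemon}` disconnected"));
                for waiter in build.waiters {
                    // A waiter may already have gone away; ignore send errors.
                    let _ = waiter.send(err.clone());
                }
                // Record the failure so later queries see a finished build.
                self.finished_builds.insert(*id, err);
            }
        }
        affected
    }
}
```

Builds that do not depend on the disconnected daemon are left untouched. Pending spawn acknowledgements could be handled the same way with a second map.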
What This PR Changes
1) Centralized disconnect handling in coordinator
- Added handle_daemon_disconnect(...) to coordinator core logic.
- finished_builds,
2) Wired into watchdog timeout path
- In Event::DaemonHeartbeatInterval, when a daemon is dropped for a missing heartbeat, the coordinator now also triggers pending lifecycle failure propagation (not just connection removal).
3) Wired into daemon-exit notification path
- In CoordinatorNotify::daemon_exit, after removing the connection, the coordinator now triggers the same failure propagation logic.
4) Regression test
Added coordinator regression test:
- daemon_disconnect_fails_pending_waiters_immediately
The test validates:
- build_result waiters resolve promptly with an error,
- spawn_result waiters resolve promptly with an error,
- the affected build moves from running_builds to finished_builds.
Scope / Tradeoffs
Files Changed
- binaries/coordinator/src/lib.rs
- binaries/coordinator/src/listener.rs
- binaries/coordinator/src/server.rs
- binaries/coordinator/Cargo.toml (test dependency)
- Cargo.lock
Validation
- cargo fmt --all
- cargo check -p dora-coordinator
- cargo test -p dora-coordinator daemon_disconnect_fails_pending_waiters_immediately -- --nocapture