Skip to content

fix(coordinator): fail pending build/spawn waits when daemon disconnects#1466

Open
Bhanudahiyaa wants to merge 3 commits intodora-rs:mainfrom
Bhanudahiyaa:feature/coordinator-watchdog-fast-fail
Open

fix(coordinator): fail pending build/spawn waits when daemon disconnects#1466
Bhanudahiyaa wants to merge 3 commits intodora-rs:mainfrom
Bhanudahiyaa:feature/coordinator-watchdog-fast-fail

Conversation

@Bhanudahiyaa
Copy link
Copy Markdown
Contributor

Closes: #1465

Problem

The coordinator watchdog currently removes dead daemon connections, but pending lifecycle waits can remain unresolved:

  • wait_for_build can stay blocked when a build was waiting for a disconnected daemon.
  • wait_for_spawn can stay blocked when spawn acknowledgements are pending from a disconnected daemon.

This creates long hangs (up to client RPC deadline) instead of fast, deterministic failure.

Why It Matters

This is a daemon ↔ coordinator state-consistency issue.
Once a daemon is known disconnected, lifecycle operations that depend on it are no longer satisfiable and should fail immediately. Fast failure improves operator feedback, retry logic, and
distributed robustness.

What This PR Changes

1) Centralized disconnect handling in coordinator

  • Added handle_daemon_disconnect(...) to coordinator core logic.
  • On disconnect, it:
    • removes the daemon from pending build result sets,
    • records a structured build error reason,
    • finalizes affected builds as failed and moves them to finished_builds,
    • removes the daemon from pending spawn result sets,
    • fails affected spawn waiters immediately.

2) Wired into watchdog timeout path

  • In Event::DaemonHeartbeatInterval, when a daemon is dropped for missing heartbeat, coordinator now also triggers pending lifecycle failure propagation (not just connection removal).

3) Wired into daemon-exit notification path

  • In CoordinatorNotify::daemon_exit, after removing the connection, coordinator now triggers the same failure propagation logic.

4) Regression test

Added coordinator regression test:

  • daemon_disconnect_fails_pending_waiters_immediately

The test validates:

  • pending build_result waiters resolve promptly with an error,
  • pending spawn_result waiters resolve promptly with an error,
  • build state transitions from running_builds to finished_builds.

Scope / Tradeoffs

  • No protocol/API shape changes.
  • No behavior change for healthy daemon flows.
  • Focused on correctness under failure paths only.

Files Changed

  • binaries/coordinator/src/lib.rs
  • binaries/coordinator/src/listener.rs
  • binaries/coordinator/src/server.rs
  • binaries/coordinator/Cargo.toml (test dependency)
  • Cargo.lock

Validation

  • cargo fmt --all
  • cargo check -p dora-coordinator
  • cargo test -p dora-coordinator daemon_disconnect_fails_pending_waiters_immediately -- --nocapture

@Bhanudahiyaa
Copy link
Copy Markdown
Contributor Author

@phil-opp

Maintainers: this PR is ready for focused review on coordinator failure path correctness.

Quick status:

  • Scope is intentionally narrow and independent (daemon-disconnect -> pending waiter failure propagation).
  • No protocol/API surface changes.
  • Regression test included for the exact failure mode.
  • Most CI checks are green; only CI / CLI Test (ubuntu-latest) is currently failing.

What to review:

  1. handle_daemon_disconnect(...) logic and state transitions:
    • pending build waiters -> failed build result -> moved to finished_builds
    • pending spawn waiters -> immediate spawn failure
  2. Hook points:
    • watchdog timeout path
    • daemon_exit notification path
  3. Error semantics:
    • whether disconnect reason/context is sufficient for operator debugging

Note on the failing CI job:

  • This PR changes coordinator-side lifecycle failure handling only.
  • No CLI code changed.
  • If helpful, I can inspect the ubuntu CLI test logs and post a triage comment to confirm whether this is unrelated/flaky vs an actual regression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(coordinator): fail pending build/spawn waiters on daemon disconnect

1 participant