Fix race condition causing a spurious promote during a global DCS outage #20

Merged
avandras merged 1 commit into multisite from bugfix/stale-leader-observation on Mar 3, 2026


@ants commented Feb 25, 2026

The fallback leader observation mechanism used a non-quorum read, which can return a stale value of the multisite status. For the purpose of rewinding standbys this is fine, but with unlucky timing the main HA loop could observe the stale value and promote.

Fix this by running the leader observation fallback only while the node is not a leader. If the node is the leader, the regular heartbeat takes care of updating the view.

Observing a stale value during startup is not a problem, because promoting to local leader forces a write to the global DCS via resolve_leader().
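
To illustrate the guard described above, here is a minimal sketch. It is not the code from this PR: `MultisiteView`, `observe_leader_fallback`, and the `stale_read` callable are hypothetical names chosen for illustration, and the real multisite logic is assumed to be more involved.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MultisiteView:
    """Hypothetical local view of the multisite status (illustration only)."""
    is_leader: bool = False
    observed_leader: Optional[str] = None


def observe_leader_fallback(view: MultisiteView,
                            stale_read: Callable[[], Optional[str]]) -> bool:
    """Refresh the view from a non-quorum read that may return stale data.

    The read is skipped while this node is leader, so a stale value cannot
    overwrite the view that the regular heartbeat maintains.
    Returns True if the view was refreshed, False if the read was skipped.
    """
    if view.is_leader:
        # Leader: rely on the regular heartbeat to update the view.
        return False
    # Not leader: a possibly stale value is acceptable here
    # (for example, when deciding whether standbys need a rewind).
    view.observed_leader = stale_read()
    return True
```

With this guard, a leader never replaces its view with the result of the fallback read, while a non-leader still refreshes it as before.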

@avandras merged commit c420cd9 into multisite on Mar 3, 2026
36 of 48 checks passed