OCPBUGS-59238: podman-etcd: Redo counting of active_resources to avoid bug on rapid etcd restart #2082
Fix rapid restart failure in podman-etcd resource agent
Problem Statement
TNF (Two-Node Failover) clusters do not automatically recover from some etcd process crashes. When an etcd process is killed directly (bypassing Pacemaker's normal stop procedure), the cluster detects the failure via the monitor operation and attempts stop→start recovery, but the start operation fails. Recovering the cluster then requires manual intervention (`pcs resource cleanup etcd`).
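For context, the failure mode can be reproduced along these lines (a sketch only; the resource is assumed to be named `etcd`, and the exact kill method is illustrative):

```sh
# On the node currently running etcd, kill the etcd process directly,
# bypassing Pacemaker's normal stop procedure (illustrative only):
pkill -9 etcd

# Pacemaker's monitor operation detects the failure and attempts
# stop->start recovery; without this fix the start fails, and the
# resource has to be cleaned up manually:
pcs resource cleanup etcd
```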
Root Cause
During rapid restart scenarios (e.g., process crash recovery), Pacemaker's clone notification variables show resources in transitional states. Specifically, a resource can appear in both the `active` and `stop` lists simultaneously. The podman-etcd agent was using a naive word count of `OCF_RESKEY_CRM_meta_notify_active_resource`, which doesn't account for resources that are being stopped. This caused the agent to see 2 active resources when it expected only 1 (the standalone leader), leading to startup failure.
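For illustration, a minimal sketch of the transitional state as the agent might observe it during recovery (instance names and values are hypothetical):

```sh
# Hypothetical clone notification environment during stop->start recovery:
# etcd:1 is being stopped but is still listed as active.
OCF_RESKEY_CRM_meta_notify_active_resource="etcd:0 etcd:1"
OCF_RESKEY_CRM_meta_notify_stop_resource="etcd:1"

# A naive word count (e.g. wc -w) reports 2 active resources,
# even though only etcd:0 will actually remain active:
echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w    # -> 2
```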
Solution
According to the Pacemaker documentation, during "Post-notification (stop) / Pre-notification (start)" transitions the true set of active resources is the `active` list minus the `stop` list, so the count must be calculated as the number of resources in `OCF_RESKEY_CRM_meta_notify_active_resource` that do not also appear in `OCF_RESKEY_CRM_meta_notify_stop_resource`.
Changes Made
- Added `get_truly_active_resources_count()` helper function (lines 1032-1072): excludes resources in `active_resource` that also appear in `stop_resource` (see the sketch after this list).
- Updated the `active_resources_count` calculation in `podman_start` (line 1574) to use the new helper.
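A minimal sketch of what such a helper can look like (not the exact code from this PR; the variable names are the standard clone-notification meta variables, and the loop structure is illustrative):

```sh
get_truly_active_resources_count()
{
    local count=0 rsc

    # Count instances in the active list that are NOT also in the stop
    # list: during "post-notify (stop) / pre-notify (start)" a stopping
    # instance still appears in active_resource and must be excluded.
    for rsc in $OCF_RESKEY_CRM_meta_notify_active_resource; do
        case " $OCF_RESKEY_CRM_meta_notify_stop_resource " in
            *" $rsc "*) ;;                  # also being stopped: skip
            *) count=$((count + 1)) ;;
        esac
    done

    echo "$count"
}
```

`podman_start` would then take its count from this helper, e.g. `active_resources_count=$(get_truly_active_resources_count)`, instead of word-counting the raw `active_resource` list.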
References
- test/extended/two_node/tnf_recovery.go:173 (to be merged)