OCPBUGS-59238: podman-etcd: Redo counting of active_resources to avoid bug on rapid etcd restart #2082

fonta-rh · 2025-10-14T10:10:47Z

Fix rapid restart failure in podman-etcd resource agent

Problem Statement

TNF (Two-Node Failover) clusters do not automatically recover from some etcd process crashes. When an etcd process is killed directly (bypassing Pacemaker's normal stop procedure), the cluster detects the failure through the monitor operation and attempts a stop→start recovery, but the start operation fails with:

ERROR: Unexpected active resource count: 2

Recovering the cluster then requires manual intervention (pcs resource cleanup etcd).

Root Cause

During rapid restart scenarios (e.g., process crash recovery), Pacemaker's clone notification variables show resources in transitional states. Specifically, a resource can appear in both the active and stop lists simultaneously:

notify: type=pre, operation=stop,
  active=[etcd:0 etcd:1],    ← Both marked active
  start=[etcd:1],             ← master-1 is starting
  stop=[etcd:1]               ← master-1 is also stopping

The podman-etcd agent was using a naive word count of OCF_RESKEY_CRM_meta_notify_active_resource, which doesn't account for resources being stopped. This caused the agent to see 2 active resources when it expected only 1 (the standalone leader), leading to startup failure.
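To make the miscount concrete, the following minimal sketch reproduces the naive count with the values from the notification above. The variable values are copied from that example; the `naive_count` name is only illustrative and is not taken from the agent.

```shell
#!/bin/sh
# Values copied from the pre-stop notification shown above.
OCF_RESKEY_CRM_meta_notify_active_resource="etcd:0 etcd:1"
OCF_RESKEY_CRM_meta_notify_stop_resource="etcd:1"

# Naive count: every word in the active list is counted, including
# etcd:1, which is also listed as stopping.
naive_count=$(echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w)
# Strip any padding that some wc implementations add around the number.
naive_count=$((naive_count))

echo "naive active count: $naive_count"
```

With these inputs the naive count is 2, even though only etcd:0 remains active once etcd:1 has stopped.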

Solution

According to the Pacemaker documentation, during "Post-notification (stop) / Pre-notification (start)" transitions, the true active resource count must be calculated as:

Active resources = $OCF_RESKEY_CRM_meta_notify_active_resource
                   minus $OCF_RESKEY_CRM_meta_notify_stop_resource

Changes Made

Added get_truly_active_resources_count() helper function (lines 1032-1072):
- Implements the Pacemaker-documented algorithm for calculating true active count
- Filters out resources from active_resource that also appear in stop_resource
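The filtering step can be sketched roughly as follows. This is not the code from the PR: `count_truly_active` is a hypothetical function that takes the two lists as arguments, whereas the real helper presumably reads the `OCF_RESKEY_CRM_meta_notify_*` variables directly. It assumes both lists are space-separated, as in the notification example.

```shell
#!/bin/sh
# Hypothetical sketch (not the PR's implementation): count the entries of
# the active list that do not also appear in the stop list.
# $1: space-separated active resources, $2: space-separated stopping resources
count_truly_active() {
    active_list=$1
    stop_list=$2
    count=0
    for resource in $active_list; do
        is_stopping=0
        for stopping in $stop_list; do
            if [ "$resource" = "$stopping" ]; then
                is_stopping=1
                break
            fi
        done
        # Only resources that are not about to stop count as active.
        if [ "$is_stopping" -eq 0 ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Values from the rapid-restart notification: etcd:1 is active and stopping.
count_truly_active "etcd:0 etcd:1" "etcd:1"
```

For the rapid-restart case this yields 1, which matches the single standalone leader that podman_start expects.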

Updated active_resources_count calculation in podman_start (line 1574):

# Before (BROKEN):
active_resources_count=$(echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w)

# After (FIXED):
active_resources_count=$(get_truly_active_resources_count)

References

Bug Report: OCPBUGS-59238
Pacemaker Documentation: https://clusterlabs.org/projects/pacemaker/doc/2.1/Pacemaker_Administration/html/agents.html#interpretation-of-notification-variables
Test Case: "should recover from etcd process crash" in test/extended/two_node/tnf_recovery.go:173 (not yet merged)

knet-jenkins · 2025-10-14T10:11:42Z

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/1/input

knet-jenkins · 2025-10-17T09:29:45Z

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/2/input

knet-jenkins · 2025-10-20T12:20:31Z

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/3/input

knet-jenkins · 2025-10-20T13:28:40Z

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/4/input

knet-jenkins · 2025-10-20T15:38:06Z

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/5/input

knet-jenkins · 2025-10-21T10:16:05Z

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/6/input

knet-jenkins · 2025-10-21T10:22:32Z

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/7/input

knet-jenkins · 2025-10-21T12:34:36Z

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/8/input

Redo counting of active_resources

d5b4428

oalbrigt changed the title ~~OCPBUGS-59238: Redo counting of active_resources to avoid bug on rapid etcd restart~~ OCPBUGS-59238: podman-etcd: Redo counting of active_resources to avoid bug on rapid etcd restart Oct 14, 2025

Reduce timeout to test

382b371

Add retriable check for force-new-cluster on joining as learner

586ead0

Move force_new_cluster check before podman simple check

5100d70

fonta-rh force-pushed the OCPBUGS-59238-fix-active-resource-count branch from 1255998 to 5100d70 Compare October 21, 2025 12:33
