fonta-rh commented Oct 14, 2025

Fix rapid restart failure in podman-etcd resource agent

Problem Statement

TNF (Two Node Fencing) clusters do not automatically recover from some etcd process crashes. When an etcd process is killed directly (bypassing Pacemaker's normal stop procedure), the cluster detects the failure via the monitor operation and attempts a stop→start recovery, but the start operation fails with:

ERROR: Unexpected active resource count: 2

This requires manual intervention (pcs resource cleanup etcd) to recover the cluster.

Root Cause

During rapid restart scenarios (e.g., process crash recovery), Pacemaker's clone notification variables show resources in transitional states. Specifically, a resource can appear in both the active and stop lists simultaneously:

notify: type=pre, operation=stop,
  active=[etcd:0 etcd:1],    ← Both marked active
  start=[etcd:1],             ← master-1 is starting
  stop=[etcd:1]               ← master-1 is also stopping

The podman-etcd agent was using a naive word count of OCF_RESKEY_CRM_meta_notify_active_resource, which doesn't account for resources being stopped. This caused the agent to see 2 active resources when it expected only 1 (the standalone leader), leading to startup failure.
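
To make the failure concrete, the snippet below replays the naive count against the active list from the notification shown above. It is only an illustration: the variable is set by hand here, whereas in a real agent run Pacemaker exports it.

```shell
# Value copied from the example notification; normally exported by Pacemaker.
OCF_RESKEY_CRM_meta_notify_active_resource="etcd:0 etcd:1"

# Naive count: every word in the active list is treated as a running instance,
# even if the same instance is also scheduled to stop.
naive_count=$(echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w)
naive_count=$((naive_count))   # normalise padding that some wc builds emit

echo "naive active count: $naive_count"
```

The count comes out as 2 even though etcd:1 is on its way down, which is the value that trips the "Unexpected active resource count" check.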

Solution

According to the Pacemaker documentation, during "Post-notification (stop) / Pre-notification (start)" transitions, the true active resource count must be calculated as:

Active resources = $OCF_RESKEY_CRM_meta_notify_active_resource
                   minus $OCF_RESKEY_CRM_meta_notify_stop_resource
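
As a rough sketch of that subtraction, the function below counts the entries of an active list that do not also appear in a stop list. The name count_active_minus_stop and its two positional arguments are hypothetical and are not the helper added by this PR; the sketch also assumes both lists are separated by single spaces.

```shell
# Hypothetical helper (not the PR's code): count entries of the active list
# that are not also present in the stop list.
# $1 = space-separated active list, $2 = space-separated stop list
count_active_minus_stop() {
    active_list=$1
    stop_list=$2
    count=0
    for res in $active_list; do
        # Pad with spaces so that "etcd:1" cannot match inside "etcd:10".
        case " $stop_list " in
            *" $res "*) ;;                 # also stopping: do not count it
            *) count=$((count + 1)) ;;     # active and not stopping
        esac
    done
    echo "$count"
}

# Lists taken from the example notification in the Root Cause section.
count_active_minus_stop "etcd:0 etcd:1" "etcd:1"
```

In an agent, the two arguments would presumably be "$OCF_RESKEY_CRM_meta_notify_active_resource" and "$OCF_RESKEY_CRM_meta_notify_stop_resource".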

Changes Made

  1. Added get_truly_active_resources_count() helper function (lines 1032-1072):

    • Implements the Pacemaker-documented algorithm for calculating true active count
    • Filters out resources from active_resource that also appear in stop_resource
  2. Updated active_resources_count calculation in podman_start (line 1574):

    # Before (BROKEN):
    active_resources_count=$(echo "$OCF_RESKEY_CRM_meta_notify_active_resource" | wc -w)
    
    # After (FIXED):
    active_resources_count=$(get_truly_active_resources_count)

knet-jenkins bot commented Oct 14, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2082/1/input

oalbrigt changed the title from "OCPBUGS-59238: Redo counting of active_resources to avoid bug on rapid etcd restart" to "OCPBUGS-59238: podman-etcd: Redo counting of active_resources to avoid bug on rapid etcd restart" on Oct 14, 2025
fonta-rh force-pushed the OCPBUGS-59238-fix-active-resource-count branch from 1255998 to 5100d70 on October 21, 2025 12:33