🐛 Fix health check to support partial probe results with multi-probe configuration #342

zhujian7 · 2025-10-24T09:04:28Z

Summary

This PR fixes issue #341 where the health check incorrectly marks addons as Unknown when using multiple health check probes and the first probe returns empty feedback results.

Problem:
When multiple health check ProbeFields are configured (e.g., checking both deployment and OLM), if the first probe returns empty feedback results, the code immediately returns and sets the addon status to Unknown. This prevents the evaluation of subsequent probes that may have valid results.

Solution:

Modified the logic to collect all empty probe fields instead of returning immediately
Only mark the addon as Unknown if ALL probes have no results
Allow partial probe results to be considered valid (if at least one probe returns results)
Removed redundant check for non-empty results in healthChecker path

Changes:

Added emptyProbeFields slice to track probes with no results
Changed early return to continue when a probe has no results
Added check after the loop to determine if all probes are empty
Improved error messaging to differentiate between single vs multiple empty probes
Added comprehensive test case for the multi-probe scenario

Related issue(s)

Fixes #341

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Health checks now aggregate results from multiple probes before deciding availability, avoiding immediate "unknown" reports and allowing partial probe success to mark an add-on available.
- Consolidated empty-probe handling into a single post-check status update with a clearer message.
Tests
- Added a test validating multi-probe aggregation when initial probes are empty but later probes succeed.
Integration
- Updated integration expectation: unprobeable add-ons now treated as unavailable (False) instead of Unknown.

…configuration When multiple health check probes are configured, if the first probe returns empty feedback results, the health check was incorrectly marking the addon as Unknown and returning early, preventing other probes from being evaluated. This fix changes the logic to: - Collect all empty probe fields instead of returning immediately - Only mark the addon as Unknown if ALL probes have no results - Allow partial probe results to be considered valid - Remove redundant check for non-empty results after healthChecker Fixes open-cluster-management-io#341 Signed-off-by: Claude <[email protected]> Signed-off-by: zhujian <[email protected]>

openshift-ci · 2025-10-24T09:04:39Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zhujian7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [zhujian7]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2025-10-24T09:04:48Z

Walkthrough

Aggregate multi-probe results in probeAddonStatusByWorks instead of early-returning on empty probe feedback; if all probes are empty set Available=Unknown, otherwise evaluate using collected results. Tests updated to cover multi-probe partial-empty scenarios and an integration expectation change.

Changes

Cohort / File(s)	Summary
Probe aggregation logic `pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go`	Collect probe results into a local `fieldResults` slice for all probes, defer empty-probe decision until after loop, and set ManagedClusterAddOnConditionAvailable=Unknown only if every probe produced no results. Adjusted healthChecker invocation to use aggregated `fieldResults`.
Unit tests (healthcheck) `pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go`	Added a test case for HealthCheckModeCustomized with a multi-probe prober where the first probe returns empty and the second returns valid results; expects Available=True with probe-available reason and message.
Integration test (agent deploy) `test/integration/kube/agent_deploy_test.go`	Changed expected ManagedClusterAddOn Available condition when the addon cannot be probed from `Unknown` to `False` and updated related assertion/comment.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

🌱 Deprecate work healthCheck use healthChecker instead #324 — modifies the same probe aggregation/health-check invocation logic and related tests.

Suggested labels

lgtm

Suggested reviewers

qiujian16
zhiweiyin318
elgnay

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Out of Scope Changes Check	❓ Inconclusive	The changes to healthcheck_sync.go and healthcheck_sync_test.go are clearly in scope for fixing issue #341 regarding multi-probe health check behavior. However, the modification to agent_deploy_test.go warrants clarification: the test expectation is changed from ConditionUnknown to ConditionFalse when the addon cannot be probed. The PR description and objectives state that the addon should be marked Unknown only when ALL probes have no results, but this test change appears to contradict that by expecting False instead of Unknown. The PR documentation does not explicitly explain the rationale for this behavioral change, making it unclear whether this is a necessary consequence of the refactored logic or an out-of-scope modification.	Please clarify why the agent_deploy_test.go expectation was changed from ConditionUnknown to ConditionFalse. If this change is intentional and related to the multi-probe fix, update the PR description to explain the behavioral change and ensure it aligns with the stated objective that addons should only be marked Unknown when all probes return no results. If this change is unrelated to the multi-probe fix, consider removing it from this PR.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The pull request title "🐛 Fix health check to support partial probe results with multi-probe configuration" is directly related to the main change in the changeset. The title clearly identifies this as a bug fix (indicated by the 🐛 emoji) and specifically describes the core issue being addressed: enabling partial probe results support in multi-probe configurations. The title is concise, avoids vague language or noise, and a developer scanning commit history would immediately understand the primary change is about fixing health check behavior for multiple probes.
Linked Issues Check	✅ Passed	The code changes directly address all coding requirements from issue #341. The modifications to healthcheck_sync.go implement the core solution: collecting results from all probes instead of returning immediately when the first probe has empty feedback, tracking empty probes with a new emptyProbeFields slice, and only setting the addon status to Unknown when ALL configured probes return no results. The new test case in healthcheck_sync_test.go validates the multi-probe scenario where the first probe returns empty results while the second returns valid results, confirming the addon Available condition is correctly set to True. This aligns with the expected behavior stated in issue #341 that at least one probe with valid results should result in a True available condition.
Description Check	✅ Passed	The pull request description follows the repository template and includes all required sections. It contains a comprehensive Summary section that explains the problem (immediate return on first empty probe), the solution (defer Unknown decision until all probes are checked), and a detailed list of changes made. The Related issue(s) section properly links to issue #341 with "Fixes #341". The description provides sufficient context for reviewers to understand the fix, including the use case of multiple health check probes where some may return empty results while others have valid feedback.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (3)
202-206: Minor style: local var should be lowerCamel.

Rename FieldResults to fieldResults for idiomatic Go naming. Optional.

233-251: All-empty check and messaging look right; tiny UX tweak optional.

Logic is sound. Consider listing up to N missing identifiers (e.g., first 3) before falling back to a count to avoid very long messages in large configs. Optional.

252-265: Guard healthChecker call when no probe fields are configured.

If probeFields == 0, FieldResults stays empty and the all-empty check doesn’t trigger; healthChecker would receive an empty slice. Verify your healthChecker handles that. If not, add a small guard.

Proposed minimal guard:
@@
- if healthChecker != nil {
+ // Avoid calling healthChecker with no inputs when no probe fields are configured.
+ if len(probeFields) == 0 {
+   meta.SetStatusCondition(&addon.Status.Conditions, metav1.Condition{
+     Type:    addonapiv1alpha1.ManagedClusterAddOnConditionAvailable,
+     Status:  metav1.ConditionUnknown,
+     Reason:  addonapiv1alpha1.AddonAvailableReasonNoProbeResult,
+     Message: "No probe fields are configured for this add-on",
+   })
+   return nil
+ }
+ if healthChecker != nil {
   if err := healthChecker(FieldResults, cluster, addon); err != nil {
To confirm current behavior, please check utils.DeploymentAvailabilityHealthChecker and utils.WorkloadAvailabilityHealthChecker for empty-slice handling.
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (1)

1262-1329: Add a complementary all-empty multi-probe test.

Recommend adding a test where all configured probes return no feedback to assert:

Available condition Unknown

Reason AddonAvailableReasonNoProbeResult

Optionally, message uses the plural path.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c2f5c52 and 38c5bec.

📒 Files selected for processing (2)

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (2 hunks)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (3)

vendor/open-cluster-management.io/api/work/v1/types.go (1)

ResourceIdentifier (391-411)

vendor/open-cluster-management.io/api/addon/v1alpha1/types_managedclusteraddon.go (2)

ManagedClusterAddOnConditionAvailable (317-317)

AddonAvailableReasonNoProbeResult (369-369)

vendor/k8s.io/api/core/v1/types.go (1)

ConditionUnknown (3092-3092)

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (3)

pkg/addonmanager/addontesting/helpers.go (1)

NewAddonWithConditions (90-100)

vendor/open-cluster-management.io/api/work/v1/types.go (11)

ManifestWork (18-28)

ManifestWorkSpec (37-57)

ManifestWorkStatus (465-479)

ManifestResourceStatus (483-492)

ManifestCondition (530-542)

ManifestResourceMeta (417-445)

StatusFeedbackResult (545-551)

FeedbackValue (553-565)

FieldValue (569-590)

Integer (596-596)

WorkAvailable (503-503)

vendor/open-cluster-management.io/api/addon/v1alpha1/types_managedclusteraddon.go (3)

HealthCheckModeCustomized (218-218)

ManagedClusterAddOnConditionAvailable (317-317)

AddonAvailableReasonProbeAvailable (377-377)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: build
GitHub Check: e2e-hosted
GitHub Check: unit
GitHub Check: e2e
GitHub Check: integration
GitHub Check: e2e-cloudevents
GitHub Check: verify
GitHub Check: verify-deps

🔇 Additional comments (2)

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (1)

191-201: Good fix: defer Unknown and continue scanning probes.

Collecting empty probe fields and continuing instead of early-returning correctly enables partial-result handling across multi-probe configs.

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (1)

1262-1329: Nice coverage for partial results across multi-probe.

This test nails the regression: empty first probe no longer masks valid later probes.

Simplify the probe result collection logic by: - Collecting all probe results upfront, regardless of healthChecker - Checking for empty results based on FieldResults length - Removing unnecessary tracking of empty probe fields - Updating test expectations for multi-probe scenarios This makes the code cleaner and maintains the fix for supporting partial probe results with multi-probe configurations. Signed-off-by: Jia Zhu <[email protected]> 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> Signed-off-by: zhujian <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (1)
192-198: Key fix: Continue checking remaining probes instead of early return.

This correctly addresses the core issue. When one probe returns no results, the code now continues to check other probes rather than immediately marking the addon as Unknown.

The comment at line 195 could be slightly clearer:
-		// if no results are returned. it is possible that work agent has not returned the feedback value.
-		// collect these fields and check if all probes are empty later
+		// If no results are returned, skip this probe field. It's possible the work agent
+		// has not returned feedback yet. We'll check if all probes are empty after the loop.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38c5bec and b2750f5.

📒 Files selected for processing (2)

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (2 hunks)
test/integration/kube/agent_deploy_test.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

test/integration/kube/agent_deploy_test.go (1)

vendor/k8s.io/api/core/v1/types.go (1)

ConditionFalse (3091-3091)

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (4)

pkg/agent/inteface.go (1)

FieldResult (205-211)

vendor/open-cluster-management.io/api/work/v1/types.go (1)

ResourceIdentifier (391-411)

vendor/open-cluster-management.io/api/addon/v1alpha1/types_managedclusteraddon.go (2)

ManagedClusterAddOnConditionAvailable (317-317)

AddonAvailableReasonNoProbeResult (369-369)

vendor/k8s.io/api/core/v1/types.go (1)

ConditionUnknown (3092-3092)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: e2e
GitHub Check: integration
GitHub Check: unit
GitHub Check: e2e-cloudevents
GitHub Check: e2e-hosted
GitHub Check: verify

🔇 Additional comments (5)

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (4)

190-190: LGTM: Variable renamed to follow Go conventions.

The rename from FieldResults to fieldResults follows Go naming conventions for local variables.

200-229: LGTM: Correctly accumulates probe results for aggregated health checking.

The logic properly collects results from all probes that return feedback:

Accumulates results into fieldResults for later processing by healthChecker

Maintains backward compatibility with the deprecated healthCheck function

Continues to next probe when healthChecker is set, enabling multi-probe aggregation

231-240: Correctly handles the all-empty-probes case.

This post-loop check properly implements the fix: only mark the addon as Unknown when all configured probes return no results. If at least one probe returns feedback, the code proceeds to health checking, allowing partial probe results to be considered valid.

242-255: LGTM: Supports partial probe results with clear documentation.

The comments clearly explain the intent, and the healthChecker now correctly receives aggregated results from all probes that returned feedback. This enables the health checker to make informed decisions based on partial probe results.

test/integration/kube/agent_deploy_test.go (1)

925-926: LGTM: Test expectation correctly updated to reflect new behavior.

The change from ConditionUnknown to ConditionFalse is correct. Under the new logic:

The deployment probe returns results (albeit with incorrect feedback field "ReplicasTest")

The daemonset probe returns no results

Since at least one probe has results, the code no longer sets Unknown early

The healthChecker is invoked with the deployment's results

The health checker fails because the deployment feedback is invalid and the daemonset is not probed

Result: Available=False instead of Unknown

This properly validates the fix for partial probe results.

zhujian7 · 2025-10-24T11:50:18Z

/cc @zhiweiyin318 @qiujian16

zhiweiyin318 · 2025-10-24T14:41:25Z

/hold
/lgtm

zhiweiyin318 · 2025-10-24T14:43:59Z

feel free to help review this. cc @tesshuflower

tesshuflower · 2025-10-24T16:19:28Z

/lgtm

I've also done some tests in my env with this fix and it's working well - Thanks @zhujian7 and @zhiweiyin318 for such a quick fix!

openshift-ci · 2025-10-24T16:19:31Z

@tesshuflower: changing LGTM is restricted to collaborators

In response to this:

/lgtm

I've also done some tests in my env with this fix and it's working well - Thanks @zhujian7 and @zhiweiyin318 for such a quick fix!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

qiujian16 · 2025-10-25T00:56:28Z

do we need to cherry-pick to release-1.1?

zhujian7 · 2025-10-27T01:50:24Z

/cherrypick release-1.1

openshift-cherrypick-robot · 2025-10-27T01:50:27Z

@zhujian7: once the present PR merges, I will cherry-pick it on top of release-1.1 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

zhujian7 · 2025-10-27T07:25:27Z

/unhold

openshift-cherrypick-robot · 2025-10-27T07:30:15Z

@zhujian7: new pull request created: #343

In response to this:

/cherrypick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci bot requested review from qiujian16 and zhiweiyin318 October 24, 2025 09:04

openshift-ci bot added the approved label Oct 24, 2025

coderabbitai bot reviewed Oct 24, 2025

View reviewed changes

zhujian7 force-pushed the fix-multi-probe-empty-results branch from f7e3724 to b2750f5 Compare October 24, 2025 10:15

coderabbitai bot reviewed Oct 24, 2025

View reviewed changes

openshift-ci bot added the do-not-merge/hold label Oct 24, 2025

openshift-ci bot assigned zhiweiyin318 Oct 24, 2025

openshift-ci bot added the lgtm label Oct 24, 2025

openshift-ci bot removed the do-not-merge/hold label Oct 27, 2025

zhujian7 mentioned this pull request Oct 27, 2025

support different ProbeFields in WorkHealthProber #287

Closed

openshift-merge-bot bot merged commit 6e81408 into open-cluster-management-io:main Oct 27, 2025
15 checks passed

openshift-cherrypick-robot mentioned this pull request Oct 27, 2025

[release-1.1] 🐛 Fix health check to support partial probe results with multi-probe configuration #343

Merged

🐛 Fix health check to support partial probe results with multi-probe configuration #342

🐛 Fix health check to support partial probe results with multi-probe configuration #342

Uh oh!

Conversation

zhujian7 commented Oct 24, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related issue(s)

Summary by CodeRabbit

Uh oh!

openshift-ci bot commented Oct 24, 2025

Uh oh!

coderabbitai bot commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

zhujian7 commented Oct 24, 2025

Uh oh!

zhiweiyin318 commented Oct 24, 2025

Uh oh!

zhiweiyin318 commented Oct 24, 2025

Uh oh!

tesshuflower commented Oct 24, 2025

Uh oh!

openshift-ci bot commented Oct 24, 2025

Uh oh!

qiujian16 commented Oct 25, 2025

Uh oh!

zhujian7 commented Oct 27, 2025

Uh oh!

openshift-cherrypick-robot commented Oct 27, 2025

Uh oh!

zhujian7 commented Oct 27, 2025

Uh oh!

Uh oh!

openshift-cherrypick-robot commented Oct 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

zhujian7 commented Oct 24, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 24, 2025 •

edited

Loading