Skip to content

Conversation

@zhujian7
Copy link
Member

@zhujian7 zhujian7 commented Oct 24, 2025

Summary

This PR fixes issue #341 where the health check incorrectly marks addons as Unknown when using multiple health check probes and the first probe returns empty feedback results.

Problem:
When multiple health check ProbeFields are configured (e.g., checking both deployment and OLM), if the first probe returns empty feedback results, the code immediately returns and sets the addon status to Unknown. This prevents the evaluation of subsequent probes that may have valid results.

Solution:

  • Modified the logic to collect all empty probe fields instead of returning immediately
  • Only mark the addon as Unknown if ALL probes have no results
  • Allow partial probe results to be considered valid (if at least one probe returns results)
  • Removed redundant check for non-empty results in healthChecker path

Changes:

  1. Added emptyProbeFields slice to track probes with no results
  2. Changed early return to continue when a probe has no results
  3. Added check after the loop to determine if all probes are empty
  4. Improved error messaging to differentiate between single vs multiple empty probes
  5. Added comprehensive test case for the multi-probe scenario

Related issue(s)

Fixes #341

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Health checks now aggregate results from multiple probes before deciding availability, avoiding immediate "unknown" reports and allowing partial probe success to mark an add-on available.
    • Consolidated empty-probe handling into a single post-check status update with a clearer message.
  • Tests

    • Added a test validating multi-probe aggregation when initial probes are empty but later probes succeed.
  • Integration

    • Updated integration expectation: unprobeable add-ons now treated as unavailable (False) instead of Unknown.

…configuration

When multiple health check probes are configured, if the first probe
returns empty feedback results, the health check was incorrectly marking
the addon as Unknown and returning early, preventing other probes from
being evaluated.

This fix changes the logic to:
- Collect all empty probe fields instead of returning immediately
- Only mark the addon as Unknown if ALL probes have no results
- Allow partial probe results to be considered valid
- Remove redundant check for non-empty results after healthChecker

Fixes open-cluster-management-io#341

Signed-off-by: Claude <[email protected]>
Signed-off-by: zhujian <[email protected]>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 24, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zhujian7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Oct 24, 2025

Walkthrough

Aggregate multi-probe results in probeAddonStatusByWorks instead of early-returning on empty probe feedback; if all probes are empty set Available=Unknown, otherwise evaluate using collected results. Tests updated to cover multi-probe partial-empty scenarios and an integration expectation change.

Changes

Cohort / File(s) Summary
Probe aggregation logic
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go
Collect probe results into a local fieldResults slice for all probes, defer empty-probe decision until after loop, and set ManagedClusterAddOnConditionAvailable=Unknown only if every probe produced no results. Adjusted healthChecker invocation to use aggregated fieldResults.
Unit tests (healthcheck)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go
Added a test case for HealthCheckModeCustomized with a multi-probe prober where the first probe returns empty and the second returns valid results; expects Available=True with probe-available reason and message.
Integration test (agent deploy)
test/integration/kube/agent_deploy_test.go
Changed expected ManagedClusterAddOn Available condition when the addon cannot be probed from Unknown to False and updated related assertion/comment.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

lgtm

Suggested reviewers

  • qiujian16
  • zhiweiyin318
  • elgnay

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Out of Scope Changes Check ❓ Inconclusive The changes to healthcheck_sync.go and healthcheck_sync_test.go are clearly in scope for fixing issue #341 regarding multi-probe health check behavior. However, the modification to agent_deploy_test.go warrants clarification: the test expectation is changed from ConditionUnknown to ConditionFalse when the addon cannot be probed. The PR description and objectives state that the addon should be marked Unknown only when ALL probes have no results, but this test change appears to contradict that by expecting False instead of Unknown. The PR documentation does not explicitly explain the rationale for this behavioral change, making it unclear whether this is a necessary consequence of the refactored logic or an out-of-scope modification. Please clarify why the agent_deploy_test.go expectation was changed from ConditionUnknown to ConditionFalse. If this change is intentional and related to the multi-probe fix, update the PR description to explain the behavioral change and ensure it aligns with the stated objective that addons should only be marked Unknown when all probes return no results. If this change is unrelated to the multi-probe fix, consider removing it from this PR.
✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title "🐛 Fix health check to support partial probe results with multi-probe configuration" is directly related to the main change in the changeset. The title clearly identifies this as a bug fix (indicated by the 🐛 emoji) and specifically describes the core issue being addressed: enabling partial probe results support in multi-probe configurations. The title is concise, avoids vague language or noise, and a developer scanning commit history would immediately understand the primary change is about fixing health check behavior for multiple probes.
Linked Issues Check ✅ Passed The code changes directly address all coding requirements from issue #341. The modifications to healthcheck_sync.go implement the core solution: collecting results from all probes instead of returning immediately when the first probe has empty feedback, tracking empty probes with a new emptyProbeFields slice, and only setting the addon status to Unknown when ALL configured probes return no results. The new test case in healthcheck_sync_test.go validates the multi-probe scenario where the first probe returns empty results while the second returns valid results, confirming the addon Available condition is correctly set to True. This aligns with the expected behavior stated in issue #341 that at least one probe with valid results should result in a True available condition.
Description Check ✅ Passed The pull request description follows the repository template and includes all required sections. It contains a comprehensive Summary section that explains the problem (immediate return on first empty probe), the solution (defer Unknown decision until all probes are checked), and a detailed list of changes made. The Related issue(s) section properly links to issue #341 with "Fixes #341". The description provides sufficient context for reviewers to understand the fix, including the use case of multiple health check probes where some may return empty results while others have valid feedback.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (3)

202-206: Minor style: local var should be lowerCamel.

Rename FieldResults to fieldResults for idiomatic Go naming. Optional.


233-251: All-empty check and messaging look right; tiny UX tweak optional.

Logic is sound. Consider listing up to N missing identifiers (e.g., first 3) before falling back to a count to avoid very long messages in large configs. Optional.


252-265: Guard healthChecker call when no probe fields are configured.

If probeFields == 0, FieldResults stays empty and the all-empty check doesn’t trigger; healthChecker would receive an empty slice. Verify your healthChecker handles that. If not, add a small guard.

Proposed minimal guard:

@@
- if healthChecker != nil {
+ // Avoid calling healthChecker with no inputs when no probe fields are configured.
+ if len(probeFields) == 0 {
+   meta.SetStatusCondition(&addon.Status.Conditions, metav1.Condition{
+     Type:    addonapiv1alpha1.ManagedClusterAddOnConditionAvailable,
+     Status:  metav1.ConditionUnknown,
+     Reason:  addonapiv1alpha1.AddonAvailableReasonNoProbeResult,
+     Message: "No probe fields are configured for this add-on",
+   })
+   return nil
+ }
+ if healthChecker != nil {
   if err := healthChecker(FieldResults, cluster, addon); err != nil {

To confirm current behavior, please check utils.DeploymentAvailabilityHealthChecker and utils.WorkloadAvailabilityHealthChecker for empty-slice handling.

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (1)

1262-1329: Add a complementary all-empty multi-probe test.

Recommend adding a test where all configured probes return no feedback to assert:

  • Available condition Unknown
  • Reason AddonAvailableReasonNoProbeResult
  • Optionally, message uses the plural path.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c2f5c52 and 38c5bec.

📒 Files selected for processing (2)
  • pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (2 hunks)
  • pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (3)
vendor/open-cluster-management.io/api/work/v1/types.go (1)
  • ResourceIdentifier (391-411)
vendor/open-cluster-management.io/api/addon/v1alpha1/types_managedclusteraddon.go (2)
  • ManagedClusterAddOnConditionAvailable (317-317)
  • AddonAvailableReasonNoProbeResult (369-369)
vendor/k8s.io/api/core/v1/types.go (1)
  • ConditionUnknown (3092-3092)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (3)
pkg/addonmanager/addontesting/helpers.go (1)
  • NewAddonWithConditions (90-100)
vendor/open-cluster-management.io/api/work/v1/types.go (11)
  • ManifestWork (18-28)
  • ManifestWorkSpec (37-57)
  • ManifestWorkStatus (465-479)
  • ManifestResourceStatus (483-492)
  • ManifestCondition (530-542)
  • ManifestResourceMeta (417-445)
  • StatusFeedbackResult (545-551)
  • FeedbackValue (553-565)
  • FieldValue (569-590)
  • Integer (596-596)
  • WorkAvailable (503-503)
vendor/open-cluster-management.io/api/addon/v1alpha1/types_managedclusteraddon.go (3)
  • HealthCheckModeCustomized (218-218)
  • ManagedClusterAddOnConditionAvailable (317-317)
  • AddonAvailableReasonProbeAvailable (377-377)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build
  • GitHub Check: e2e-hosted
  • GitHub Check: unit
  • GitHub Check: e2e
  • GitHub Check: integration
  • GitHub Check: e2e-cloudevents
  • GitHub Check: verify
  • GitHub Check: verify-deps
🔇 Additional comments (2)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (1)

191-201: Good fix: defer Unknown and continue scanning probes.

Collecting empty probe fields and continuing instead of early-returning correctly enables partial-result handling across multi-probe configs.

pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (1)

1262-1329: Nice coverage for partial results across multi-probe.

This test nails the regression: empty first probe no longer masks valid later probes.

Simplify the probe result collection logic by:
- Collecting all probe results upfront, regardless of healthChecker
- Checking for empty results based on FieldResults length
- Removing unnecessary tracking of empty probe fields
- Updating test expectations for multi-probe scenarios

This makes the code cleaner and maintains the fix for supporting
partial probe results with multi-probe configurations.

Signed-off-by: Jia Zhu <[email protected]>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: zhujian <[email protected]>
@zhujian7 zhujian7 force-pushed the fix-multi-probe-empty-results branch from f7e3724 to b2750f5 Compare October 24, 2025 10:15
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (1)

192-198: Key fix: Continue checking remaining probes instead of early return.

This correctly addresses the core issue. When one probe returns no results, the code now continues to check other probes rather than immediately marking the addon as Unknown.

The comment at line 195 could be slightly clearer:

-		// if no results are returned. it is possible that work agent has not returned the feedback value.
-		// collect these fields and check if all probes are empty later
+		// If no results are returned, skip this probe field. It's possible the work agent
+		// has not returned feedback yet. We'll check if all probes are empty after the loop.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38c5bec and b2750f5.

📒 Files selected for processing (2)
  • pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (2 hunks)
  • test/integration/kube/agent_deploy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
test/integration/kube/agent_deploy_test.go (1)
vendor/k8s.io/api/core/v1/types.go (1)
  • ConditionFalse (3091-3091)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (4)
pkg/agent/inteface.go (1)
  • FieldResult (205-211)
vendor/open-cluster-management.io/api/work/v1/types.go (1)
  • ResourceIdentifier (391-411)
vendor/open-cluster-management.io/api/addon/v1alpha1/types_managedclusteraddon.go (2)
  • ManagedClusterAddOnConditionAvailable (317-317)
  • AddonAvailableReasonNoProbeResult (369-369)
vendor/k8s.io/api/core/v1/types.go (1)
  • ConditionUnknown (3092-3092)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: e2e
  • GitHub Check: integration
  • GitHub Check: unit
  • GitHub Check: e2e-cloudevents
  • GitHub Check: e2e-hosted
  • GitHub Check: verify
🔇 Additional comments (5)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (4)

190-190: LGTM: Variable renamed to follow Go conventions.

The rename from FieldResults to fieldResults follows Go naming conventions for local variables.


200-229: LGTM: Correctly accumulates probe results for aggregated health checking.

The logic properly collects results from all probes that return feedback:

  • Accumulates results into fieldResults for later processing by healthChecker
  • Maintains backward compatibility with the deprecated healthCheck function
  • Continues to next probe when healthChecker is set, enabling multi-probe aggregation

231-240: Correctly handles the all-empty-probes case.

This post-loop check properly implements the fix: only mark the addon as Unknown when all configured probes return no results. If at least one probe returns feedback, the code proceeds to health checking, allowing partial probe results to be considered valid.


242-255: LGTM: Supports partial probe results with clear documentation.

The comments clearly explain the intent, and the healthChecker now correctly receives aggregated results from all probes that returned feedback. This enables the health checker to make informed decisions based on partial probe results.

test/integration/kube/agent_deploy_test.go (1)

925-926: LGTM: Test expectation correctly updated to reflect new behavior.

The change from ConditionUnknown to ConditionFalse is correct. Under the new logic:

  1. The deployment probe returns results (albeit with incorrect feedback field "ReplicasTest")
  2. The daemonset probe returns no results
  3. Since at least one probe has results, the code no longer sets Unknown early
  4. The healthChecker is invoked with the deployment's results
  5. The health checker fails because the deployment feedback is invalid and the daemonset is not probed
  6. Result: Available=False instead of Unknown

This properly validates the fix for partial probe results.

@zhujian7
Copy link
Member Author

/cc @zhiweiyin318 @qiujian16

@zhiweiyin318
Copy link
Member

/hold
/lgtm

@zhiweiyin318
Copy link
Member

feel free to help review this. cc @tesshuflower

@tesshuflower
Copy link
Contributor

/lgtm

I've also done some tests in my env with this fix and it's working well - Thanks @zhujian7 and @zhiweiyin318 for such a quick fix!

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 24, 2025

@tesshuflower: changing LGTM is restricted to collaborators

In response to this:

/lgtm

I've also done some tests in my env with this fix and it's working well - Thanks @zhujian7 and @zhiweiyin318 for such a quick fix!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@qiujian16
Copy link
Member

do we need to cherry-pick to release-1.1?

@zhujian7
Copy link
Member Author

/cherrypick release-1.1

@openshift-cherrypick-robot
Copy link
Contributor

@zhujian7: once the present PR merges, I will cherry-pick it on top of release-1.1 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zhujian7
Copy link
Member Author

/unhold

@openshift-merge-bot openshift-merge-bot bot merged commit 6e81408 into open-cluster-management-io:main Oct 27, 2025
15 checks passed
@openshift-cherrypick-robot
Copy link
Contributor

@zhujian7: new pull request created: #343

In response to this:

/cherrypick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cannot support empty FeedbackRule when there are multi health check ProbeFields in 1.1.0

5 participants