-
Notifications
You must be signed in to change notification settings - Fork 48
🐛 Fix health check to support partial probe results with multi-probe configuration #342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Fix health check to support partial probe results with multi-probe configuration #342
Conversation
…configuration When multiple health check probes are configured, if the first probe returns empty feedback results, the health check was incorrectly marking the addon as Unknown and returning early, preventing other probes from being evaluated. This fix changes the logic to: - Collect all empty probe fields instead of returning immediately - Only mark the addon as Unknown if ALL probes have no results - Allow partial probe results to be considered valid - Remove redundant check for non-empty results after healthChecker Fixes open-cluster-management-io#341 Signed-off-by: Claude <[email protected]> Signed-off-by: zhujian <[email protected]>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: zhujian7 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughAggregate multi-probe results in probeAddonStatusByWorks instead of early-returning on empty probe feedback; if all probes are empty set Available=Unknown, otherwise evaluate using collected results. Tests updated to cover multi-probe partial-empty scenarios and an integration expectation change. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (3)
202-206: Minor style: local var should be lowerCamel.Rename FieldResults to fieldResults for idiomatic Go naming. Optional.
233-251: All-empty check and messaging look right; tiny UX tweak optional.Logic is sound. Consider listing up to N missing identifiers (e.g., first 3) before falling back to a count to avoid very long messages in large configs. Optional.
252-265: Guard healthChecker call when no probe fields are configured.If probeFields == 0, FieldResults stays empty and the all-empty check doesn’t trigger; healthChecker would receive an empty slice. Verify your healthChecker handles that. If not, add a small guard.
Proposed minimal guard:
@@ - if healthChecker != nil { + // Avoid calling healthChecker with no inputs when no probe fields are configured. + if len(probeFields) == 0 { + meta.SetStatusCondition(&addon.Status.Conditions, metav1.Condition{ + Type: addonapiv1alpha1.ManagedClusterAddOnConditionAvailable, + Status: metav1.ConditionUnknown, + Reason: addonapiv1alpha1.AddonAvailableReasonNoProbeResult, + Message: "No probe fields are configured for this add-on", + }) + return nil + } + if healthChecker != nil { if err := healthChecker(FieldResults, cluster, addon); err != nil {To confirm current behavior, please check utils.DeploymentAvailabilityHealthChecker and utils.WorkloadAvailabilityHealthChecker for empty-slice handling.
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (1)
1262-1329: Add a complementary all-empty multi-probe test.Recommend adding a test where all configured probes return no feedback to assert:
- Available condition Unknown
- Reason AddonAvailableReasonNoProbeResult
- Optionally, message uses the plural path.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go(2 hunks)pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (3)
vendor/open-cluster-management.io/api/work/v1/types.go (1)
ResourceIdentifier(391-411)vendor/open-cluster-management.io/api/addon/v1alpha1/types_managedclusteraddon.go (2)
ManagedClusterAddOnConditionAvailable(317-317)AddonAvailableReasonNoProbeResult(369-369)vendor/k8s.io/api/core/v1/types.go (1)
ConditionUnknown(3092-3092)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (3)
pkg/addonmanager/addontesting/helpers.go (1)
NewAddonWithConditions(90-100)vendor/open-cluster-management.io/api/work/v1/types.go (11)
ManifestWork(18-28)ManifestWorkSpec(37-57)ManifestWorkStatus(465-479)ManifestResourceStatus(483-492)ManifestCondition(530-542)ManifestResourceMeta(417-445)StatusFeedbackResult(545-551)FeedbackValue(553-565)FieldValue(569-590)Integer(596-596)WorkAvailable(503-503)vendor/open-cluster-management.io/api/addon/v1alpha1/types_managedclusteraddon.go (3)
HealthCheckModeCustomized(218-218)ManagedClusterAddOnConditionAvailable(317-317)AddonAvailableReasonProbeAvailable(377-377)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: build
- GitHub Check: e2e-hosted
- GitHub Check: unit
- GitHub Check: e2e
- GitHub Check: integration
- GitHub Check: e2e-cloudevents
- GitHub Check: verify
- GitHub Check: verify-deps
🔇 Additional comments (2)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (1)
191-201: Good fix: defer Unknown and continue scanning probes.Collecting empty probe fields and continuing instead of early-returning correctly enables partial-result handling across multi-probe configs.
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync_test.go (1)
1262-1329: Nice coverage for partial results across multi-probe.This test nails the regression: empty first probe no longer masks valid later probes.
Simplify the probe result collection logic by: - Collecting all probe results upfront, regardless of healthChecker - Checking for empty results based on FieldResults length - Removing unnecessary tracking of empty probe fields - Updating test expectations for multi-probe scenarios This makes the code cleaner and maintains the fix for supporting partial probe results with multi-probe configurations. Signed-off-by: Jia Zhu <[email protected]> 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> Signed-off-by: zhujian <[email protected]>
f7e3724 to
b2750f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (1)
192-198: Key fix: Continue checking remaining probes instead of early return.This correctly addresses the core issue. When one probe returns no results, the code now continues to check other probes rather than immediately marking the addon as Unknown.
The comment at line 195 could be slightly clearer:
- // if no results are returned. it is possible that work agent has not returned the feedback value. - // collect these fields and check if all probes are empty later + // If no results are returned, skip this probe field. It's possible the work agent + // has not returned feedback yet. We'll check if all probes are empty after the loop.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go(2 hunks)test/integration/kube/agent_deploy_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
test/integration/kube/agent_deploy_test.go (1)
vendor/k8s.io/api/core/v1/types.go (1)
ConditionFalse(3091-3091)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (4)
pkg/agent/inteface.go (1)
FieldResult(205-211)vendor/open-cluster-management.io/api/work/v1/types.go (1)
ResourceIdentifier(391-411)vendor/open-cluster-management.io/api/addon/v1alpha1/types_managedclusteraddon.go (2)
ManagedClusterAddOnConditionAvailable(317-317)AddonAvailableReasonNoProbeResult(369-369)vendor/k8s.io/api/core/v1/types.go (1)
ConditionUnknown(3092-3092)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: e2e
- GitHub Check: integration
- GitHub Check: unit
- GitHub Check: e2e-cloudevents
- GitHub Check: e2e-hosted
- GitHub Check: verify
🔇 Additional comments (5)
pkg/addonmanager/controllers/agentdeploy/healthcheck_sync.go (4)
190-190: LGTM: Variable renamed to follow Go conventions.The rename from
FieldResultstofieldResultsfollows Go naming conventions for local variables.
200-229: LGTM: Correctly accumulates probe results for aggregated health checking.The logic properly collects results from all probes that return feedback:
- Accumulates results into
fieldResultsfor later processing byhealthChecker- Maintains backward compatibility with the deprecated
healthCheckfunction- Continues to next probe when
healthCheckeris set, enabling multi-probe aggregation
231-240: Correctly handles the all-empty-probes case.This post-loop check properly implements the fix: only mark the addon as Unknown when all configured probes return no results. If at least one probe returns feedback, the code proceeds to health checking, allowing partial probe results to be considered valid.
242-255: LGTM: Supports partial probe results with clear documentation.The comments clearly explain the intent, and the
healthCheckernow correctly receives aggregated results from all probes that returned feedback. This enables the health checker to make informed decisions based on partial probe results.test/integration/kube/agent_deploy_test.go (1)
925-926: LGTM: Test expectation correctly updated to reflect new behavior.The change from
ConditionUnknowntoConditionFalseis correct. Under the new logic:
- The deployment probe returns results (albeit with incorrect feedback field "ReplicasTest")
- The daemonset probe returns no results
- Since at least one probe has results, the code no longer sets
Unknownearly- The
healthCheckeris invoked with the deployment's results- The health checker fails because the deployment feedback is invalid and the daemonset is not probed
- Result:
Available=Falseinstead ofUnknownThis properly validates the fix for partial probe results.
|
/hold |
|
feel free to help review this. cc @tesshuflower |
|
/lgtm I've also done some tests in my env with this fix and it's working well - Thanks @zhujian7 and @zhiweiyin318 for such a quick fix! |
|
@tesshuflower: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
do we need to cherry-pick to release-1.1? |
|
/cherrypick release-1.1 |
|
@zhujian7: once the present PR merges, I will cherry-pick it on top of In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/unhold |
6e81408
into
open-cluster-management-io:main
|
@zhujian7: new pull request created: #343 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Summary
This PR fixes issue #341 where the health check incorrectly marks addons as Unknown when using multiple health check probes and the first probe returns empty feedback results.
Problem:
When multiple health check ProbeFields are configured (e.g., checking both deployment and OLM), if the first probe returns empty feedback results, the code immediately returns and sets the addon status to Unknown. This prevents the evaluation of subsequent probes that may have valid results.
Solution:
Changes:
emptyProbeFieldsslice to track probes with no resultscontinuewhen a probe has no resultsRelated issue(s)
Fixes #341
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests
Integration