Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
188 changes: 188 additions & 0 deletions enhancements/sig-architecture/179-work-feedback-scrape-type/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,188 @@
## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary
Currently, the ManifestWork controller periodically polls (every 60s) resource
statuses in managed clusters to populate statusFeedback values. This enhancement
introduces a new field, `feedbackScrapeType` to ManifestWorkReplicaSet CRD allowing
users to specify whether status feedback should be collected via polling or
watch-based mechanisms.

The MWRS controller should also be updated to explicitly watch and reconcile on changes
in both ManifestWork.status.conditions and ManifestWork.status.feedbacks, when
`feedbackScrapeType` == `WATCH` ensuring that feedback updates are properly aggregated
into MWRS status in real time.

This feature enhances responsiveness and efficiency by enabling near-real-time status
feedback updates for selected resources while preserving backward compatibility and
operator control.

## Motivation

Today, the ManifestWork controller on the managed cluster periodically polls resources
every 60 seconds to collect status feedback defined in manifestConfigs. While this
polling mechanism ensures eventual consistency, it introduces unnecessary latency for
resources that change frequently and adds overhead from repeated list-and-get operations,
even when no changes occur. This limits OCM’s ability to provide near-real-time
visibility of resource states and can delay downstream reconciliation actions or
status reporting in large environments.

By introducing a new field, feedbackScrapeType, OCM allows users and platform operators
to choose how feedback is collected—either through the existing periodic polling model
(POLL) or an event-driven watch model (WATCH). With WATCH, the managed cluster agent can
immediately react to changes in specified resources, updating the corresponding
ManifestWork and ensuring faster and more efficient feedback propagation to the hub.
This reduces the need for fixed polling cadences, minimizes API load, and improves
responsiveness for high-value workloads that require timely status updates.

For the ManifestWorkReplicaSet (MWRS) controller, this change ensures it accurately
reflects aggregated feedback in real time, responding to both condition and feedback
updates from its associated ManifestWork objects. This makes the OCM feedback loop
more dynamic, scalable, and adaptive to workload characteristics.

### Goals

* Provide configurable feedback collection modes (POLL or WATCH) per resource in ManifestWork.

* Ensure MWRS correctly aggregates and reflects feedback updates from both polling and watch-based sources.

* Maintain full backward compatibility for existing users and automation.

### Non-Goals

* Changing the core aggregation semantics in MWRS beyond supporting the new event-driven updates.

## Proposal

This is where we get down to the nitty gritty of what the proposal actually is.

### User Stories

Detail the things that people will be able to do if this is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.

#### Story 1

User A uses MWRS to deploy a rollout to multiple cluster. The user wants
to track the number of pod replicas that contain the change as it is rolled out to the cluster on the MWRS level.
To do this, the user must get the replica count as the change is rolling out, if it relies on the existing mechanism, it is likely that the change will be missed.

For example, if the rollout happens every 10 seconds, then User A will not be able to track the per replica count. And only get the count when the rollout is completed.

### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation? What are some important details that
didn't come across above. Go in to as much detail as necessary here. This might
be a good place to talk about core concepts and how they relate.

### Risks and Mitigation

#### Risk #1: Too many watch events and update calls to the ManifestWork.

This will be mitigated by limiting when the ManifestWork controller sends updates to the ManifestWork and to stop processing WATCH events once the ManifestWork is "Ready".

ManifestWork Controller will parse the response from the informer watch and only send an update to the ManifestWork if there is an update on the feedbackRules.

ManifestWork Controller that registers informers per resource, will unregister the informer once the ManifestWork is Ready.


## Design Details
### ManifestWorkReplicaSet CRD changes
```yaml
apiVersion: work.open-cluster-management.io/v1alpha1
kind: ManifestWorkReplicaSet
...
spec:
workload: ...
manifestConfigs:
- resourceIdentifier:
group: apps
resource: deployments
namespace: default
name: hello
feedbackScrapeType: WATCH (Options: WATCH | POLL, POLL is default behavior)
....
```

#### ManifestWork CRD changes
```yaml
apiVersion: work.open-cluster-management.io/v1alpha1
kind: ManifestWork
...
spec:
manifestConfigs:
- resourceIdentifier:
group: apps
resource: deployments
namespace: default
name: hello
feedbackScrapeType: WATCH (Options: WATCH | POLL, POLL is default behavior)
....
```

This change will also be populated upwards to the ManifestWorkReplicaSet level of
`spec.manifestWorkTemplate.manifestConfigs`.

```mermaid
sequenceDiagram
participant A as Resources managed by OCM
participant B as ManifestWorks
participant C as ManifestWorkReplicaSet

B->>A: (POLL) every 60s
A->>B: feedbackRules & conditionRule Update
B->>C: Watch MW, Placement, observedGeneration updates
C->>C: Update status.Conditions

opt if feedbackScrapeType == WATCH
A->>B: (WATCH) event for resource changes
B->>B: feedbackRules & conditionRule Updates (if there is any update)
B->>C: (WATCH) for feedbackRule updates only (if Progressing = True & Degraded = False|not set)
C->>C: Update status.Conditions
end
```

#### Spoke Level ManifestWork Controller Reconciler

Default: resources continue to be polled on the existing interval.
If a resource is marked WATCH, the controller registers it for watch-based updates:
Immediate status/feedback recomputation on events.
Poll remains as a safety net.

Create a FeedbackScrape interface that polls or watches resources.

In the ManifestWork Controller (pkg/work/spoke/controllers/statuscontroller/availablestatus_controller.go)
introduce a watch-based path alongside the existing poll loop.

When syncing the ManifestWork, register a informer for the resource if `feedbackScrapeType` WATCH. If the `feedbackScrapeType` is no longer WATCH, unregister the informer. If the ManifestWork is in Ready state (Progressing == False and Degraded == False / NotSet), unregister the informer. This will prevent long running WATCH from triggering too many ManifestWork changes. If there is new rollout, and the status changes to Progressing, then register the informer again.
Copy link
Member

@zhujian7 zhujian7 Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain more details about the register/unregister informer part? I have a concern that in some cases, even with this watch informer, we might still not observe the whole status change process of the resource, since it may take several seconds to wait for the informer cache to sync when starting the informer. For example, if the replica number changes from 1->2->3 in 5 seconds, and the waiting for the deployment informer cache sync process takes 6 seconds, the result we see will still be 1->3.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking of the situation on start up? Or just in general when the informer is running? My expectation is that the informer should see every change (which I think is the guarantee of the informer) and that the ManifestWork feedbackStatus will get updated with that change. And when the ManifestWork statusFeedback updates, we will populate this upwards to the MWRS controller.

From my understanding, which step will be missed above since it takes several seconds to wait for informer cache to sync?

The informer will be registered when a new rollout is started with a new MWRS.spec and observedGeneration for the ManifestWork. The informer will never get unregistered unless the feedbackScrapeType is removed from that resource. However, once the MWRS is marked as Ready, the informer will stop sending updates on add/update/delete to the statusFeedback to prevent triggering resync on the MWRS controller level.

Copy link
Member

@zhujian7 zhujian7 Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking of the situation every time the register starts up. Let us first leave the manifestworkreplicaset alone and focus on the manifestwork. And let us say there is a manifestwork:

apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  namespace: <target managed cluster>
  name: hello-work-demo
spec:
  manifestConfigs:
    - resourceIdentifier:
        group: apps
        resource: deployments
        namespace: default
        name: hello
      feedbackScrapeType: WATCH
  workload:
    manifests:
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: hello
          namespace: default
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: hello
          template:
            metadata:
              labels:
                app: hello
            spec:
              containers:
                - name: hello
                  image: quay.io/asmacdo/busybox
                  command:
                    ["sh", "-c", 'echo "Hello, Kubernetes!" && sleep 3600']

After this is applied on the hub, will we register a deployment informer for this? And after the manifestwork becomes Applied true, Available true, will we unregister the informer?

And then if we change the replica of the deployment in the manifestwork from 1 to 3, will we re-register a deployment informer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will register informer when the resource is created, the informer will keep on running until the resource is deleted. Its conditions will not impact informer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so the above description about unregister should be updated?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ManifestWork is in Ready state (Progressing == False and Degraded == False / NotSet), unregister the informer.

I think this should be updated.


When the informer recieves an event (create, update, delete), verify that there is a change on the fields defined in the `feedbackRules` and patch the status conditions for that resource. If there is no change on one of the `feedbackRules` field, ignore the WATCH event.

#### Hub Level ManifestWorkReplicaSet Controller Reconciler

There should be no updates on the MWRS reconciler. The expected behavior is that when the ManifestWork has an update in the status feedback, trigger a resync of the MWRS to get the status feedback details.

In the current model, the MWRS controller is reconciled if there are changes to the ManifestWork (that are managed by manifestworkreplicaset), PlacementDecisions and Placements. Therefore any update to the status.Feedback on the ManifestWork should trigger the reconcile. Standard factory used [here](https://github.com/open-cluster-management-io/ocm/blob/main/pkg/work/hub/manager.go#L97-L98)


### Open Questions [optional]

TBD

### Upgrade / Downgrade Strategy

With the `POLL` flag or not set `feedbackScrapeType`, controllers will fallback to the previous behavior.
## Alternatives

TBD

## Infrastructure Needed [optional]

TBD
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
title: neat-enhancement-idea
authors:
- "@annelaucg"
- "@youngbupark"
reviewers:
- "TBD"
approvers:
- "TBD"
creation-date: 2025-10-22
last-updated: 2025-10-22
status: provisional
Loading