Proposal for Dynamic Scoring Framework Addon #165

KA-Takeuchi · 2025-11-17T06:01:20Z

This is an enhancement proposal for the Dynamic Scoring Framework addon.
Please review this PR when you have a chance.

openshift-ci · 2025-11-17T06:01:25Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KA-Takeuchi
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

qiujian16 · 2025-11-19T08:09:27Z

Please run

git commit --amend -s
git push -f

to sign off the commit.

qiujian16 · 2025-11-17T07:30:49Z

enhancements/sig-architecture/83-dynamic-scoring-framework-addon/README.md

+  - Perform health checks on registered APIs and disable those that are not functioning properly.
+  - Manage API configurations (e.g., query range, frequency).
+    - In some cases, the API provider may want to enforce specific configurations; the framework should support applying these settings.
+    - In other cases, the PF operator may wish to set configurations according to their own requirements.


What does PF operator stand for?

Added clarification.
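
The health-check bullet in the quoted text could be sketched roughly as follows. This is a minimal illustration, not the addon's implementation: `healthyEndpoints`, `stubTransport`, `stubClient`, and the `*.example` hosts are hypothetical names, and the sketch assumes a registered API counts as healthy only when `GET /healthz` returns 200.

```go
package main

import (
	"fmt"
	"net/http"
)

// healthyEndpoints returns the base URLs whose /healthz answers 200 OK.
// Anything else (transport error, non-200 status) is treated as unhealthy.
func healthyEndpoints(client *http.Client, baseURLs []string) []string {
	var healthy []string
	for _, base := range baseURLs {
		resp, err := client.Get(base + "/healthz")
		if err != nil {
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			healthy = append(healthy, base)
		}
	}
	return healthy
}

// stubTransport answers requests in memory (host -> status code) so the
// sketch runs without any network access. Hypothetical helper.
type stubTransport map[string]int

func (s stubTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	code, ok := s[r.URL.Host]
	if !ok {
		return nil, fmt.Errorf("no such host: %s", r.URL.Host)
	}
	return &http.Response{StatusCode: code, Body: http.NoBody, Header: http.Header{}, Request: r}, nil
}

// stubClient builds an http.Client backed by stubTransport.
func stubClient(statuses map[string]int) *http.Client {
	return &http.Client{Transport: stubTransport(statuses)}
}

func main() {
	client := stubClient(map[string]int{"scorer-a.example": 200, "scorer-b.example": 503})
	urls := []string{"http://scorer-a.example", "http://scorer-b.example", "http://scorer-c.example"}
	// Only scorer-a passes; scorer-b returns 503 and scorer-c is unreachable.
	fmt.Println(healthyEndpoints(client, urls))
}
```

A controller could run such a check periodically and stop distributing any scorer that drops out of the returned list; how the real framework records the disabled state is not specified here.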


qiujian16 · 2025-11-19T08:11:14Z

enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md

+
+### Non-Goals
+
+Implementation of the evaluation logic itself.


Some examples or a library would make this easier to use.

Added clarification.

qiujian16 · 2025-11-19T08:13:51Z

enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md

+
+##### Scoring API Schema Propagation
+
+An example PromQL query to configure in the Scoring API source is shown below.


Is there a mechanism to surface a misconfigured query? That would be useful, since such a query could fail and the user needs to know the reason.

Added error notification to the Dynamic Scoring Framework Usage Flow Diagram.


qiujian16 · 2025-11-19T08:20:47Z

enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md

+
+1. it must have ```/healthz``` endpoint
+2. it must have ```/scoring``` endpoint (schema  follows above section)
+3. it must have ```/config``` endpoint


I think this implies the user can configure either through the CR or through this endpoint? That leaves some questions:

How should conflicts between CR-based configuration and this endpoint be handled?

How will this endpoint be authenticated and authorized?

Added clarification. I think it’s fine for the /config endpoint to be publicly accessible, so I don’t see a need for token protection. What’s your opinion?

I see, yeah it makes sense.
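
For readers following this thread, a Scoring API that satisfies the three endpoint requirements might look roughly like the sketch below. It is only an illustration: the JSON payloads, `newScoringMux`, and `statusOf` are hypothetical, and the actual `/scoring` and `/config` schemas are the ones defined in the proposal, not the placeholders used here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// writeJSON encodes v as a JSON response body.
func writeJSON(w http.ResponseWriter, v any) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(v)
}

// newScoringMux registers the three required endpoints.
// All response bodies are placeholders, not the proposal's schema.
func newScoringMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/scoring", func(w http.ResponseWriter, r *http.Request) {
		writeJSON(w, map[string]any{"score": 0}) // hypothetical payload
	})
	mux.HandleFunc("/config", func(w http.ResponseWriter, r *http.Request) {
		writeJSON(w, map[string]any{"name": "example-scorer"}) // hypothetical payload
	})
	return mux
}

// statusOf sends an in-memory GET request to h and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newScoringMux()
	for _, p := range []string{"/healthz", "/scoring", "/config"} {
		fmt.Println(p, statusOf(mux, p))
	}
}
```

Because `/config` is read-only in this sketch, leaving it unauthenticated (as agreed above) exposes only the provider's preferred settings.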

qiujian16 · 2025-11-19T08:24:09Z

enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md

+
+### Alternative 2: No use prometheus-compatible component
+
+- Agents collect metrics directly from Kubernetes API or other sources (e.g. OpenTelemetry).


I would treat this as a valid option, since not every cluster has Prometheus deployed.

Supporting different metric sources seems like a good item for the beta phase.

I agree with your opinion. On the other hand, if we try to use OpenTelemetry directly, the Agent would need to include a mechanism to store the data, right?
Do you have any suggestions on what to use as the data store?

I think it depends. If only transient metrics are used, persistence is not necessary. I think we can also configure persistence for the OTel Collector. There are certainly some limitations to using OTel, but it is much simpler and more lightweight than Prometheus.

qiujian16 · 2025-11-19T08:28:25Z

enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md

+```yaml
+spec:
+  mask:
+    - clusterName: "cluster1"


Is this to define "do not generate this score in this cluster"? In that case, shouldn't the cluster just have a very low score so the scheduler will not pick it? Why does this need to be configured explicitly? I think the agent can detect that the metric does not exist and generate a low default score.

Added clarification. The mask is meant to reduce unnecessary access to the Scoring API. Does that make sense?

Hmm, it is not clear to me how a user would decide that a certain score should not be reported. It seems the user first needs to find out that a certain metric is missing on a cluster, and then decide whether the score needs to be masked? But I agree this is useful in some cases.

Understood. From my perspective, I was assuming a scenario where the user already owns on-premise servers and has a prior understanding of the cluster’s hardware specifications, so that’s why I was considering this kind of feature.
In a cloud environment, the needs might be a bit different.
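
The masking behavior discussed in this thread can be sketched as a simple filter. This assumes the mask is applied before any Scoring API call is made, which is how the author describes its purpose; `MaskEntry` and `unmaskedClusters` are hypothetical names, and only `clusterName` is taken from the quoted YAML.

```go
package main

import "fmt"

// MaskEntry mirrors one item under spec.mask. Only clusterName appears in
// the quoted YAML; any further fields are unknown here.
type MaskEntry struct {
	ClusterName string
}

// unmaskedClusters returns the clusters that are not listed in the mask,
// preserving input order.
func unmaskedClusters(clusters []string, mask []MaskEntry) []string {
	masked := make(map[string]struct{}, len(mask))
	for _, m := range mask {
		masked[m.ClusterName] = struct{}{}
	}
	var out []string
	for _, c := range clusters {
		if _, skip := masked[c]; !skip {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	mask := []MaskEntry{{ClusterName: "cluster1"}}
	// cluster1 is masked, so no Scoring API request would be issued for it.
	fmt.Println(unmaskedClusters([]string{"cluster1", "cluster2", "cluster3"}, mask))
}
```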

qiujian16 · 2025-11-19T08:30:19Z

Finished the first pass. I think the content is pretty good.

haoqing0110 · 2025-11-19T09:18:39Z

enhancements/sig-architecture/83-dynamic-scoring-framework-addon/README.md

+
+6. **Visualize and utilize scores with Prometheus/Grafana/ResourceArrangement, etc.**
+
+```plantuml


GitHub does not offer native, built-in rendering of PlantUML diagrams. Would it be possible to replace this with the Mermaid format? It will be easier for others to read if they are interested.

haoqing0110 · 2025-11-19T09:21:50Z

enhancements/sig-architecture/83-dynamic-scoring-framework-addon/README.md

+
+A CR (Custom Resource) for registering scoring APIs. The fields are defined as follows:
+
+```plantuml


Also wondering if the data structures can be directly represented using Golang? For example: https://github.com/open-cluster-management-io/enhancements/blob/main/enhancements/sig-architecture/32-extensiblescheduling/32-extensiblescheduling.md#addonplacementscore-api

haoqing0110 · 2025-11-19T09:27:02Z

enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md

+@enduml
+```
+
+ScorerSummary is a flattened summary of DynamicScorer CR information and is distributed to managed clusters as follows:


Is ScorerSummary distributed directly to managed clusters, or wrapped inside a ConfigMap?

It's wrapped inside a ConfigMap.


qiujian16 · 2025-11-20T10:09:10Z

enhancements/sig-architecture/83-dynamic-scoring-framework-addon/README.md

+
+Implementation of the evaluation logic itself.
+The evaluation logic is expected to be implemented outside the Dynamic Scoring Framework.
+(But the interfaces for Scoring APIs are defined in the framework.)


qiujian16 · 2025-11-20T10:11:51Z

enhancements/sig-architecture/83-dynamic-scoring-framework-addon/README.md

+spec:
+  description: A simple prediction scorer for time series data
+  scoreDestination: AddOnPlacementScore
+  scoreDimensionFormat: "${node}-${namespace}-${pod}"


Could this be a slice, since you can put multiple scores in an AddOnPlacementScore?

I added some notes about the behavior when there are multiple dimensions. Does this match your intention?
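
As an illustration of how a `scoreDimensionFormat` such as `"${node}-${namespace}-${pod}"` could be expanded, here is a minimal sketch. It assumes each `${...}` placeholder names a label on the returned sample, which is an interpretation rather than something this thread confirms; `expandDimension` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
)

// expandDimension fills ${label} placeholders from a sample's label set.
// Labels missing from the map expand to an empty string.
func expandDimension(format string, labels map[string]string) string {
	return os.Expand(format, func(key string) string {
		return labels[key]
	})
}

func main() {
	labels := map[string]string{"node": "node1", "namespace": "default", "pod": "web-0"}
	fmt.Println(expandDimension("${node}-${namespace}-${pod}", labels)) // prints "node1-default-web-0"
}
```

With several samples, each distinct label combination yields its own dimension name, which is one way multiple scores could end up in a single AddOnPlacementScore.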


haoqing0110 · 2025-11-27T04:08:37Z

enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md

+
+#### DynamicScoringConfig Definition Details
+
+DynamicScoringConfig is a CR that aggregates the current information of registered DynamicScorers and distributes it to managed clusters.


Should there be more explanation here of DynamicScoringConfig's structure, along with an example?

haoqing0110 · 2025-11-27T04:21:12Z

enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md

+
+A CR (Custom Resource) for registering scoring APIs. The fields are defined as follows:
+
+```go


It would be helpful to specify the required and optional fields here for clarity.

qiujian16 · 2025-11-27T02:57:26Z

enhancements/sig-architecture/166-dynamic-scoring-framework-addon/metadata.yaml

@@ -0,0 +1,14 @@
+title: dynamic-scoring-framework-addon


nit: please rename 83 to 166, since 166 is the related issue number.


openshift-ci bot requested review from deads2k and qiujian16 November 17, 2025 06:01

KA-Takeuchi changed the title ~~Proposal for Dynamic Scoring Framework Adon~~ Proposal for Dynamic Scoring Framework Addon Nov 17, 2025

qiujian16 reviewed Nov 19, 2025


haoqing0110 reviewed Nov 19, 2025


KA-Takeuchi and others added 2 commits November 20, 2025 15:39

add files for Dynamic Scoring Framework addon

456fe08

Signed-off-by: Kazuma Takeuchi <[email protected]>

Remove superseded-by section from metadata.yaml

e19c103

Removed superseded-by section from metadata. Signed-off-by: Kazuma Takeuchi <[email protected]>

KA-Takeuchi force-pushed the main branch from 937be17 to e19c103 Compare November 20, 2025 06:41

KA-Takeuchi added 2 commits November 20, 2025 17:02

Address initial comment regarding the Dynamic Scoring Framework Addon

e935451

Signed-off-by: Kazuma Takeuchi <[email protected]>

Address 2nd comment regarding the Dynamic Scoring Framework Addon

5842055

Signed-off-by: Kazuma Takeuchi <[email protected]>

qiujian16 reviewed Nov 20, 2025


KA-Takeuchi added 2 commits November 21, 2025 13:00

add MCP server description

b2851c5

Signed-off-by: Kazuma Takeuchi <[email protected]>

add scoreDimensionFormat description

ff286b8

Signed-off-by: Kazuma Takeuchi <[email protected]>

haoqing0110 reviewed Nov 27, 2025


qiujian16 reviewed Nov 27, 2025


fix title number

0792472

Signed-off-by: Kazuma Takeuchi <[email protected]>

