Skip to content

Conversation

@KA-Takeuchi
Copy link

This is an enhancement proposal for Dynamic Scoring Framework addon.
Please review this PR when you have a chance.

@openshift-ci openshift-ci bot requested review from deads2k and qiujian16 November 17, 2025 06:01
@openshift-ci
Copy link

openshift-ci bot commented Nov 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KA-Takeuchi
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@KA-Takeuchi KA-Takeuchi changed the title Proposal for Dynamic Scoring Framework Adon Proposal for Dynamic Scoring Framework Addon Nov 17, 2025
@qiujian16
Copy link
Member

please run

git commit --amend -s
git push -f

to signoff the commit.

- Perform health checks on registered APIs and disable those that are not functioning properly.
- Manage API configurations (e.g., query range, frequency).
- In some cases, the API provider may want to enforce specific configurations; the framework should support applying these settings.
- In other cases, the PF operator may wish to set configurations according to their own requirements.
Copy link
Member

@qiujian16 qiujian16 Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does PF operator stand for?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification.


### Non-Goals

Implementation of the evaluation logic itself.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some examples or library would make it easier to use.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification.


##### Scoring API Schema Propagation

An example PromQL query to configure in the Scoring API source is shown below.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a mechanism to surface the misconfiguration of query? That is useful since such query could fail and the user needs to know the reason.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added error notification in Dynamic Scoring Framework Usage Flow Diagram.


1. it must have ```/healthz``` endpoint
2. it must have ```/scoring``` endpoint (schema follows above section)
3. it must have ```/config``` endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it implies user can either config through CR or use this endpoint? But this will leave some questions:

  1. how to handle conflict between CR based configuration and this endpoint?
  2. how this endpoint be authenticated and authzed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification. I think it’s fine for the /config endpoint to be publicly accessible, so I don’t see a need for token protection. What’s your opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yeah it makes sense.


### Alternative 2: No use prometheus-compatible component

- Agents collect metrics directly from Kubernetes API or other sources (e.g. OpenTelemetry).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put this as an valid option, since not every cluster has prometheus deployed.

Being able to support different metric sources seems like good item for beta phase.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your opinion. On the other hand, if we try to use OpenTelemetry directly, the Agent would need to include a mechanism to store the data, right?
Do you have any suggestions on what to use for data store?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends. If only transient metrics are used, persistence is not necessary. I think we can also configure persistence for otel collector. There are certainly some limitation of using otel, but it is much simpler and more lightweight than prometheus.

```yaml
spec:
mask:
- clusterName: "cluster1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this to define "Do not generate this score in this cluster"? With that shouldn't the cluster have a very low score so scheduler will not pick it? Why this need to be specifically configured? I think agent can find that this metrics does not exist and just generate a low default score?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification. The mask is meant to reduce unnecessary access to the Scoring API. Make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrm, it is not clear to me how would user decide that a certain score should not be reported. It seems like the user needs to at first find out that there is no certain metrics on a cluster then decide whether the score needs to be masked? But I agree this is useful in some cases

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. From my perspective, I was assuming a scenario where the user already owns on-premise servers and has a prior understanding of the cluster’s hardware specifications, so that’s why I was considering this kind of feature.
In a cloud environment, the needs might be a bit different.

@qiujian16
Copy link
Member

finish the 1st pass, I think the the content is pretty good.


6. **Visualize and utilize scores with Prometheus/Grafana/ResourceArrangement, etc.**

```plantuml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub does not offer native, built-in rendering of PlantUML diagrams. Would it be possible to replace this with the Mermaid format? It will be easier for others to read if they are interested.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


A CR (Custom Resource) for registering scoring APIs. The fields are defined as follows:

```plantuml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@enduml
```

ScorerSummary is a flattened summary of DynamicScorer CR information and is distributed to managed clusters as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ScorerSummary be distributed directly to managed clusters or wrapped inside a configmap?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's wrapped inside a configmap.

KA-Takeuchi and others added 2 commits November 20, 2025 15:39
Removed superseded-by section from metadata.

Signed-off-by: Kazuma Takeuchi <[email protected]>

Implementation of the evaluation logic itself.
The evaluation logic is expected to be implemented outside the Dynamic Scoring Framework.
(But hte interfaces for Scoring APIs are defined in the framework.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hte -> the

spec:
description: A simple prediction scorer for time series data
scoreDestination: AddOnPlacementScore
scoreDimensionFormat: "${node}-${namespace}-${pod}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be a slice? since you can put multiple scores in addonPlacementScore

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some notes about the behavior when there are multiple dimensions. Does this match your intention?


1. it must have ```/healthz``` endpoint
2. it must have ```/scoring``` endpoint (schema follows above section)
3. it must have ```/config``` endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yeah it makes sense.

```yaml
spec:
mask:
- clusterName: "cluster1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrm, it is not clear to me how would user decide that a certain score should not be reported. It seems like the user needs to at first find out that there is no certain metrics on a cluster then decide whether the score needs to be masked? But I agree this is useful in some cases


### Alternative 2: No use prometheus-compatible component

- Agents collect metrics directly from Kubernetes API or other sources (e.g. OpenTelemetry).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends. If only transient metrics are used, persistence is not necessary. I think we can also configure persistence for otel collector. There are certainly some limitation of using otel, but it is much simpler and more lightweight than prometheus.


#### DynamicScoringConfig Definition Details

DynamicScoringConfig is a CR that aggregates the current information of registered DynamicScorers and distributes it to managed clusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be more explanation here regarding DynamicScoringConfig's structure and example?


A CR (Custom Resource) for registering scoring APIs. The fields are defined as follows:

```go
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to specify the required and optional fields here for clarity.

@@ -0,0 +1,14 @@
title: dynamic-scoring-framework-addon
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: pls rename the 83 to 166 as 166 as the related issue number.

Signed-off-by: Kazuma Takeuchi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants