-
Notifications
You must be signed in to change notification settings - Fork 43
Proposal for Dynamic Scoring Framework Addon #165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: KA-Takeuchi The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
please run to signoff the commit. |
| - Perform health checks on registered APIs and disable those that are not functioning properly. | ||
| - Manage API configurations (e.g., query range, frequency). | ||
| - In some cases, the API provider may want to enforce specific configurations; the framework should support applying these settings. | ||
| - In other cases, the PF operator may wish to set configurations according to their own requirements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does PF operator stand for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added clarification.
enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md
Show resolved
Hide resolved
|
|
||
| ### Non-Goals | ||
|
|
||
| Implementation of the evaluation logic itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some examples or library would make it easier to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added clarification.
|
|
||
| ##### Scoring API Schema Propagation | ||
|
|
||
| An example PromQL query to configure in the Scoring API source is shown below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a mechanism to surface the misconfiguration of query? That is useful since such query could fail and the user needs to know the reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added error notification in Dynamic Scoring Framework Usage Flow Diagram.
enhancements/sig-architecture/166-dynamic-scoring-framework-addon/README.md
Show resolved
Hide resolved
|
|
||
| 1. it must have ```/healthz``` endpoint | ||
| 2. it must have ```/scoring``` endpoint (schema follows above section) | ||
| 3. it must have ```/config``` endpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it implies user can either config through CR or use this endpoint? But this will leave some questions:
- how to handle conflict between CR based configuration and this endpoint?
- how this endpoint be authenticated and authzed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added clarification. I think it’s fine for the /config endpoint to be publicly accessible, so I don’t see a need for token protection. What’s your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yeah it makes sense.
|
|
||
| ### Alternative 2: No use prometheus-compatible component | ||
|
|
||
| - Agents collect metrics directly from Kubernetes API or other sources (e.g. OpenTelemetry). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put this as an valid option, since not every cluster has prometheus deployed.
Being able to support different metric sources seems like good item for beta phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your opinion. On the other hand, if we try to use OpenTelemetry directly, the Agent would need to include a mechanism to store the data, right?
Do you have any suggestions on what to use for data store?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it depends. If only transient metrics are used, persistence is not necessary. I think we can also configure persistence for otel collector. There are certainly some limitation of using otel, but it is much simpler and more lightweight than prometheus.
| ```yaml | ||
| spec: | ||
| mask: | ||
| - clusterName: "cluster1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this to define "Do not generate this score in this cluster"? With that shouldn't the cluster have a very low score so scheduler will not pick it? Why this need to be specifically configured? I think agent can find that this metrics does not exist and just generate a low default score?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added clarification. The mask is meant to reduce unnecessary access to the Scoring API. Make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hrm, it is not clear to me how would user decide that a certain score should not be reported. It seems like the user needs to at first find out that there is no certain metrics on a cluster then decide whether the score needs to be masked? But I agree this is useful in some cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. From my perspective, I was assuming a scenario where the user already owns on-premise servers and has a prior understanding of the cluster’s hardware specifications, so that’s why I was considering this kind of feature.
In a cloud environment, the needs might be a bit different.
|
finish the 1st pass, I think the the content is pretty good. |
|
|
||
| 6. **Visualize and utilize scores with Prometheus/Grafana/ResourceArrangement, etc.** | ||
|
|
||
| ```plantuml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitHub does not offer native, built-in rendering of PlantUML diagrams. Would it be possible to replace this with the Mermaid format? It will be easier for others to read if they are interested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
|
||
| A CR (Custom Resource) for registering scoring APIs. The fields are defined as follows: | ||
|
|
||
| ```plantuml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also wondering if the data structures can be directly represented using Golang? For example: https://github.com/open-cluster-management-io/enhancements/blob/main/enhancements/sig-architecture/32-extensiblescheduling/32-extensiblescheduling.md#addonplacementscore-api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| @enduml | ||
| ``` | ||
|
|
||
| ScorerSummary is a flattened summary of DynamicScorer CR information and is distributed to managed clusters as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ScorerSummary be distributed directly to managed clusters or wrapped inside a configmap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's wrapped inside a configmap.
Signed-off-by: Kazuma Takeuchi <[email protected]>
Removed superseded-by section from metadata. Signed-off-by: Kazuma Takeuchi <[email protected]>
Signed-off-by: Kazuma Takeuchi <[email protected]>
Signed-off-by: Kazuma Takeuchi <[email protected]>
|
|
||
| Implementation of the evaluation logic itself. | ||
| The evaluation logic is expected to be implemented outside the Dynamic Scoring Framework. | ||
| (But hte interfaces for Scoring APIs are defined in the framework.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hte -> the
| spec: | ||
| description: A simple prediction scorer for time series data | ||
| scoreDestination: AddOnPlacementScore | ||
| scoreDimensionFormat: "${node}-${namespace}-${pod}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this be a slice? since you can put multiple scores in addonPlacementScore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some notes about the behavior when there are multiple dimensions. Does this match your intention?
|
|
||
| 1. it must have ```/healthz``` endpoint | ||
| 2. it must have ```/scoring``` endpoint (schema follows above section) | ||
| 3. it must have ```/config``` endpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yeah it makes sense.
| ```yaml | ||
| spec: | ||
| mask: | ||
| - clusterName: "cluster1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hrm, it is not clear to me how would user decide that a certain score should not be reported. It seems like the user needs to at first find out that there is no certain metrics on a cluster then decide whether the score needs to be masked? But I agree this is useful in some cases
|
|
||
| ### Alternative 2: No use prometheus-compatible component | ||
|
|
||
| - Agents collect metrics directly from Kubernetes API or other sources (e.g. OpenTelemetry). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it depends. If only transient metrics are used, persistence is not necessary. I think we can also configure persistence for otel collector. There are certainly some limitation of using otel, but it is much simpler and more lightweight than prometheus.
Signed-off-by: Kazuma Takeuchi <[email protected]>
Signed-off-by: Kazuma Takeuchi <[email protected]>
|
|
||
| #### DynamicScoringConfig Definition Details | ||
|
|
||
| DynamicScoringConfig is a CR that aggregates the current information of registered DynamicScorers and distributes it to managed clusters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be more explanation here regarding DynamicScoringConfig's structure and example?
|
|
||
| A CR (Custom Resource) for registering scoring APIs. The fields are defined as follows: | ||
|
|
||
| ```go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to specify the required and optional fields here for clarity.
| @@ -0,0 +1,14 @@ | |||
| title: dynamic-scoring-framework-addon | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: pls rename the 83 to 166 as 166 as the related issue number.
Signed-off-by: Kazuma Takeuchi <[email protected]>
This is an enhancement proposal for Dynamic Scoring Framework addon.
Please review this PR when you have a chance.