feat: add NVIDIA Resourcequota #1359
Conversation
Summary of Changes
Hello @FouoF, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant feature by adding resource quota support for NVIDIA GPUs. It enables administrators to define and enforce limits on GPU memory and core usage at the namespace level, for both individual and multiple GPU requests. The changes involve a new, centralized quota management system, integration of quota checks directly into the device allocation logic, and updates to the scheduler to dynamically track and respond to Kubernetes ResourceQuota objects, ensuring efficient and fair resource distribution.
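For orientation, a namespace-level GPU quota of the kind described above would be expressed as a standard Kubernetes ResourceQuota object. The sketch below is only illustrative: the resource names limits.nvidia.com/gpumem and limits.nvidia.com/gpucores, the namespace, and the values are assumptions, not necessarily the keys this PR actually consumes.

```go
package quotaexample

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleGPUQuota builds a ResourceQuota that caps GPU memory and GPU core
// share for everything scheduled into the "team-a" namespace.
// NOTE: both resource names below are illustrative assumptions.
func exampleGPUQuota() *corev1.ResourceQuota {
	return &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-quota", Namespace: "team-a"},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				// Total GPU memory (MiB) that pods in this namespace may request.
				corev1.ResourceName("limits.nvidia.com/gpumem"): resource.MustParse("8192"),
				// Total GPU core percentage that pods in this namespace may request.
				corev1.ResourceName("limits.nvidia.com/gpucores"): resource.MustParse("200"),
			},
		},
	}
}
```

Applied to a namespace, an object like this is what the scheduler-side quota tracking described in the summary would watch and enforce.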
Code Review
This pull request introduces resource quota support, primarily targeting NVIDIA GPUs. It adds a QuotaManager to track usage and limits for GPU memory and cores, and integrates this with the scheduler. The Devices interface is updated to support this, though some parts of the implementation are specific to NVIDIA.
My review has identified a few critical issues, including a hardcoded check for NVIDIA devices that prevents the feature from being generic, and a deadlock in the pod deletion logic. I've also pointed out several medium-severity issues related to thread safety, potential panics, and code robustness in the new quota management logic. Addressing these points will improve the correctness and reliability of this new feature.
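For readers following the review, here is a rough sketch of the surface a QuotaManager like the one described could expose, based on the FitQuota and AddUsage methods discussed later in this thread. The struct fields, signatures, and resource keys are assumptions for illustration, not the PR's actual code.

```go
package quota

import "sync"

// QuotaManager is a guess at the shape of the new component: per-namespace
// limits mirrored from ResourceQuota objects, and per-namespace usage counters.
type QuotaManager struct {
	mu     sync.RWMutex
	limits map[string]map[string]int64 // namespace -> resource -> hard limit
	used   map[string]map[string]int64 // namespace -> resource -> current usage
}

// FitQuota reports whether adding (memReq, coreReq) in the namespace would stay
// within its limits. It only reads state; usage is committed later by AddUsage.
func (q *QuotaManager) FitQuota(ns string, memReq, coreReq int64) bool {
	q.mu.RLock()
	defer q.mu.RUnlock()
	lim, ok := q.limits[ns]
	if !ok {
		return true // no quota defined for this namespace
	}
	use := q.used[ns]
	return use["gpumem"]+memReq <= lim["gpumem"] &&
		use["gpucores"]+coreReq <= lim["gpucores"]
}

// AddUsage commits a pod's consumption once the pod is actually admitted.
func (q *QuotaManager) AddUsage(ns string, memReq, coreReq int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.used[ns] == nil {
		q.used[ns] = map[string]int64{}
	}
	q.used[ns]["gpumem"] += memReq
	q.used[ns]["gpucores"] += coreReq
}
```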
force-pushed from 6b1374c to 18f059d
Signed-off-by: Jifei Wang <[email protected]>
force-pushed from 18f059d to e9c1615
cc @archlitchi
/gemini review
Code Review
This pull request introduces resource quota support, primarily for NVIDIA devices. It adds a QuotaManager to track resource usage against limits defined in ResourceQuota objects. The scheduler is updated to use this manager during pod filtering and to keep usage up-to-date based on pod and quota events. The implementation is mostly generic, but a key function for calculating pod resource usage is hardcoded for NVIDIA, which limits the feature's applicability to other device types. Additionally, there's a minor point about string manipulation that could be improved for safety. The new tests for the quota feature are comprehensive and well-written.
@FouoF Are there any docs about how to configure ResourceQuota in https://github.com/Project-HAMi/website ?
Signed-off-by: Jifei Wang <[email protected]>
force-pushed from 65d8ed8 to 6aca355
Signed-off-by: limengxuan <[email protected]>
@FouoF I think it's better to add this quota usage to the scheduler monitor.
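One way such quota usage could be surfaced in the scheduler monitor is as a pair of Prometheus gauges. This is a minimal sketch assuming the standard Prometheus Go client; the metric and label names are placeholders, not necessarily what the PR registers.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	quotaUsed = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "hami_resourcequota_used",
			Help: "Current device resource usage counted against a namespace quota.",
		},
		[]string{"namespace", "resource"},
	)
	quotaLimit = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "hami_resourcequota_limit",
			Help: "Hard limit for a device resource in a namespace quota.",
		},
		[]string{"namespace", "resource"},
	)
)

func init() {
	prometheus.MustRegister(quotaUsed, quotaLimit)
}

// recordQuota updates both gauges for one namespace/resource pair.
func recordQuota(namespace, resourceName string, used, limit float64) {
	quotaUsed.WithLabelValues(namespace, resourceName).Set(used)
	quotaLimit.WithLabelValues(namespace, resourceName).Set(limit)
}
```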
Signed-off-by: Jifei Wang <[email protected]>
force-pushed from a923caf to d2957d4
added
 podDev, _ := device.DecodePodDevices(device.SupportDevices, pod.Annotations)
-s.addPod(pod, nodeID, podDev)
+if s.addPod(pod, nodeID, podDev) {
+	s.quotaManager.AddUsage(pod, podDev)
@FouoF If this is the only place that calls AddUsage of QuotaManager, there is a chance that ResourceQuota may not work well when multiple Pods call FitQuota of QuotaManager simultaneously.
Both addPod and AddUsage are protected by their own mutex locks, so I think it's thread-safe.
@archlitchi It is not the mutex lock that makes it thread-safe; it is about the timing of when usage is added to the QuotaManager. Consider the situation where the quota for the target namespace has only enough resources left for one Pod, and Pods A and B call FitQuota simultaneously in the Filter phase. Because the Used counter inside QuotaManager is not incremented right after the first call, both will pass the quota check (let me know if I misunderstand something).
@FouoF @archlitchi I checked the code of the ResourceQuota admission plugin in the Kubernetes API server; it uses a workqueue indexed by namespace, so perhaps we can take that as a reference.
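For reference, the per-namespace serialization idea could look roughly like the sketch below, using client-go's workqueue with the namespace as the key so that quota evaluation for one namespace never races with itself. This is only an illustration of the referenced pattern, not code from this PR; quotaWorker, enqueue, and evaluate are hypothetical names.

```go
package quota

import "k8s.io/client-go/util/workqueue"

type quotaWorker struct {
	queue workqueue.Interface
	// evaluate runs quota admission for all pending pods of one namespace, so
	// checks within a namespace are processed one at a time.
	evaluate func(namespace string)
}

func newQuotaWorker(eval func(string)) *quotaWorker {
	return &quotaWorker{queue: workqueue.New(), evaluate: eval}
}

// enqueue requests (re-)evaluation for a namespace; repeated adds of the same
// key collapse into a single pending item because the workqueue de-duplicates.
func (w *quotaWorker) enqueue(namespace string) {
	w.queue.Add(namespace)
}

// run drains the queue, handling one namespace at a time until shutdown.
func (w *quotaWorker) run() {
	for {
		key, shutdown := w.queue.Get()
		if shutdown {
			return
		}
		w.evaluate(key.(string))
		w.queue.Done(key)
	}
}
```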
Can a namespace-scoped mutex lock solve it? If the namespace quota is in use, deny the request directly.
@FouoF I think a namespace-scoped reentrant lock which can be acquired in FitQuota and released in AddUsage may solve it.
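A minimal sketch of that proposal, building on the hypothetical QuotaManager shape from earlier in this thread: FitQuota takes a per-namespace lock, and only the commit (or an explicit abort) releases it, so at most one pod per namespace sits between check and commit. All names here are illustrative, not the PR's implementation; a real version would also need a release on every failure or abort path to avoid leaking the lock.

```go
package quota

import "sync"

// namespaceLocks hands out one mutex per namespace, created lazily.
type namespaceLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newNamespaceLocks() *namespaceLocks {
	return &namespaceLocks{locks: make(map[string]*sync.Mutex)}
}

func (n *namespaceLocks) forNamespace(ns string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks[ns] == nil {
		n.locks[ns] = &sync.Mutex{}
	}
	return n.locks[ns]
}

// FitQuotaLocked holds the namespace lock across the check; on success the
// caller must later call CommitUsage exactly once to release it.
func (q *QuotaManager) FitQuotaLocked(nsLocks *namespaceLocks, ns string, memReq, coreReq int64) bool {
	nsLocks.forNamespace(ns).Lock()
	if !q.FitQuota(ns, memReq, coreReq) {
		nsLocks.forNamespace(ns).Unlock() // nothing to commit, release immediately
		return false
	}
	return true // lock intentionally held until CommitUsage
}

// CommitUsage records the usage and releases the namespace lock.
func (q *QuotaManager) CommitUsage(nsLocks *namespaceLocks, ns string, memReq, coreReq int64) {
	q.AddUsage(ns, memReq, coreReq)
	nsLocks.forNamespace(ns).Unlock()
}
```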
> @archlitchi It is not the mutex lock that makes it thread-safe; it is about the timing of when usage is added to the QuotaManager. Consider the situation where the quota for the target namespace has only enough resources left for one Pod, and Pods A and B call FitQuota simultaneously in the Filter phase. Because the Used counter inside QuotaManager is not incremented right after the first call, both will pass the quota check (let me know if I misunderstand something).
Yes, you are right, but this problem is beyond the ResourceQuota scope; it explains issue #1330, where multiple tasks pass Fit() before AddResource() is called. I think we need to carefully design how to synchronize these in v2.8 and add it to the roadmap.
@archlitchi All right. I suggest adding a note about the concurrency issue to the ResourceQuota doc for this version, to remind users that it may not work reliably under high load.
Signed-off-by: Jifei Wang <[email protected]>
/gemini review
Code Review
This pull request introduces resource quota support for NVIDIA devices, which is a valuable addition. The implementation includes a QuotaManager to track resource usage against defined quotas, updates to the scheduler to enforce quotas, and new metrics for monitoring. My review has identified a few areas for improvement:
- A leftover debug log statement should be cleaned up.
- The quota calculation logic is currently hardcoded for NVIDIA devices, which limits the feature's applicability to other device types. Generalizing this would be beneficial.
- The node informer is no longer handling update events, which could cause the scheduler to operate on stale node information.
Overall, the changes are well-structured, and the new tests provide good coverage for the resource quota functionality. Addressing the points above will enhance the robustness and extensibility of the implementation.
Signed-off-by: Jifei Wang <[email protected]>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: archlitchi, FouoF The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds resource quota support for NVIDIA multi-device requests and whole-device requests.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: