
Conversation

FouoF
Contributor

@FouoF FouoF commented Sep 22, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds resource quota support for NVIDIA multi-device requests and whole-device requests.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Contributor

Summary of Changes

Hello @FouoF, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature by adding resource quota support for NVIDIA GPUs. It enables administrators to define and enforce limits on GPU memory and core usage at the namespace level, for both individual and multiple GPU requests. The changes involve a new, centralized quota management system, integration of quota checks directly into the device allocation logic, and updates to the scheduler to dynamically track and respond to Kubernetes ResourceQuota objects, ensuring efficient and fair resource distribution.
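
As a rough illustration of what such a manager could look like (a minimal sketch only, not the PR's actual code: the map layout, resource-name keys, and method signatures are assumptions), a namespace-keyed QuotaManager might be:

```go
package device

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// QuotaManager tracks per-namespace limits and current usage for device
// resources such as GPU memory and cores. Field names are illustrative.
type QuotaManager struct {
	mu    sync.RWMutex
	limit map[string]map[corev1.ResourceName]int64 // namespace -> resource -> limit
	used  map[string]map[corev1.ResourceName]int64 // namespace -> resource -> usage
}

// FitQuota reports whether the requested resources fit within the remaining
// quota of the namespace. It only checks; it does not reserve anything.
func (q *QuotaManager) FitQuota(ns string, request map[corev1.ResourceName]int64) bool {
	q.mu.RLock()
	defer q.mu.RUnlock()
	limits, ok := q.limit[ns]
	if !ok {
		return true // no quota configured for this namespace
	}
	for name, qty := range request {
		if limit, limited := limits[name]; limited && q.used[ns][name]+qty > limit {
			return false
		}
	}
	return true
}

// AddUsage records the resources consumed by a pod after it is admitted.
func (q *QuotaManager) AddUsage(ns string, request map[corev1.ResourceName]int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.used[ns] == nil {
		q.used[ns] = map[corev1.ResourceName]int64{}
	}
	for name, qty := range request {
		q.used[ns][name] += qty
	}
}
```

In the PR itself, the amounts appear to be derived from the pod's decoded device annotations, as the `AddUsage(pod, podDev)` call quoted later in this thread suggests.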

Highlights

  • NVIDIA Resource Quota Support: Introduced a new feature to support resource quotas for NVIDIA GPUs, covering both multi-device and whole-device requests.
  • Quota Management System: A new QuotaManager is implemented in pkg/device/quota.go to track and enforce resource limits across namespaces.
  • Standardized Resource Naming: The Devices interface now includes GetResourceNames(), and all device implementations (NVIDIA, Ascend, AWSNeuron, etc.) are updated to provide their specific resource names (a rough interface sketch follows this list).
  • Integrated Quota Checks: The Fit method for NVIDIA GPUs now incorporates a fitQuota function to validate resource requests against defined quotas before allocation.
  • Scheduler Integration: The scheduler is enhanced to watch Kubernetes ResourceQuota objects and manage their state, updating quota usage when pods are added or deleted.
  • New Reason for Unfit Pods: A ResourceQuotaNotFit reason has been added to pkg/device/common/common.go to indicate when a pod cannot be scheduled due to quota limitations.
  • Comprehensive Test Cases: New unit tests are added to pkg/scheduler/scheduler_test.go to verify the correct behavior of the resource quota system under various scenarios, including successful allocations and quota denials.
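
For the resource-naming highlight above, here is a rough sketch of the kind of interface addition being described (the grouping struct and exact signature are assumptions, not the PR's code):

```go
package device

// Devices is trimmed here to the new method; the real interface has more.
type Devices interface {
	// GetResourceNames reports the resource names this device type registers,
	// so quota logic can match them against ResourceQuota entries.
	GetResourceNames() ResourceNames
}

// ResourceNames is an illustrative grouping of the names a device type might
// expose, e.g. its device-count, memory, and core resource names.
type ResourceNames struct {
	ResourceCountName  string
	ResourceMemoryName string
	ResourceCoreName   string
}
```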

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces resource quota support, primarily targeting NVIDIA GPUs. It adds a QuotaManager to track usage and limits for GPU memory and cores, and integrates this with the scheduler. The Devices interface is updated to support this, though some parts of the implementation are specific to NVIDIA.

My review has identified a few critical issues, including a hardcoded check for NVIDIA devices that prevents the feature from being generic, and a deadlock in the pod deletion logic. I've also pointed out several medium-severity issues related to thread safety, potential panics, and code robustness in the new quota management logic. Addressing these points will improve the correctness and reliability of this new feature.


codecov bot commented Sep 22, 2025

Codecov Report

❌ Patch coverage is 56.72043% with 161 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/device/quota.go | 77.07% | 24 Missing and 12 partials ⚠️ |
| pkg/scheduler/scheduler.go | 38.46% | 21 Missing and 3 partials ⚠️ |
| pkg/device/nvidia/device.go | 46.15% | 12 Missing and 2 partials ⚠️ |
| pkg/device/kunlun/vdevice.go | 0.00% | 13 Missing ⚠️ |
| pkg/device/kunlun/device.go | 0.00% | 10 Missing ⚠️ |
| pkg/device/ascend/device.go | 46.66% | 8 Missing ⚠️ |
| pkg/device/enflame/device.go | 56.25% | 7 Missing ⚠️ |
| pkg/device/awsneuron/device.go | 40.00% | 6 Missing ⚠️ |
| pkg/device/cambricon/device.go | 45.45% | 6 Missing ⚠️ |
| pkg/device/enflame/gcu.go | 45.45% | 6 Missing ⚠️ |

... and 6 more
| Flag | Coverage Δ |
|---|---|
| unittests | 63.75% <56.72%> (-0.29%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/device/common/common.go | 0.00% <ø> (ø) |
| pkg/device/devices.go | 78.65% <100.00%> (+0.79%) ⬆️ |
| pkg/scheduler/pods.go | 100.00% <100.00%> (ø) |
| pkg/util/util.go | 70.94% <100.00%> (ø) |
| pkg/scheduler/config/config.go | 73.58% <50.00%> (ø) |
| pkg/device/awsneuron/device.go | 75.81% <40.00%> (-1.16%) ⬇️ |
| pkg/device/cambricon/device.go | 82.60% <45.45%> (-1.53%) ⬇️ |
| pkg/device/enflame/gcu.go | 90.40% <45.45%> (-4.43%) ⬇️ |
| pkg/device/hygon/device.go | 92.94% <50.00%> (-2.19%) ⬇️ |
| pkg/device/iluvatar/device.go | 70.86% <45.45%> (-1.53%) ⬇️ |

... and 10 more

Signed-off-by: Jifei Wang <[email protected]>
@FouoF
Contributor Author

FouoF commented Sep 22, 2025

cc @archlitchi

@wawa0210
Member

Bot detected the issue body's language is not English, translated it automatically.


cc @archlitchi

@wawa0210
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces resource quota support, primarily for NVIDIA devices. It adds a QuotaManager to track resource usage against limits defined in ResourceQuota objects. The scheduler is updated to use this manager during pod filtering and to keep usage up-to-date based on pod and quota events. The implementation is mostly generic, but a key function for calculating pod resource usage is hardcoded for NVIDIA, which limits the feature's applicability to other device types. Additionally, there's a minor point about string manipulation that could be improved for safety. The new tests for the quota feature are comprehensive and well-written.

Collaborator

@Shouren Shouren left a comment


@FouoF Are there any docs about how to configure ResourceQuota in https://github.com/Project-HAMi/website ?

@FouoF
Contributor Author

FouoF commented Sep 23, 2025

@FouoF Are there any docs about how to configure ResourceQuota in https://github.com/Project-HAMi/website ?

Project-HAMi/website#104

Signed-off-by: Jifei Wang <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
@hami-robot hami-robot bot added size/XXL and removed size/XL labels Sep 25, 2025
@archlitchi
Member

@FouoF I think it's better to add this quota usage to the scheduler monitor

Signed-off-by: Jifei Wang <[email protected]>
@FouoF
Contributor Author

FouoF commented Sep 25, 2025

@FouoF I think it's better to add this quota usage to the scheduler monitor

added

```diff
 podDev, _ := device.DecodePodDevices(device.SupportDevices, pod.Annotations)
-s.addPod(pod, nodeID, podDev)
+if s.addPod(pod, nodeID, podDev) {
+	s.quotaManager.AddUsage(pod, podDev)
+}
```
Collaborator

@Shouren Shouren Sep 25, 2025


@FouoF If this is the only place that calls AddUsage of the QuotaManager, there is a chance that ResourceQuota may not work correctly when multiple Pods call FitQuota of the QuotaManager simultaneously.

Member


Both addPod and AddUsage are protected by their own mutex locks, so I think it's thread-safe.

Collaborator

@Shouren Shouren Sep 25, 2025


@archlitchi It is not the mutex lock that makes it thread-safe; it is about when usage is added to the QuotaManager.

Consider the following situation: the quota for the target namespace has only enough resources left for one Pod, and Pods A and B call FitQuota simultaneously in the Filter phase. Because the Used counter inside the QuotaManager has not been incremented right after the first call, both will pass the quota check (let me know if I misunderstood something).
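
A self-contained toy sketch of that check-then-act window (the counters and names are illustrative, not the PR's code); both Filter-phase checks succeed before either usage update lands:

```go
package main

import (
	"fmt"
	"sync"
)

// quota is a toy stand-in for the per-namespace counters being discussed.
type quota struct {
	mu    sync.Mutex
	limit int64
	used  int64
}

// fit only checks the remaining quota; nothing is reserved here.
func (q *quota) fit(req int64) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.used+req <= q.limit
}

// addUsage runs later (after binding), which leaves the window.
func (q *quota) addUsage(req int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.used += req
}

func main() {
	q := &quota{limit: 1} // only one pod's worth of quota remains

	var wg sync.WaitGroup
	pass := make([]bool, 2)
	for i := range pass {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			pass[i] = q.fit(1) // Pod A and Pod B in the Filter phase
		}(i)
	}
	wg.Wait()

	q.addUsage(1)
	q.addUsage(1)
	// Both checks pass even though the second pod exceeds the quota.
	fmt.Println("checks:", pass, "used:", q.used, "limit:", q.limit)
}
```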

Collaborator


@FouoF @archlitchi I checked the code of the ResourceQuota admission plugin in the Kubernetes API server; it uses a workqueue indexed by namespace. Perhaps we can take it as a reference.

Contributor Author


Can a namespace-scoped mutex lock solve it? If the namespace's quota is in use, directly deny the request.

Collaborator


@FouoF I think a namespace-scoped reentrant lock that can be acquired in FitQuota and released in AddUsage may solve it.
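
A minimal sketch of that suggestion, with illustrative names only: Go has no built-in reentrant lock, so this simply keeps a per-namespace mutex held from a successful check until the matching usage update, and the caller must guarantee every successful FitQuota is eventually followed by AddUsage (or an explicit release), otherwise the namespace stays locked:

```go
package device

import "sync"

// namespaceQuota holds a lock per namespace that is taken by FitQuota and
// released by AddUsage, closing the check-then-act window described above.
// The names and int64 counters are illustrative.
type namespaceQuota struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
	limit map[string]int64
	used  map[string]int64
}

func newNamespaceQuota(limits map[string]int64) *namespaceQuota {
	return &namespaceQuota{
		locks: map[string]*sync.Mutex{},
		limit: limits,
		used:  map[string]int64{},
	}
}

// lockFor returns the mutex for a namespace, creating it on first use.
func (n *namespaceQuota) lockFor(ns string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks[ns] == nil {
		n.locks[ns] = &sync.Mutex{}
	}
	return n.locks[ns]
}

// FitQuota takes the namespace lock, checks the request, and keeps the lock
// held on success so no concurrent check can pass before AddUsage runs.
func (n *namespaceQuota) FitQuota(ns string, req int64) bool {
	l := n.lockFor(ns)
	l.Lock()
	n.mu.Lock()
	limit, limited := n.limit[ns]
	used := n.used[ns]
	n.mu.Unlock()
	if limited && used+req > limit {
		l.Unlock()
		return false
	}
	return true // namespace lock intentionally held until AddUsage
}

// AddUsage records the usage and releases the lock taken by FitQuota.
func (n *namespaceQuota) AddUsage(ns string, req int64) {
	n.mu.Lock()
	n.used[ns] += req
	n.mu.Unlock()
	n.lockFor(ns).Unlock()
}
```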

Member


@archlitchi It is not the mutex lock that makes it thread-safe; it is about when usage is added to the QuotaManager.

Consider the following situation: the quota for the target namespace has only enough resources left for one Pod, and Pods A and B call FitQuota simultaneously in the Filter phase. Because the Used counter inside the QuotaManager has not been incremented right after the first call, both will pass the quota check (let me know if I misunderstood something).

Yes, you are right, but this problem is beyond the ResourceQuota scope; it explains issue #1330: multiple tasks can pass Fit() before AddResource() runs. I think we need to carefully design how to synchronize these in v2.8 and add it to the roadmap.

Collaborator


@archlitchi All right. I suggest adding a note about this concurrency issue to the ResourceQuota docs for this version, reminding users that it may not work reliably under high load.

Signed-off-by: Jifei Wang <[email protected]>
@wawa0210
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces resource quota support for NVIDIA devices, which is a valuable addition. The implementation includes a QuotaManager to track resource usage against defined quotas, updates to the scheduler to enforce quotas, and new metrics for monitoring. My review has identified a few areas for improvement:

  • A leftover debug log statement should be cleaned up.
  • The quota calculation logic is currently hardcoded for NVIDIA devices, which limits the feature's applicability to other device types. Generalizing this would be beneficial.
  • The node informer is no longer handling update events, which could cause the scheduler to operate on stale node information.

Overall, the changes are well-structured, and the new tests provide good coverage for the resource quota functionality. Addressing the points above will enhance the robustness and extensibility of the implementation.
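
On the node-informer point, a minimal sketch of keeping an update handler registered, assuming a client-go SharedInformerFactory is used (the function name and callback wiring are illustrative, not the PR's code):

```go
package scheduler

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

// registerNodeHandlers keeps an UpdateFunc on the node informer so cached node
// information is refreshed instead of going stale. Illustrative only.
func registerNodeHandlers(factory informers.SharedInformerFactory, onNodeChange func(obj interface{})) {
	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    onNodeChange,
		UpdateFunc: func(_, newObj interface{}) { onNodeChange(newObj) },
		DeleteFunc: onNodeChange,
	})
}
```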

Signed-off-by: Jifei Wang <[email protected]>
Member

@archlitchi archlitchi left a comment


/lgtm

@hami-robot
Contributor

hami-robot bot commented Sep 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, FouoF

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot added the approved label Sep 26, 2025
@archlitchi archlitchi merged commit 52c086c into Project-HAMi:master Sep 26, 2025
15 of 17 checks passed