
Conversation

FouoF
Contributor

@FouoF FouoF commented Sep 22, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds resource quota support for NVIDIA multi-device requests and whole-device requests.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Contributor

Summary of Changes

Hello @FouoF, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature by adding resource quota support for NVIDIA GPUs. It enables administrators to define and enforce limits on GPU memory and core usage at the namespace level, for both individual and multiple GPU requests. The changes involve a new, centralized quota management system, integration of quota checks directly into the device allocation logic, and updates to the scheduler to dynamically track and respond to Kubernetes ResourceQuota objects, ensuring efficient and fair resource distribution.
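
As a rough illustration of what such a manager could look like (a minimal sketch only, not the PR's actual code: the map layout, resource-name keys, and method signatures are assumptions), a namespace-keyed QuotaManager might be:

```go
package device

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// QuotaManager tracks per-namespace limits and current usage for device
// resources such as GPU memory and cores. Field names are illustrative.
type QuotaManager struct {
	mu    sync.RWMutex
	limit map[string]map[corev1.ResourceName]int64 // namespace -> resource -> limit
	used  map[string]map[corev1.ResourceName]int64 // namespace -> resource -> usage
}

// FitQuota reports whether the requested resources fit within the remaining
// quota of the namespace. It only checks; it does not reserve anything.
func (q *QuotaManager) FitQuota(ns string, request map[corev1.ResourceName]int64) bool {
	q.mu.RLock()
	defer q.mu.RUnlock()
	limits, ok := q.limit[ns]
	if !ok {
		return true // no quota configured for this namespace
	}
	for name, qty := range request {
		if limit, limited := limits[name]; limited && q.used[ns][name]+qty > limit {
			return false
		}
	}
	return true
}

// AddUsage records the resources consumed by a pod after it is admitted.
func (q *QuotaManager) AddUsage(ns string, request map[corev1.ResourceName]int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.used[ns] == nil {
		q.used[ns] = map[corev1.ResourceName]int64{}
	}
	for name, qty := range request {
		q.used[ns][name] += qty
	}
}
```

In the PR itself, the amounts appear to be derived from the pod's decoded device annotations, as the `AddUsage(pod, podDev)` call quoted later in this thread suggests.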

Highlights

  • NVIDIA Resource Quota Support: Introduced a new feature to support resource quotas for NVIDIA GPUs, covering both multi-device and whole-device requests.
  • Quota Management System: A new QuotaManager is implemented in pkg/device/quota.go to track and enforce resource limits across namespaces.
  • Standardized Resource Naming: The Devices interface now includes GetResourceNames(), and all device implementations (NVIDIA, Ascend, AWSNeuron, etc.) are updated to provide their specific resource names (a rough interface sketch follows this list).
  • Integrated Quota Checks: The Fit method for NVIDIA GPUs now incorporates a fitQuota function to validate resource requests against defined quotas before allocation.
  • Scheduler Integration: The scheduler is enhanced to watch Kubernetes ResourceQuota objects and manage their state, updating quota usage when pods are added or deleted.
  • New Reason for Unfit Pods: A ResourceQuotaNotFit reason has been added to pkg/device/common/common.go to indicate when a pod cannot be scheduled due to quota limitations.
  • Comprehensive Test Cases: New unit tests are added to pkg/scheduler/scheduler_test.go to verify the correct behavior of the resource quota system under various scenarios, including successful allocations and quota denials.
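
For the resource-naming highlight above, here is a rough sketch of the kind of interface addition being described (the grouping struct and exact signature are assumptions, not the PR's code):

```go
package device

// Devices is trimmed here to the new method; the real interface has more.
type Devices interface {
	// GetResourceNames reports the resource names this device type registers,
	// so quota logic can match them against ResourceQuota entries.
	GetResourceNames() ResourceNames
}

// ResourceNames is an illustrative grouping of the names a device type might
// expose, e.g. its device-count, memory, and core resource names.
type ResourceNames struct {
	ResourceCountName  string
	ResourceMemoryName string
	ResourceCoreName   string
}
```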

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces resource quota support, primarily targeting NVIDIA GPUs. It adds a QuotaManager to track usage and limits for GPU memory and cores, and integrates this with the scheduler. The Devices interface is updated to support this, though some parts of the implementation are specific to NVIDIA.

My review has identified a few critical issues, including a hardcoded check for NVIDIA devices that prevents the feature from being generic, and a deadlock in the pod deletion logic. I've also pointed out several medium-severity issues related to thread safety, potential panics, and code robustness in the new quota management logic. Addressing these points will improve the correctness and reliability of this new feature.


codecov bot commented Sep 22, 2025

Codecov Report

❌ Patch coverage is 56.72043% with 161 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/device/quota.go | 77.07% | 24 Missing and 12 partials ⚠️ |
| pkg/scheduler/scheduler.go | 38.46% | 21 Missing and 3 partials ⚠️ |
| pkg/device/nvidia/device.go | 46.15% | 12 Missing and 2 partials ⚠️ |
| pkg/device/kunlun/vdevice.go | 0.00% | 13 Missing ⚠️ |
| pkg/device/kunlun/device.go | 0.00% | 10 Missing ⚠️ |
| pkg/device/ascend/device.go | 46.66% | 8 Missing ⚠️ |
| pkg/device/enflame/device.go | 56.25% | 7 Missing ⚠️ |
| pkg/device/awsneuron/device.go | 40.00% | 6 Missing ⚠️ |
| pkg/device/cambricon/device.go | 45.45% | 6 Missing ⚠️ |
| pkg/device/enflame/gcu.go | 45.45% | 6 Missing ⚠️ |

... and 6 more
| Flag | Coverage Δ |
|---|---|
| unittests | 63.75% <56.72%> (-0.29%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/device/common/common.go | 0.00% <ø> (ø) |
| pkg/device/devices.go | 78.65% <100.00%> (+0.79%) ⬆️ |
| pkg/scheduler/pods.go | 100.00% <100.00%> (ø) |
| pkg/util/util.go | 70.94% <100.00%> (ø) |
| pkg/scheduler/config/config.go | 73.58% <50.00%> (ø) |
| pkg/device/awsneuron/device.go | 75.81% <40.00%> (-1.16%) ⬇️ |
| pkg/device/cambricon/device.go | 82.60% <45.45%> (-1.53%) ⬇️ |
| pkg/device/enflame/gcu.go | 90.40% <45.45%> (-4.43%) ⬇️ |
| pkg/device/hygon/device.go | 92.94% <50.00%> (-2.19%) ⬇️ |
| pkg/device/iluvatar/device.go | 70.86% <45.45%> (-1.53%) ⬇️ |

... and 10 more

Signed-off-by: Jifei Wang <[email protected]>
@FouoF
Contributor Author

FouoF commented Sep 22, 2025

cc @archlitchi

@wawa0210
Member

Bot detected the issue body's language is not English, translated it automatically.


cc @archlitchi

@wawa0210
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces resource quota support, primarily for NVIDIA devices. It adds a QuotaManager to track resource usage against limits defined in ResourceQuota objects. The scheduler is updated to use this manager during pod filtering and to keep usage up-to-date based on pod and quota events. The implementation is mostly generic, but a key function for calculating pod resource usage is hardcoded for NVIDIA, which limits the feature's applicability to other device types. Additionally, there's a minor point about string manipulation that could be improved for safety. The new tests for the quota feature are comprehensive and well-written.

Collaborator

@Shouren Shouren left a comment


@FouoF Are there any docs about how to configure ResourceQuota in https://github.com/Project-HAMi/website ?

@FouoF
Contributor Author

FouoF commented Sep 23, 2025

@FouoF Are there any docs about how to configure ResourceQuota in https://github.com/Project-HAMi/website ?

Project-HAMi/website#104

Signed-off-by: Jifei Wang <[email protected]>
Signed-off-by: limengxuan <[email protected]>
Signed-off-by: limengxuan <[email protected]>
@hami-robot hami-robot bot added size/XXL and removed size/XL labels Sep 25, 2025
@archlitchi
Member

@FouoF I think it's better to add this quota usage to the scheduler monitor

Signed-off-by: Jifei Wang <[email protected]>
@FouoF
Contributor Author

FouoF commented Sep 25, 2025

@FouoF I think it's better to add this quota usage to the scheduler monitor

added

```diff
 podDev, _ := device.DecodePodDevices(device.SupportDevices, pod.Annotations)
-s.addPod(pod, nodeID, podDev)
+if s.addPod(pod, nodeID, podDev) {
+	s.quotaManager.AddUsage(pod, podDev)
+}
```
Collaborator

@Shouren Shouren Sep 25, 2025


@FouoF If this is the only place that calls AddUsage of the QuotaManager, there is a chance that ResourceQuota may not work correctly when multiple Pods call FitQuota of the QuotaManager simultaneously.

Member


Both addPod and AddUsage are protected by their own mutex locks, so I think it's thread-safe.

Collaborator

@Shouren Shouren Sep 25, 2025


@archlitchi It is not the mutex lock that makes it thread-safe; it is about when usage is added to the QuotaManager.

Consider the following situation: the quota for the target namespace has only enough resources left for one Pod, and Pods A and B call FitQuota simultaneously in the Filter phase. Because the Used counter inside the QuotaManager has not been incremented right after the first call, both will pass the quota check (let me know if I misunderstood something).
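
A self-contained toy sketch of that check-then-act window (the counters and names are illustrative, not the PR's code); both Filter-phase checks succeed before either usage update lands:

```go
package main

import (
	"fmt"
	"sync"
)

// quota is a toy stand-in for the per-namespace counters being discussed.
type quota struct {
	mu    sync.Mutex
	limit int64
	used  int64
}

// fit only checks the remaining quota; nothing is reserved here.
func (q *quota) fit(req int64) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.used+req <= q.limit
}

// addUsage runs later (after binding), which leaves the window.
func (q *quota) addUsage(req int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.used += req
}

func main() {
	q := &quota{limit: 1} // only one pod's worth of quota remains

	var wg sync.WaitGroup
	pass := make([]bool, 2)
	for i := range pass {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			pass[i] = q.fit(1) // Pod A and Pod B in the Filter phase
		}(i)
	}
	wg.Wait()

	q.addUsage(1)
	q.addUsage(1)
	// Both checks pass even though the second pod exceeds the quota.
	fmt.Println("checks:", pass, "used:", q.used, "limit:", q.limit)
}
```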

Collaborator


@FouoF @archlitchi I checked the code of the ResourceQuota admission plugin in the Kubernetes API server; it uses a workqueue indexed by namespace. Perhaps we can take it as a reference.

Contributor Author


Can a namespace-scoped mutex lock solve it? If the namespace's quota is in use, directly deny the request.

Collaborator


@FouoF I think a namespace-scoped reentrant lock that can be acquired in FitQuota and released in AddUsage may solve it.
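
A minimal sketch of that suggestion, with illustrative names only: Go has no built-in reentrant lock, so this simply keeps a per-namespace mutex held from a successful check until the matching usage update, and the caller must guarantee every successful FitQuota is eventually followed by AddUsage (or an explicit release), otherwise the namespace stays locked:

```go
package device

import "sync"

// namespaceQuota holds a lock per namespace that is taken by FitQuota and
// released by AddUsage, closing the check-then-act window described above.
// The names and int64 counters are illustrative.
type namespaceQuota struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
	limit map[string]int64
	used  map[string]int64
}

func newNamespaceQuota(limits map[string]int64) *namespaceQuota {
	return &namespaceQuota{
		locks: map[string]*sync.Mutex{},
		limit: limits,
		used:  map[string]int64{},
	}
}

// lockFor returns the mutex for a namespace, creating it on first use.
func (n *namespaceQuota) lockFor(ns string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks[ns] == nil {
		n.locks[ns] = &sync.Mutex{}
	}
	return n.locks[ns]
}

// FitQuota takes the namespace lock, checks the request, and keeps the lock
// held on success so no concurrent check can pass before AddUsage runs.
func (n *namespaceQuota) FitQuota(ns string, req int64) bool {
	l := n.lockFor(ns)
	l.Lock()
	n.mu.Lock()
	limit, limited := n.limit[ns]
	used := n.used[ns]
	n.mu.Unlock()
	if limited && used+req > limit {
		l.Unlock()
		return false
	}
	return true // namespace lock intentionally held until AddUsage
}

// AddUsage records the usage and releases the lock taken by FitQuota.
func (n *namespaceQuota) AddUsage(ns string, req int64) {
	n.mu.Lock()
	n.used[ns] += req
	n.mu.Unlock()
	n.lockFor(ns).Unlock()
}
```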

Member


@archlitchi It is not the mutex lock that makes it thread-safe; it is about when usage is added to the QuotaManager.

Consider the following situation: the quota for the target namespace has only enough resources left for one Pod, and Pods A and B call FitQuota simultaneously in the Filter phase. Because the Used counter inside the QuotaManager has not been incremented right after the first call, both will pass the quota check (let me know if I misunderstood something).

Yes, you are right, but this problem is beyond the ResourceQuota scope; it explains issue #1330: multiple tasks can pass Fit() before AddResource() runs. I think we need to carefully design how to synchronize these in v2.8 and add it to the roadmap.

Collaborator


@archlitchi All right. I suggest adding a note about this concurrency issue to the ResourceQuota docs for this version, reminding users that it may not work reliably under high load.

Signed-off-by: Jifei Wang <[email protected]>
@wawa0210
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces resource quota support for NVIDIA devices, which is a valuable addition. The implementation includes a QuotaManager to track resource usage against defined quotas, updates to the scheduler to enforce quotas, and new metrics for monitoring. My review has identified a few areas for improvement:

  • A leftover debug log statement should be cleaned up.
  • The quota calculation logic is currently hardcoded for NVIDIA devices, which limits the feature's applicability to other device types. Generalizing this would be beneficial.
  • The node informer is no longer handling update events, which could cause the scheduler to operate on stale node information.

Overall, the changes are well-structured, and the new tests provide good coverage for the resource quota functionality. Addressing the points above will enhance the robustness and extensibility of the implementation.
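
On the node-informer point, a minimal sketch of keeping an update handler registered, assuming a client-go SharedInformerFactory is used (the function name and callback wiring are illustrative, not the PR's code):

```go
package scheduler

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

// registerNodeHandlers keeps an UpdateFunc on the node informer so cached node
// information is refreshed instead of going stale. Illustrative only.
func registerNodeHandlers(factory informers.SharedInformerFactory, onNodeChange func(obj interface{})) {
	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    onNodeChange,
		UpdateFunc: func(_, newObj interface{}) { onNodeChange(newObj) },
		DeleteFunc: onNodeChange,
	})
}
```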

Signed-off-by: Jifei Wang <[email protected]>
Member

@archlitchi archlitchi left a comment


/lgtm

@hami-robot
Contributor

hami-robot bot commented Sep 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, FouoF

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot added the approved label Sep 26, 2025
@archlitchi archlitchi merged commit 52c086c into Project-HAMi:master Sep 26, 2025
15 of 17 checks passed