
GPU metric anomaly: cache entries were not promptly cleared after pod deletion, so released resources remained counted in the metric #1458

@peachest

Description


What happened:

HAMi version: v2.5.2

At a certain point, an NVIDIA GPU card was not occupied by any Pod (nvidia-smi showed no memory usage, and no Pod annotation indicated a binding to this card). However, the Prometheus metric nodevGPUMemoryAllocated reported by the HAMi scheduler showed the card's entire memory as allocated. This stale state prevented the actually idle memory from being allocated and produced erroneous monitoring metrics.

[Screenshot: cluster HAMi metrics]

[Screenshot: actual resource usage on the node]

What you expected to happen:

Upon deletion of a GPU Pod, the HAMi scheduler should accurately update the device resource usage in its internal cache. If no active Pod occupies a GPU card, the nodevGPUMemoryAllocated metric should report either 0 or the actual usage, and no allocation records for deleted Pods should remain.

After restarting the HAMi scheduler, the metric is correct. Screenshot as follows:

[Screenshot: metrics after scheduler restart]

How to reproduce it (as minimally and precisely as possible):

This is presumed to be related to the Informer event-handling mechanism (see the next section) and is difficult to reproduce reliably.

Anything else we need to know?:

Based on source-code analysis, the root cause of this issue may be as follows:

  • The HAMi scheduler listens for Pod Add/Update/Delete events via an Informer and calls podManager.delPod() to clear the cache when it receives a Delete event.
  • If the event handler fails to fully process the Delete event (for example because 1. the Delete event is lost, or 2. the Informer cannot obtain the Pod's final state and instead passes an object of type DeletedFinalStateUnknown to the delete handler; see the sketch after this list), the resource information for the deleted Pod remains in the podManager cache.
  • getNodesUsage() recalculates node resource utilization from this cache on every scheduling cycle and during the periodic update in the RegisterFromNodeAnnotations goroutine, so the released GPU memory keeps being counted in Device.Usedmem and is ultimately reflected in the Prometheus metrics.
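
To illustrate the second failure mode, here is a minimal sketch of the standard client-go pattern for unwrapping a DeletedFinalStateUnknown tombstone in a delete handler; onPodDelete and delPod are hypothetical stand-ins for HAMi's actual handler and podManager.delPod() path, not its real API. If a handler only type-asserts to *corev1.Pod, a tombstone delivery is silently dropped and the stale entry stays in the cache.

```go
package scheduler

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// onPodDelete is a hypothetical delete event handler. When the watch
// connection is interrupted, client-go may deliver a
// cache.DeletedFinalStateUnknown tombstone instead of a *corev1.Pod;
// handlers that skip the unwrapping below never see the deleted Pod.
func onPodDelete(obj interface{}) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		// Unwrap the tombstone to recover the last known state of the Pod.
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			klog.Errorf("unexpected object type on delete: %T", obj)
			return
		}
		pod, ok = tombstone.Obj.(*corev1.Pod)
		if !ok {
			klog.Errorf("tombstone contained a non-Pod object: %T", tombstone.Obj)
			return
		}
	}
	delPod(pod) // hypothetical cache cleanup, analogous to podManager.delPod()
}

// delPod is a placeholder for removing the Pod's device allocations
// from the scheduler's in-memory cache.
func delPod(pod *corev1.Pod) {
	klog.Infof("removing pod %s/%s from device cache", pod.Namespace, pod.Name)
}
```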

Currently, v2.5.2, subsequent release branches, and the master branch all lack a mechanism for periodic full cache synchronization or fallback validation.

Interim Solution: Restart the HAMi scheduler to force the cache to be rebuilt from the current Pod state.
Long-Term Solution: Implement a periodic full-sync mechanism (e.g., traverse all GPU Pods every hour to recalibrate the podManager cache) and purge records of Pods that no longer exist; a sketch follows below.
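
As a hedged sketch of that long-term fix, the loop below periodically re-lists all Pods and purges cache entries whose Pods no longer exist; PodCache, ListCachedPodUIDs, DelPodByUID, and RunPeriodicFullSync are hypothetical names rather than HAMi's actual API, and a real implementation would likely read from an informer lister and reuse podManager.delPod() instead of calling the API server directly.

```go
package scheduler

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// PodCache is a hypothetical view of HAMi's podManager: it can list the
// Pod UIDs it currently tracks and drop a single entry.
type PodCache interface {
	ListCachedPodUIDs() []types.UID
	DelPodByUID(uid types.UID)
}

// RunPeriodicFullSync re-lists all Pods at a fixed interval and removes
// cache entries for Pods that no longer exist, so a lost Delete event can
// only skew the metrics until the next sync instead of indefinitely.
func RunPeriodicFullSync(ctx context.Context, client kubernetes.Interface, podCache PodCache, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
			if err != nil {
				klog.Errorf("full sync: listing pods failed: %v", err)
				continue
			}
			// Build the set of Pods that still exist in the cluster.
			live := make(map[types.UID]struct{}, len(pods.Items))
			for i := range pods.Items {
				live[pods.Items[i].UID] = struct{}{}
			}
			// Purge cached entries whose Pods are gone.
			for _, uid := range podCache.ListCachedPodUIDs() {
				if _, stillExists := live[uid]; !stillExists {
					klog.Infof("full sync: purging stale cache entry for pod UID %s", uid)
					podCache.DelPodByUID(uid)
				}
			}
		}
	}
}
```

Wiring this in would then be a single call at scheduler startup, e.g. go RunPeriodicFullSync(ctx, clientset, podManagerAdapter, time.Hour), where podManagerAdapter is whatever adapter exposes the existing cache through the hypothetical PodCache interface.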

Metadata

Labels: kind/bug (Something isn't working)