Description
What happened:
HAMi version: v2.5.2
At a certain point, an NVIDIA GPU card was not occupied by any Pod (`nvidia-smi` showed no memory usage, and no Pod annotation indicated a binding to this card). However, the Prometheus metric `nodevGPUMemoryAllocated` reported by the HAMi scheduler still showed the card's entire memory as allocated. This stale state prevented the actually idle memory from being allocated and produced erroneous monitoring metrics.
Screenshot of cluster HAMi metrics:

Screenshot of actual resource usage of the node:

What you expected to happen:
When a GPU Pod is deleted, the HAMi scheduler should accurately update the device usage recorded in its internal cache. If no Pod occupies a GPU card, the `nodevGPUMemoryAllocated` metric should report 0 (or the actual usage), and no allocation records for deleted Pods should remain.
After restarting the HAMi scheduler, the metric is correct. Screenshot as follows:

How to reproduce it (as minimally and precisely as possible):
Presumably related to the Informer event-handling mechanism (see the analysis under "Anything else we need to know?" below); difficult to reproduce reliably.
Anything else we need to know?:
Source code analysis suggests the root cause may be the following:
- The HAMi scheduler watches Pod Add/Update/Delete events via an Informer and calls `podManager.delPod()` to clear the cache when a Delete event arrives.
- If the delete event handler never completes its processing (for example, the Delete event is lost, or the Informer cannot obtain the Pod's final state and passes a `DeletedFinalStateUnknown` object to the delete handler), the resource information of the deleted Pod remains in the `podManager` cache (see the handler sketch after this list).
- `getNodesUsage()` recomputes node resource usage from this cache on every scheduling attempt and during the periodic update in the `RegisterFromNodeAnnotations` goroutine, so the released GPU memory stays counted in `Device.Usedmem` and is ultimately reflected in the Prometheus metrics.
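For reference, a minimal sketch of how a delete handler can unwrap the `DeletedFinalStateUnknown` tombstone with client-go (the `podManager` interface and `onPodDelete` function are illustrative names, not HAMi's actual code):

```go
package podwatch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// podManager is a hypothetical stand-in for HAMi's internal pod/device cache.
type podManager interface {
	DelPod(pod *v1.Pod)
}

// onPodDelete removes a deleted Pod from the cache. It also unwraps
// cache.DeletedFinalStateUnknown tombstones, which the informer delivers when
// it missed the Pod's final state (e.g. after a watch disconnect); skipping
// that case is exactly what would leave stale allocations behind.
func onPodDelete(pm podManager, obj interface{}) {
	pod, ok := obj.(*v1.Pod)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			klog.Errorf("delete handler got unexpected object type: %T", obj)
			return
		}
		pod, ok = tombstone.Obj.(*v1.Pod)
		if !ok {
			klog.Errorf("tombstone contained unexpected object: %T", tombstone.Obj)
			return
		}
	}
	pm.DelPod(pod)
}

// Registration with a shared informer would look roughly like:
//
//	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
//	    DeleteFunc: func(obj interface{}) { onPodDelete(pm, obj) },
//	})
```

The key point is that `tombstone.Obj` still carries the last known Pod object, which is enough to release its device allocations even when the final Watch event was missed.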
Currently, v2.5.2, subsequent branch versions, and the master branch lack mechanisms for periodic full cache synchronisation or fallback validation.
Interim Solution: Restart the HAMi scheduler to force cache rebuilding from the current Pod state.
Long-Term Solution: Implement a periodic full-sync mechanism (e.g., traversing all GPU Pods every hour to calibrate the `podManager` cache) and purge records of Pods that no longer exist.
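A minimal sketch of such a reconciliation loop, assuming hypothetical cache accessors (`ListCachedPodUIDs`, `DelPodByUID`) rather than HAMi's actual `podManager` API:

```go
package fullsync

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// podCache is a hypothetical view of the podManager cache; HAMi's real
// accessors will differ.
type podCache interface {
	ListCachedPodUIDs() []types.UID
	DelPodByUID(uid types.UID)
}

// SyncLoop periodically lists all Pods from the API server and purges cache
// entries whose Pod no longer exists, so allocations leaked by a missed
// Delete event are eventually reclaimed.
func SyncLoop(ctx context.Context, client kubernetes.Interface, c podCache, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
		if err != nil {
			klog.Errorf("full sync: listing pods failed: %v", err)
			continue
		}
		alive := make(map[types.UID]struct{}, len(pods.Items))
		for i := range pods.Items {
			alive[pods.Items[i].UID] = struct{}{}
		}
		for _, uid := range c.ListCachedPodUIDs() {
			if _, ok := alive[uid]; !ok {
				klog.Infof("full sync: purging stale Pod %s from device cache", uid)
				c.DelPodByUID(uid)
			}
		}
	}
}
```

Running this, for example, once an hour (or every few minutes) bounds how long a leaked allocation can distort scheduling decisions and metrics, at the cost of one full Pod list per cycle.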