
GPU metric anomaly: cache entries were not promptly cleared after pod deletion, so released resources remained counted in the metric #1458

@peachest

Description


What happened:

HAMi version: v2.5.2

At a certain point, an NVIDIA GPU card was not occupied by any Pod (nvidia-smi showed no memory usage, and no Pod annotation indicated a binding to this card). However, the Prometheus metric nodevGPUMemoryAllocated reported by the HAMi scheduler showed the card's entire memory as allocated. This stale state prevented the actually idle memory from being allocated and produced erroneous monitoring metrics.

[Screenshot: cluster HAMi metrics]

[Screenshot: actual resource usage on the node]

What you expected to happen:

Upon deletion of a GPU Pod, the HAMi scheduler should accurately update the device resource usage in its internal cache. If no active Pod occupies a GPU card, the nodevGPUMemoryAllocated metric should report either 0 or the actual usage, and no allocation records for deleted Pods should remain.

After restarting the HAMi scheduler, the metric is correct. Screenshot as follows:

[Screenshot: metrics after scheduler restart]

How to reproduce it (as minimally and precisely as possible):

This is presumed to be related to the Informer event-handling mechanism (see the next section) and is difficult to reproduce reliably.

Anything else we need to know?:

Based on source-code analysis, the root cause of this issue may be as follows:

  • The HAMi scheduler listens for Pod Add/Update/Delete events via an Informer and calls podManager.delPod() to clear the cache when it receives a Delete event.
  • If the event handler fails to fully process the Delete event (for example because 1. the Delete event is lost, or 2. the Informer cannot obtain the Pod's final state and instead passes an object of type DeletedFinalStateUnknown to the delete handler; see the sketch after this list), the resource information for the deleted Pod remains in the podManager cache.
  • getNodesUsage() recalculates node resource utilization from this cache on every scheduling cycle and during the periodic update in the RegisterFromNodeAnnotations goroutine, so the released GPU memory keeps being counted in Device.Usedmem and is ultimately reflected in the Prometheus metrics.
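
To illustrate the second failure mode, here is a minimal sketch of the standard client-go pattern for unwrapping a DeletedFinalStateUnknown tombstone in a delete handler; onPodDelete and delPod are hypothetical stand-ins for HAMi's actual handler and podManager.delPod() path, not its real API. If a handler only type-asserts to *corev1.Pod, a tombstone delivery is silently dropped and the stale entry stays in the cache.

```go
package scheduler

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// onPodDelete is a hypothetical delete event handler. When the watch
// connection is interrupted, client-go may deliver a
// cache.DeletedFinalStateUnknown tombstone instead of a *corev1.Pod;
// handlers that skip the unwrapping below never see the deleted Pod.
func onPodDelete(obj interface{}) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		// Unwrap the tombstone to recover the last known state of the Pod.
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			klog.Errorf("unexpected object type on delete: %T", obj)
			return
		}
		pod, ok = tombstone.Obj.(*corev1.Pod)
		if !ok {
			klog.Errorf("tombstone contained a non-Pod object: %T", tombstone.Obj)
			return
		}
	}
	delPod(pod) // hypothetical cache cleanup, analogous to podManager.delPod()
}

// delPod is a placeholder for removing the Pod's device allocations
// from the scheduler's in-memory cache.
func delPod(pod *corev1.Pod) {
	klog.Infof("removing pod %s/%s from device cache", pod.Namespace, pod.Name)
}
```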

Currently, v2.5.2, subsequent release branches, and the master branch all lack a mechanism for periodic full cache synchronization or fallback validation.

Interim Solution: Restart the HAMi scheduler to force the cache to be rebuilt from the current Pod state.
Long-Term Solution: Implement a periodic full-sync mechanism (e.g., traverse all GPU Pods every hour to recalibrate the podManager cache) and purge records of Pods that no longer exist; a sketch follows below.
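
As a hedged sketch of that long-term fix, the loop below periodically re-lists all Pods and purges cache entries whose Pods no longer exist; PodCache, ListCachedPodUIDs, DelPodByUID, and RunPeriodicFullSync are hypothetical names rather than HAMi's actual API, and a real implementation would likely read from an informer lister and reuse podManager.delPod() instead of calling the API server directly.

```go
package scheduler

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// PodCache is a hypothetical view of HAMi's podManager: it can list the
// Pod UIDs it currently tracks and drop a single entry.
type PodCache interface {
	ListCachedPodUIDs() []types.UID
	DelPodByUID(uid types.UID)
}

// RunPeriodicFullSync re-lists all Pods at a fixed interval and removes
// cache entries for Pods that no longer exist, so a lost Delete event can
// only skew the metrics until the next sync instead of indefinitely.
func RunPeriodicFullSync(ctx context.Context, client kubernetes.Interface, podCache PodCache, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
			if err != nil {
				klog.Errorf("full sync: listing pods failed: %v", err)
				continue
			}
			// Build the set of Pods that still exist in the cluster.
			live := make(map[types.UID]struct{}, len(pods.Items))
			for i := range pods.Items {
				live[pods.Items[i].UID] = struct{}{}
			}
			// Purge cached entries whose Pods are gone.
			for _, uid := range podCache.ListCachedPodUIDs() {
				if _, stillExists := live[uid]; !stillExists {
					klog.Infof("full sync: purging stale cache entry for pod UID %s", uid)
					podCache.DelPodByUID(uid)
				}
			}
		}
	}
}
```

Wiring this in would then be a single call at scheduler startup, e.g. go RunPeriodicFullSync(ctx, clientset, podManagerAdapter, time.Hour), where podManagerAdapter is whatever adapter exposes the existing cache through the hypothetical PodCache interface.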

Metadata

Labels: kind/bug (Something isn't working)