HAMi cannot allocate GPU correctly: new pod is scheduled to an already occupied card #1491

@clcc2019

Description

What happened:
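HAMi does not allocate GPU resources correctly. A node has 8 GPU cards and runs 4 pods, each allocated 2 cards. During an update, one pod was deleted and a new pod was created, and the new pod was scheduled onto GPU cards that were already occupied.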

What you expected to happen:
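The new pod should be allocated to GPU cards that are not occupied, and resource allocation should correctly reflect the node's current card usage.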

How to reproduce it (as minimally and precisely as possible):

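  1. On a node with 8 GPU cards, deploy 4 pods, each allocated 2 cards.
  2. Delete one of the pods.
  3. Create a new pod and observe which GPU cards it is allocated.
  4. The new pod is scheduled onto cards that are already occupied by other pods.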
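
To make step 4 easy to confirm, the check can be sketched as a small script. This is only an illustration, assuming each pod is meant to own its cards exclusively (as in the 2-cards-per-pod setup above). The `find_shared_gpus` helper, the pod names, and the GPU identifiers are all hypothetical; how the pod-to-card mapping is collected (for example from each pod's visible devices) is left to the reader.

```python
from collections import defaultdict


def find_shared_gpus(assignments):
    """Return {gpu_id: [pod, ...]} for every GPU assigned to more than one pod.

    `assignments` maps a pod name to the list of GPU identifiers it was given.
    """
    owners = defaultdict(list)
    for pod, gpus in assignments.items():
        for gpu in gpus:
            owners[gpu].append(pod)
    # Keep only GPUs that have more than one owner.
    return {gpu: sorted(pods) for gpu, pods in owners.items() if len(pods) > 1}


# Hypothetical state after step 3: pod-d (which held GPU-6 and GPU-7) was
# deleted, and its replacement landed on cards still held by pod-c.
assignments = {
    "pod-a": ["GPU-0", "GPU-1"],
    "pod-b": ["GPU-2", "GPU-3"],
    "pod-c": ["GPU-4", "GPU-5"],
    "pod-new": ["GPU-4", "GPU-5"],
}

conflicts = find_shared_gpus(assignments)
for gpu, pods in sorted(conflicts.items()):
    print(f"{gpu} is assigned to multiple pods: {', '.join(pods)}")
```

An empty result means no card is double-booked; any entry identifies the card and the pods that collide on it, which is the information worth attaching to this report.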

Anything else we need to know?:

  • Relevant logs and configuration can be provided as the issue is followed up.

Environment:

  • HAMi version:
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:

Labels: kind/bug
