Description
If a cluster has more than one type of NPU, e.g. both Ascend910B3 and Ascend910B4 in the same cluster, only workloads using one of the NPU types get scheduled. Pods requesting the other NPU type stay Pending forever.
Steps to reproduce the issue
- Prepare a cluster with more than one NPU type. In my case, I have two nodes:
  - NodeA: Ascend910B3 * 8
  - NodeB: Ascend910B4 * 8
  Use the Volcano scheduler from the master branch, with vNPU HAMi mode turned on according to this doc.
- Prepare two workloads:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: npu-test-deployment-A
spec:
  replicas: 1
  selector:
    matchLabels:
      app: npu-test
  template:
    metadata:
      labels:
        app: npu-test
    spec:
      containers:
        - name: finetune-training-container
          image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
          imagePullPolicy: Always
          args: [ "sleep", "infinity" ]
          resources:
            limits:
              huawei.com/Ascend910B3: 1
              huawei.com/Ascend910B3-memory: 32768
      restartPolicy: Always
      schedulerName: volcano

apiVersion: apps/v1
kind: Deployment
metadata:
  name: npu-test-deployment-B
spec:
  replicas: 1
  selector:
    matchLabels:
      app: npu-test
  template:
    metadata:
      labels:
        app: npu-test
    spec:
      containers:
        - name: finetune-training-container
          image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
          imagePullPolicy: Always
          args: [ "sleep", "infinity" ]
          resources:
            limits:
              huawei.com/Ascend910B4: 1
              huawei.com/Ascend910B4-memory: 32768
      restartPolicy: Always
      schedulerName: volcano
These two deployments have the same spec except for the card type.
- Apply these YAML files. Only npu-test-deployment-B's pod (which uses Ascend910B4) stays Pending forever.
Checking the corresponding podgroup's events shows these warnings:

Type     Reason         Age                     From     Message
----     ------         ----                    ----     -------
Normal   Unschedulable  24m (x2 over 24m)       volcano  resource in cluster is overused: overused huawei.com/Ascend910B4-memory
Warning  Unschedulable  4m39s (x1186 over 24m)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Describe the results you received and expected
Received: workloads using NPUs other than Ascend910B3 stay Pending forever.
Expected: the warning in the podgroup should not appear, since huawei.com/Ascend910B4-memory should be in the ignore list.
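To illustrate what I suspect is happening, here is a minimal Python sketch of an overuse check with an ignore list. This is not Volcano's actual code: the function name `overused_resources`, the capacity numbers, and the assumption that nodes do not advertise the `-memory` resource as allocatable capacity are all hypothetical, chosen only to show how a missing ignore-list entry for one card type could produce the "overused huawei.com/Ascend910B4-memory" event.

```python
def overused_resources(requested, capacity, ignored):
    """Return the resource names whose request exceeds cluster capacity.

    Names listed in `ignored` are skipped entirely (hypothetical ignore list).
    A resource missing from `capacity` is treated as having capacity 0.
    """
    return sorted(
        name
        for name, amount in requested.items()
        if name not in ignored and amount > capacity.get(name, 0)
    )


# Assumption: nodes advertise card counts but not the "-memory" resources.
capacity = {"huawei.com/Ascend910B3": 8, "huawei.com/Ascend910B4": 8}

# Request of npu-test-deployment-B, taken from the YAML in this issue.
request_b = {"huawei.com/Ascend910B4": 1, "huawei.com/Ascend910B4-memory": 32768}

# Hypothetical ignore list that only covers one card type.
ignored_partial = {"huawei.com/Ascend910B3-memory"}
# Hypothetical ignore list that covers both card types.
ignored_full = ignored_partial | {"huawei.com/Ascend910B4-memory"}

print(overused_resources(request_b, capacity, ignored_partial))
print(overused_resources(request_b, capacity, ignored_full))
```

With the partial ignore list the sketch reports `huawei.com/Ascend910B4-memory` as overused; with both memory resources ignored it reports nothing, which is the behavior I would expect from the scheduler.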
What version of Volcano are you using?
latest
Any other relevant information
No response