Labels: kind/bug (Something isn't working)
Description
What happened:
The node's allocatable resources do not show the virtualized GPU memory.
What you expected to happen:
The virtualized GPU memory resource should be displayed so that my pod can be scheduled onto a node; otherwise it stays Pending forever.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
hami-yaml:

```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-ascend
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "watch", "patch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-ascend
subjects:
  - kind: ServiceAccount
    name: hami-ascend
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: hami-ascend
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-ascend
  namespace: kube-system
  labels:
    app.kubernetes.io/component: "hami-ascend"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-ascend-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-ascend-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-ascend-device-plugin
      hami.io/webhook: ignore
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-ascend-device-plugin
        hami.io/webhook: ignore
    spec:
      priorityClassName: "system-node-critical"
      serviceAccountName: hami-ascend
      containers:
        - image: projecthami/ascend-device-plugin:v1.1.0
          imagePullPolicy: IfNotPresent
          name: device-plugin
          resources:
            requests:
              memory: 500Mi
              cpu: 500m
            limits:
              memory: 500Mi
              cpu: 500m
          args:
            - --config_file
            - /device-config.yaml
          securityContext:
            privileged: true
            readOnlyRootFilesystem: false
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: pod-resource
              mountPath: /var/lib/kubelet/pod-resources
            - name: hiai-driver
              mountPath: /usr/local/Ascend/driver
              readOnly: true
            - name: log-path
              mountPath: /var/log/mindx-dl/devicePlugin
            - name: tmp
              mountPath: /tmp
            - name: ascend-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
              readOnly: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: pod-resource
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: hiai-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: log-path
          hostPath:
            path: /var/log/mindx-dl/devicePlugin
            type: Directory
        - name: tmp
          hostPath:
            path: /tmp
        - name: ascend-config
          configMap:
            name: hami-scheduler-device
      nodeSelector:
        ascend: "on"
```
device-config.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  namespace: kube-system
data:
  device-config.yaml: |-
    vnpus:
      - chipName: 310P3
        commonWord: Ascend310P
        resourceName: huawei.com/Ascend310P
        resourceMemoryName: huawei.com/Ascend310P-memory
        memoryAllocatable: 21527
        memoryCapacity: 24576
        aiCore: 8
        aiCPU: 7
        templates:
          - name: vir02
            memory: 6144
            aiCore: 2
            aiCPU: 2
```
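Incidentally, the jump from 8 to 24 devices reported below is consistent with per-card slicing under the vir02 template. The arithmetic sketch here is my own assumption about how the count could arise, not HAMi's actual code:

```python
# Assumption (not HAMi source): each physical card is divided into as many
# vir02 slices as both its aiCore and aiCPU budgets allow.
ai_core, ai_cpu = 8, 7      # per card, from device-config.yaml above
tpl_core, tpl_cpu = 2, 2    # vir02 template
cards = 8                   # physical huawei.com/Ascend310P count on the node

# aiCPU is the limiting factor here: min(8 // 2, 7 // 2) = min(4, 3) = 3
slices_per_card = min(ai_core // tpl_core, ai_cpu // tpl_cpu)
print(cards * slices_per_card)
# → 24
```

This would explain why the advertised count is 24 rather than 32 (the aiCPU budget, not aiCore, limits the slices per card).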
After applying this, the node's available huawei.com/Ascend310P count changed from 8 to 24:
```yaml
allocatable:
  cpu: "96"
  ephemeral-storage: "1700179318837"
  huawei.com/Ascend310P: "8"
  hugepages-2Mi: "0"
  memory: 526690192Ki
  pods: "110"
capacity:
  cpu: "96"
  ephemeral-storage: 1844812632Ki
  huawei.com/Ascend310P: "8"
  hugepages-2Mi: "0"
  memory: 526792592Ki
  pods: "110"
```

changed to:

```yaml
allocatable:
  cpu: "96"
  ephemeral-storage: "1700179318837"
  huawei.com/Ascend310P: "24"
  hugepages-2Mi: "0"
  memory: 526690192Ki
  pods: "110"
capacity:
  cpu: "96"
  ephemeral-storage: 1844812632Ki
  huawei.com/Ascend310P: "24"
  hugepages-2Mi: "0"
  memory: 526792592Ki
  pods: "110"
```
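The gap is visible in the node status: only the card-count resource was ever registered. A minimal sketch (plain Python; resource names taken from device-config.yaml, allocatable entries copied from the node status with non-Ascend keys trimmed) of the check:

```python
# Allocatable map as reported by the node after the plugin restart.
allocatable = {
    "cpu": "96",
    "huawei.com/Ascend310P": "24",
    "memory": "526690192Ki",
    "pods": "110",
}

def missing_resources(alloc, required):
    """Return the extended resources the device plugin never registered."""
    return [name for name in required if name not in alloc]

# resourceName and resourceMemoryName from device-config.yaml.
required = ["huawei.com/Ascend310P", "huawei.com/Ascend310P-memory"]
print(missing_resources(allocatable, required))
# → ['huawei.com/Ascend310P-memory']
```

Any pod whose limits include a resource in that missing list is unschedulable, which matches the Pending behavior described below.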
But when I created a pod based on the example, the resource huawei.com/Ascend310P-memory was not registered in the kubelet, so all the Pods I create stay Pending. Did I do anything wrong?
```yaml
containers:
  - name: npu_pod
    ...
    resources:
      limits:
        huawei.com/Ascend910B: "1"
        # If no GPU memory is specified, the whole card is used by default
        huawei.com/Ascend910B-memory: "4096"
```
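For comparison, a request matching the device-config.yaml above would use the 310P resource names (a sketch only; it would still be unschedulable as long as huawei.com/Ascend310P-memory is missing from the node's allocatable):

```yaml
resources:
  limits:
    huawei.com/Ascend310P: "1"
    huawei.com/Ascend310P-memory: "6144"  # one vir02 template slice
```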
- The output of `nvidia-smi -a` on your host
- Your docker or containerd configuration file (e.g. `/etc/docker/daemon.json`)
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`)
- Any relevant kernel output lines from `dmesg`
Environment:
- HAMi version:
- nvidia driver or other AI device driver version:
- Docker version from `docker version`
- Docker command, image and tag used
- Kernel version from `uname -a`
- Others: