
The available resources of the node do not show the virtualized gpu memory. #1469

@rxy0210

Description

What happened:

The available resources of the node do not show the virtualized gpu memory.

What you expected to happen:

Display the virtualized GPU memory as an allocatable resource on the node so that my pod can be scheduled; otherwise it will stay Pending forever.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
hami-yaml

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-ascend
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "watch", "patch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-ascend
subjects:
  - kind: ServiceAccount
    name: hami-ascend
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: hami-ascend
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-ascend
  namespace: kube-system
  labels:
    app.kubernetes.io/component: "hami-ascend"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-ascend-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-ascend-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-ascend-device-plugin
      hami.io/webhook: ignore
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-ascend-device-plugin
        hami.io/webhook: ignore
    spec:
      priorityClassName: "system-node-critical"
      serviceAccountName: hami-ascend
      containers:
        - image: projecthami/ascend-device-plugin:v1.1.0
          imagePullPolicy: IfNotPresent
          name: device-plugin
          resources:
            requests:
              memory: 500Mi
              cpu: 500m
            limits:
              memory: 500Mi
              cpu: 500m
          args:
            - --config_file
            - /device-config.yaml
          securityContext:
            privileged: true
            readOnlyRootFilesystem: false
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: pod-resource
              mountPath: /var/lib/kubelet/pod-resources
            - name: hiai-driver
              mountPath: /usr/local/Ascend/driver
              readOnly: true
            - name: log-path
              mountPath: /var/log/mindx-dl/devicePlugin
            - name: tmp
              mountPath: /tmp
            - name: ascend-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
              readOnly: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: pod-resource
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: hiai-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: log-path
          hostPath:
            path: /var/log/mindx-dl/devicePlugin
            type: Directory
        - name: tmp
          hostPath:
            path: /tmp
        - name: ascend-config
          configMap:
            name: hami-scheduler-device
      nodeSelector:
        ascend: "on"

device-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  namespace: kube-system
data:
  device-config.yaml: |-
    vnpus:
    - chipName: 310P3
      commonWord: Ascend310P
      resourceName: huawei.com/Ascend310P
      resourceMemoryName: huawei.com/Ascend310P-memory
      memoryAllocatable: 21527
      memoryCapacity: 24576
      aiCore: 8
      aiCPU: 7
      templates:
        - name: vir02
          memory: 6144
          aiCore: 2
          aiCPU: 2

Then the available huawei.com/Ascend310P resources on the node changed from 8 to 24:

  allocatable:
    cpu: "96"
    ephemeral-storage: "1700179318837"
    huawei.com/Ascend310P: "8"
    hugepages-2Mi: "0"
    memory: 526690192Ki
    pods: "110"
  capacity:
    cpu: "96"
    ephemeral-storage: 1844812632Ki
    huawei.com/Ascend310P: "8"
    hugepages-2Mi: "0"
    memory: 526792592Ki
    pods: "110"
->
  allocatable:
    cpu: "96"
    ephemeral-storage: "1700179318837"
    huawei.com/Ascend310P: "24"
    hugepages-2Mi: "0"
    memory: 526690192Ki
    pods: "110"
  capacity:
    cpu: "96"
    ephemeral-storage: 1844812632Ki
    huawei.com/Ascend310P: "24"
    hugepages-2Mi: "0"
    memory: 526792592Ki
    pods: "110"
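Based on the resourceMemoryName defined in the ConfigMap above, I expected the allocatable list to also contain the virtualized memory as an extended resource next to the device count, roughly like this (the value here is only an illustrative placeholder):

  allocatable:
    huawei.com/Ascend310P: "24"
    huawei.com/Ascend310P-memory: "<total virtualized memory>"   # expected, but missing on my node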

But when I created a pod based on the example, the resource huawei.com/Ascend310P-memory was not registered with the kubelet. As a result, all the Pods I create stay Pending. Did I do anything wrong?

    containers:
    - name: npu-pod
      ...
      resources:
        limits:
          huawei.com/Ascend910B: "1"
          # If no memory is specified, the whole card is used by default
          huawei.com/Ascend910B-memory: "4096"
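
For reference, the device-config above only defines the 310P resource names (huawei.com/Ascend310P and huawei.com/Ascend310P-memory), so a pod meant to land on this node would have to request those names rather than the Ascend910B ones. A minimal sketch, assuming the scheduler is otherwise set up for the 310P (the pod name, container name, and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: ascend310p-test              # placeholder name
spec:
  containers:
  - name: npu-container              # placeholder name
    image: my-npu-image:latest       # placeholder image
    resources:
      limits:
        huawei.com/Ascend310P: "1"             # one virtualized 310P device
        huawei.com/Ascend310P-memory: "6144"   # matches the vir02 template memory in the ConfigMap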
  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:

Labels: kind/bug (Something isn't working)