GPU labeling fails on Nebius L40S clusters #7762

@zpoint

Description

nebius mk8s node-group create \
  --name "gpu-nodes" \
  --parent-id $NB_CLUSTER_ID \
  --fixed-node-count 5 \
  --template-resources-platform "gpu-l40s-a" \
  --template-resources-preset "1gpu-8vcpu-32gb" \
  --template-gpu-settings-drivers-preset cuda12.8 \
  --template-boot-disk-type network_ssd \
  --template-boot-disk-size-bytes 137438953472 \
  --template-cloud-init-user-data "$(cat <<EOF
users:
  - name: $NODE_USERNAME
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - $(cat ~/.ssh/id_ed25519.pub)
EOF
)" \
  --template-network-interfaces "[{\"public_ip_address\": {}, 
                                   \"subnet_id\": \"$NB_SUBNET_ID\"}]"

python -m sky.utils.kubernetes.gpu_labeler

Found 5 unlabeled GPU nodes in the cluster
Using nvidia RuntimeClass for GPU labeling.
Created GPU labeler job for node computeinstance-e00c2pvvejrgxgp35g
Created GPU labeler job for node computeinstance-e00n7hs0fjmqqhk0y3
Created GPU labeler job for node computeinstance-e00sfp1hy7zkhv83r4
Created GPU labeler job for node computeinstance-e00t1t95cx9drjzr6g
Created GPU labeler job for node computeinstance-e00zpg9ntzyyy5w2qg
Traceback (most recent call last):
  File "/home/buildkite/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/buildkite/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 256, in <module>
    main()
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 252, in main
    label(context=context, wait_for_completion=not args.async_completion)
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 147, in label
    success = wait_for_jobs_completion(jobs_to_node_names,
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 186, in wait_for_jobs_completion
    for event in w.stream(func=batch_v1.list_namespaced_job,
  File "/home/buildkite/miniconda3/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 202, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (504)
Reason: Timeout: Timeout: Too large resource version: 4400, current: 4395

❌ Error: GPU node labeling for SkyPilot failed
❌ Failed to create Nebius cluster
❌ Failed to create Nebius cluster. Exiting.
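
The 504 "Too large resource version" above is raised from the watch loop in wait_for_jobs_completion, after all five labeler jobs were created. It looks transient: the API server answering the watch appears to be a few resource versions behind (current 4395, requested 4400). One possible mitigation is to reopen the watch a few times instead of failing the whole run. The following is a minimal sketch of that idea and not the actual SkyPilot code. `stream_with_retry`, `FakeApiError`, and the retry parameters are hypothetical names introduced for illustration. The only thing assumed about the real exception is that kubernetes' ApiException exposes a `status` attribute.

```python
import time


class FakeApiError(Exception):
    """Hypothetical stand-in for kubernetes' ApiException (has .status)."""

    def __init__(self, status, reason=""):
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def stream_with_retry(open_stream, retryable_statuses=(410, 504),
                      max_retries=5, backoff_seconds=1.0, sleep=time.sleep):
    """Yield events from open_stream(), reopening it on transient errors.

    open_stream is a zero-argument callable that returns a fresh iterator
    of events. In the labeler this could be a lambda wrapping
    w.stream(func=batch_v1.list_namespaced_job, ...) (assumption).
    An exception is retried only if its .status is in retryable_statuses
    and the total retry budget is not exhausted; otherwise it is re-raised.
    """
    retries = 0
    while True:
        try:
            for event in open_stream():
                yield event
            return  # stream ended normally
        except Exception as exc:
            status = getattr(exc, "status", None)
            if status not in retryable_statuses or retries >= max_retries:
                raise
            retries += 1
            # Linear backoff: wait longer after each failed attempt.
            sleep(backoff_seconds * retries)
```

Note that a reopened watch may replay events that were already delivered, so the consumer should be idempotent, for example by tracking completed jobs in a set keyed by job name.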
