Description
nebius mk8s node-group create \
  --name "gpu-nodes" \
  --parent-id $NB_CLUSTER_ID \
  --fixed-node-count 5 \
  --template-resources-platform "gpu-l40s-a" \
  --template-resources-preset "1gpu-8vcpu-32gb" \
  --template-gpu-settings-drivers-preset cuda12.8 \
  --template-boot-disk-type network_ssd \
  --template-boot-disk-size-bytes 137438953472 \
  --template-cloud-init-user-data "$(cat <<EOF
users:
  - name: $NODE_USERNAME
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - $(cat ~/.ssh/id_ed25519.pub)
EOF
)" \
  --template-network-interfaces "[{\"public_ip_address\": {}, \"subnet_id\": \"$NB_SUBNET_ID\"}]"

python -m sky.utils.kubernetes.gpu_labeler
Found 5 unlabeled GPU nodes in the cluster
Using nvidia RuntimeClass for GPU labeling.
Created GPU labeler job for node computeinstance-e00c2pvvejrgxgp35g
Created GPU labeler job for node computeinstance-e00n7hs0fjmqqhk0y3
Created GPU labeler job for node computeinstance-e00sfp1hy7zkhv83r4
Created GPU labeler job for node computeinstance-e00t1t95cx9drjzr6g
Created GPU labeler job for node computeinstance-e00zpg9ntzyyy5w2qg
Traceback (most recent call last):
  File "/home/buildkite/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/buildkite/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 256, in <module>
    main()
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 252, in main
    label(context=context, wait_for_completion=not args.async_completion)
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 147, in label
    success = wait_for_jobs_completion(jobs_to_node_names,
  File "/home/buildkite/sky_workdir/skypilot/sky/utils/kubernetes/gpu_labeler.py", line 186, in wait_for_jobs_completion
    for event in w.stream(func=batch_v1.list_namespaced_job,
  File "/home/buildkite/miniconda3/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 202, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (504)
Reason: Timeout: Timeout: Too large resource version: 4400, current: 4395
❌ Error: GPU node labeling for SkyPilot failed
❌ Failed to create Nebius cluster
❌ Failed to create Nebius cluster. Exiting.

kevinmingtarja
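
The 504 "Too large resource version" error in the traceback above is usually transient: the API server that served the watch had not yet caught up to the requested resource version (4395 vs. 4400 here). One possible mitigation, sketched below as an assumption and not as the fix adopted in `gpu_labeler.py`, is to reopen the watch stream when this specific error occurs. The names `stream_with_retry`, `open_stream`, and `is_resource_version_too_large` are hypothetical; the check relies only on the `status` and `reason` attributes that `kubernetes.client.exceptions.ApiException` exposes.

```python
import time


def is_resource_version_too_large(exc):
    """Return True for the 504 'Too large resource version' error."""
    status = getattr(exc, "status", None)
    reason = str(getattr(exc, "reason", ""))
    return status == 504 and "Too large resource version" in reason


def stream_with_retry(open_stream, max_attempts=5, delay_seconds=1.0,
                      sleep=time.sleep):
    """Yield events from open_stream(), reopening it on the transient 504.

    open_stream is a hypothetical zero-argument callable that returns an
    iterator of events, for example a lambda wrapping
    watch.Watch().stream(func=batch_v1.list_namespaced_job, ...).
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            for event in open_stream():
                yield event
            return  # the stream ended normally
        except Exception as exc:
            # Re-raise anything that is not the transient 504, or give up
            # once the attempt budget is exhausted.
            if not is_resource_version_too_large(exc) or attempt >= max_attempts:
                raise
            sleep(delay_seconds * attempt)  # linear backoff before reopening
```

Reopening a watch can deliver events that were already seen, so a caller such as `wait_for_jobs_completion` would need to process events idempotently (for example, by tracking completed jobs in a set).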