
Commit 713cfa7

Merge pull request #18 from oracle-quickstart/known_issues_gpu4_issue
Adding BM.GPU4.8 issue to known issues.
2 parents: 4769aa7 + a3d5d0b

1 file changed: +17 -0 lines changed


docs/known_issues/README.md

Lines changed: 17 additions & 0 deletions
@@ -6,3 +6,20 @@ Place to record issues that arise and their corresponding workarounds.
1. Check your permissions and verify that they match exactly as shown here: [IAM Policies](../iam_policies)
2. Did you choose `*.nip.io` as your domain name when setting up Corrino? If so, this is an untrusted domain and will be blocked when you are behind a VPN. Either deploy Corrino via a custom domain or access your `*.nip.io` Corrino domain from outside the VPN.

## Shape BM.GPU4.8 Cannot Schedule Recipes

Currently, there is an Oracle Kubernetes Engine (OKE) bug with the `BM.GPU4.8` shape. Since the toolkit runs on top of an OKE cluster, this shape cannot be used with the toolkit until the issue is resolved by OKE. We have diagnosed and reported the issue and are following up with the OKE team for a resolution. The error for this issue presents as shown below.

The following `kubectl` commands can be used to diagnose pods in this state:
```bash
kubectl get pods                 # find the name of the affected pod
kubectl describe pod <pod-name>  # inspect the pod's status and its Events section
```
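If the affected pods are not in the default namespace, the same commands can be scoped with `-n`. As a minimal sketch (the `gpu-operator` namespace is an assumption inferred from the sandbox name in the sample event output below, not something stated explicitly in this document):

```bash
# Assumption: the node-feature-discovery pods run in the "gpu-operator" namespace,
# inferred from the sandbox name in the sample event output below.
kubectl get pods -n gpu-operator
kubectl describe pod <pod-name> -n gpu-operator
```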
This will output all information about the pod. In the `Events:` section (at the very bottom) you will see information like this:

```
Pod info: nvidia-dcgm-node-feature-discovery-worker always gets stuck in container creating with warning / error like:

Warning  FailedCreatePodSandBox  12s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_gpu-operator-1738967226-node-feature-discovery-worker-dzwht_gpu-operator_06605d81-8dc8-48db-a9a9-b393e8bcd068_0
```

Here, the nvidia-dcgm-node-feature-discovery-worker pod is stuck indefinitely in a "ContainerCreating" / "CrashLoopBackOff" cycle.
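As a rough sketch of how to confirm this cycle (assumptions: the pods live in the `gpu-operator` namespace as inferred above, and your `kubectl` context points at the affected OKE cluster):

```bash
# Watch the pods cycle between ContainerCreating and CrashLoopBackOff.
kubectl get pods -n gpu-operator -w

# List only the sandbox-creation failures recorded as events in that namespace.
kubectl get events -n gpu-operator --field-selector reason=FailedCreatePodSandBox
```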
