docs: Adds Qwen3-235B-A22B recipes for both precisions #4179
base: main
Conversation
Signed-off-by: Elnifio <[email protected]>
👋 Hi Elnifio! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough
Introduces Kubernetes manifests for deploying two Qwen model variants with disaggregated and aggregated inference strategies. Adds PersistentVolumeClaims for model caching, model download Jobs, DynamoGraphDeployment configurations with TensorRT-LLM backends, and aiperf benchmarking Jobs supporting both parallel prefill/decode and single-node inference patterns.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
♻️ Duplicate comments (2)
recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-download.yaml (1)
15-44: Add securityContext to restrict privilege escalation and enforce non-root execution.
Same security concern as the model-download job in recipes/qwen3-235b-a22b/model-cache/model-download.yaml. The container runs as root without explicit restrictions on privilege escalation (CKV_K8S_20, CKV_K8S_23). Apply the same securityContext fix:

```diff
 spec:
   restartPolicy: Never
   containers:
     - name: model-download
       image: python:3.10-slim
       command: ["sh", "-c"]
+      securityContext:
+        allowPrivilegeEscalation: false
+        runAsNonRoot: true
+        runAsUser: 1000
       envFrom:
         - secretRef:
             name: hf-token-secret
```

recipes/qwen3-235b-a22b/trtllm/agg/perf.yaml (1)
141-142: Eliminate privileged mode and run as non-root; justify if strictly necessary.
Same critical security concern as the disaggregated perf Job (file recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/perf.yaml). The container runs with `privileged: true` (CKV_K8S_16 – HIGH), which is a significant security risk. Apply the same fix:

```diff
 securityContext:
-  privileged: true
+  allowPrivilegeEscalation: false
+  runAsNonRoot: true
+  runAsUser: 1000
+  capabilities:
+    drop:
+      - ALL
```

Verify aiperf functionality with these restrictions and adjust capabilities only if specific syscalls fail.
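If something does break after dropping ALL, the least-privilege pattern is to add back only the specific capability whose absence caused the failure, rather than reverting to privileged mode. A hypothetical sketch; NET_RAW is an illustrative example here, not a confirmed aiperf requirement:

```yaml
securityContext:
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop:
      - ALL
    add:
      - NET_RAW  # illustrative only: add a capability back solely when a syscall fails without it
```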
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-cache.yaml (1 hunks)
- recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-download.yaml (1 hunks)
- recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml (1 hunks)
- recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/perf.yaml (1 hunks)
- recipes/qwen3-235b-a22b/model-cache/model-cache.yaml (1 hunks)
- recipes/qwen3-235b-a22b/model-cache/model-download.yaml (1 hunks)
- recipes/qwen3-235b-a22b/trtllm/agg/deploy.yaml (1 hunks)
- recipes/qwen3-235b-a22b/trtllm/agg/perf.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.
📚 Learning: 2025-10-24T04:21:08.751Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 3858
File: recipes/deepseek-r1/model-cache/model-download.yaml:18-32
Timestamp: 2025-10-24T04:21:08.751Z
Learning: In the recipes directory structure, model-specific recipes (e.g., recipes/deepseek-r1/, recipes/llama-3-70b/) contain hardcoded model names and revisions in their Kubernetes manifests (like model-download.yaml). Each recipe directory is deployment-specific and self-contained, so hardcoding model-specific values is the intended design pattern.
Applied to files:
- recipes/qwen3-235b-a22b/model-cache/model-cache.yaml
- recipes/qwen3-235b-a22b/model-cache/model-download.yaml
- recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-cache.yaml
- recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-download.yaml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.
Applied to files:
- recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml
- recipes/qwen3-235b-a22b/trtllm/agg/deploy.yaml
🪛 Checkov (3.2.334)
recipes/qwen3-235b-a22b/model-cache/model-download.yaml
[medium] 3-44: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 3-44: Minimize the admission of root containers
(CKV_K8S_23)
recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/perf.yaml
[medium] 3-153: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 3-153: Container should not be privileged
(CKV_K8S_16)
[medium] 3-153: Minimize the admission of root containers
(CKV_K8S_23)
recipes/qwen3-235b-a22b-instruct-2507-fp8/model-cache/model-download.yaml
[medium] 3-44: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 3-44: Minimize the admission of root containers
(CKV_K8S_23)
recipes/qwen3-235b-a22b/trtllm/agg/perf.yaml
[medium] 3-153: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[high] 3-153: Container should not be privileged
(CKV_K8S_16)
[medium] 3-153: Minimize the admission of root containers
(CKV_K8S_23)
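If one of these findings is reviewed and deliberately accepted, Checkov also supports suppressing it per-resource instead of globally, via a `checkov.io/skip` annotation on the Kubernetes object. A hedged sketch; the resource name and justification text below are illustrative, not taken from these recipes:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: perf  # illustrative name
  annotations:
    # Suppress CKV_K8S_16 for this one resource, recording the justification inline
    checkov.io/skip1: CKV_K8S_16=documented exception, see security review
```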
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4179/merge) by Elnifio.
recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml
[error] 1-1: Trailing whitespace detected and fixed by pre-commit hook (trailing-whitespace) during pre-commit run.
recipes/qwen3-235b-a22b/trtllm/agg/deploy.yaml
[error] 1-1: Trailing whitespace detected and fixed by pre-commit hook (trailing-whitespace) during pre-commit run.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml (1)
8-59: Engine configuration strategy for disaggregation is well-tuned.The prefill.yaml and decode.yaml configs show thoughtful tuning for the disaggregated inference pattern:
- Prefill worker: Low batch size (2), tensor parallelism (2), optimized for throughput of short batches
- Decode worker: Higher batch size (512), tensor parallelism (4), optimized for sustained token generation
GPU allocation aligns with workload characteristics. The fp8 dtype and memory settings are appropriate for the Qwen3-235B model size.
```yaml
spec:
  backendFramework: trtllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      dynamoNamespace: qwen3-235b-a22b-disagg
      replicas: 1
      extraPodSpec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: nvidia.com/dynamo-graph-deployment-name
                      operator: In
                      values:
                        - qwen3-235b-a22b-disagg-frontend
                topologyKey: kubernetes.io/hostname
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
          args:
            - python3 -m dynamo.frontend --router-mode kv --http-port 8000
          command:
            - /bin/sh
            - -c
    TRTLLMPrefillWorker:
      componentType: worker
      subComponentType: prefill
      dynamoNamespace: qwen3-235b-a22b-disagg
      envFromSecret: hf-token-secret
      replicas: 1
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 256Gi
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.present
                      operator: In
                      values:
                        - "true"
        mainContainer:
          env:
            - name: MODEL_PATH
              value: /mnt/model-cache/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507-FP8/snapshots/e156cb4efae43fbee1a1ab073f946a1377e6b969
            - name: ENGINE_ARGS
              value: /engine_configs/prefill.yaml
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
          workingDir: /workspace/components/backends/trtllm
          command:
            - /bin/sh
            - -c
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_PATH}" \
                --served-model-name "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --disaggregation-mode prefill \
                --disaggregation-strategy prefill_first
          volumeMounts:
            - name: prefill-config
              mountPath: /engine_configs
            - name: model-cache
              mountPath: /mnt/model-cache
        volumes:
          - name: prefill-config
            configMap:
              name: prefill-config
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache
    TRTLLMDecodeWorker:
      componentType: worker
      subComponentType: decode
      dynamoNamespace: qwen3-235b-a22b-disagg
      envFromSecret: hf-token-secret
      replicas: 1
      resources:
        limits:
          gpu: "4"
      sharedMemory:
        size: 256Gi
      extraPodSpec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: nvidia.com/gpu.present
                      operator: In
                      values:
                        - "true"
        mainContainer:
          env:
            - name: MODEL_PATH
              value: /mnt/model-cache/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507-FP8/snapshots/e156cb4efae43fbee1a1ab073f946a1377e6b969
            - name: ENGINE_ARGS
              value: /engine_configs/decode.yaml
          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
          workingDir: /workspace/components/backends/trtllm
          command:
            - /bin/sh
            - -c
          args:
            - |
              python3 -m dynamo.trtllm \
                --model-path "${MODEL_PATH}" \
                --served-model-name "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8" \
                --extra-engine-args "${ENGINE_ARGS}" \
                --disaggregation-mode decode \
                --disaggregation-strategy prefill_first
          volumeMounts:
            - name: decode-config
              mountPath: /engine_configs
            - name: model-cache
              mountPath: /mnt/model-cache
        volumes:
          - name: decode-config
            configMap:
              name: decode-config
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache
```
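The long MODEL_PATH values above follow the Hugging Face hub cache layout under HF_HOME: the "/" in a repo id becomes "--", prefixed with "models--", with snapshots keyed by the pinned commit revision. A minimal sketch of that mapping (the helper name is ours, not part of the recipes):

```python
def hf_cache_model_path(hf_home: str, repo_id: str, revision: str) -> str:
    """Build the snapshot path the deploy manifests point MODEL_PATH at,
    following the Hugging Face hub cache directory convention."""
    repo_dir = "models--" + repo_id.replace("/", "--")
    return f"{hf_home}/hub/{repo_dir}/snapshots/{revision}"


print(hf_cache_model_path(
    "/mnt/model-cache",
    "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    "e156cb4efae43fbee1a1ab073f946a1377e6b969",
))
```

This reproduces the exact path used by both workers, which is why the recipes pin MODEL_REVISION in the download Job: the snapshot directory name is the revision hash.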
Fix trailing whitespace detected by pre-commit.
The pre-commit hook reported trailing whitespace. Re-run locally to verify:
```shell
pre-commit run --all-files
```

🤖 Prompt for AI Agents
In recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/deploy.yaml around
lines 65 to 196 there are trailing whitespace characters that fail pre-commit;
remove any trailing spaces at the ends of lines in that range (and any
accidental blank-line spaces), save the file, re-run pre-commit (pre-commit run
--all-files) to verify the fix, and commit the cleaned file.
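The check the hook performs is simple enough to reproduce locally without pre-commit; a small sketch of the same detection, useful for confirming which lines are affected before committing (the function name is ours):

```python
def trailing_whitespace_lines(text: str) -> list[int]:
    """Return 1-based line numbers that end in spaces or tabs,
    i.e. the lines the trailing-whitespace hook would rewrite."""
    return [
        i
        for i, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


sample = 'resources:\n  limits: \n    gpu: "4"\t\n'
print(trailing_whitespace_lines(sample))  # → [2, 3]
```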
```yaml
              - qwen3-235b-a22b-disagg-frontend
          topologyKey: kubernetes.io/hostname
    mainContainer:
      image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
```
Replace placeholder image tags with a concrete version.
The image tag my-tag is a placeholder on lines 87, 120, and 172, causing ImagePullBackOff errors. Use an actual image version.
```diff
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:24.10
```

Update all three occurrences (Frontend, TRTLLMPrefillWorker, and TRTLLMDecodeWorker).
Also applies to: 120-120, 172-172
```yaml
      securityContext:
        privileged: true
```
Eliminate privileged mode and run as non-root; justify if strictly necessary.
The container runs with privileged: true (CKV_K8S_16 – HIGH severity), which grants full host access. This is a significant security risk for a benchmarking Job.
If privileged mode is required for aiperf monitoring, explicitly document why and consider using specific Linux capabilities instead. If not strictly necessary, remove privileged mode:
```diff
 securityContext:
-  privileged: true
+  allowPrivilegeEscalation: false
+  runAsNonRoot: true
+  runAsUser: 1000
+  capabilities:
+    drop:
+      - ALL
```

Verify aiperf functionality with these restrictions and adjust capabilities only if specific syscalls fail.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 securityContext:
-  privileged: true
+  allowPrivilegeEscalation: false
+  runAsNonRoot: true
+  runAsUser: 1000
+  capabilities:
+    drop:
+      - ALL
```
🤖 Prompt for AI Agents
In recipes/qwen3-235b-a22b-instruct-2507-fp8/trtllm/disagg/perf.yaml around
lines 141-142, remove the broad securityContext: privileged: true and instead
run the container as non-root and least-privileged: set
securityContext.runAsNonRoot: true and a specific non-root user uid (e.g.,
runAsUser: 1000), drop all unnecessary capabilities and only add explicit ones
needed for aiperf (addCapabilities: [CAP_NET_RAW] etc.) if specific syscalls
fail, and document the exact reason and capability justification in a comment;
only keep privileged: true after explicit justification and approval and verify
aiperf works under the tightened security context.
```yaml
spec:
  restartPolicy: Never
  containers:
    - name: model-download
      image: python:3.10-slim
      command: ["sh", "-c"]
      envFrom:
        - secretRef:
            name: hf-token-secret
      env:
        - name: MODEL_NAME
          value: Qwen/Qwen3-235B-A22B
        - name: HF_HOME
          value: /model-store
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "1"
        - name: MODEL_REVISION
          value: 8efa61729e24bd65b1d152b5ab5409052aa80e65
      args:
        - |
          set -eux
          pip install --no-cache-dir huggingface_hub hf_transfer
          hf download $MODEL_NAME --revision $MODEL_REVISION
      volumeMounts:
        - name: model-cache
          mountPath: /model-store
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache
```
Add securityContext to restrict privilege escalation and enforce non-root execution.
The container runs as root without explicit restrictions on privilege escalation. Per Checkov findings (CKV_K8S_20, CKV_K8S_23), add a securityContext to enforce security best practices.
Apply this diff to add a restrictive securityContext:
```diff
 spec:
   restartPolicy: Never
   containers:
     - name: model-download
       image: python:3.10-slim
       command: ["sh", "-c"]
+      securityContext:
+        allowPrivilegeEscalation: false
+        runAsNonRoot: true
+        runAsUser: 1000
       envFrom:
         - secretRef:
             name: hf-token-secret
```

Note: Ensure the container image supports running as user 1000, or adjust to an appropriate non-root UID.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 spec:
   restartPolicy: Never
   containers:
     - name: model-download
       image: python:3.10-slim
       command: ["sh", "-c"]
+      securityContext:
+        allowPrivilegeEscalation: false
+        runAsNonRoot: true
+        runAsUser: 1000
       envFrom:
         - secretRef:
             name: hf-token-secret
       env:
         - name: MODEL_NAME
           value: Qwen/Qwen3-235B-A22B
         - name: HF_HOME
           value: /model-store
         - name: HF_HUB_ENABLE_HF_TRANSFER
           value: "1"
         - name: MODEL_REVISION
           value: 8efa61729e24bd65b1d152b5ab5409052aa80e65
       args:
         - |
           set -eux
           pip install --no-cache-dir huggingface_hub hf_transfer
           hf download $MODEL_NAME --revision $MODEL_REVISION
       volumeMounts:
         - name: model-cache
           mountPath: /model-store
   volumes:
     - name: model-cache
       persistentVolumeClaim:
         claimName: model-cache
```
🧰 Tools
🪛 Checkov (3.2.334)
[medium] 3-44: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 3-44: Minimize the admission of root containers
(CKV_K8S_23)
```yaml
              - qwen3-235b-a22b-agg-frontend
          topologyKey: kubernetes.io/hostname
    mainContainer:
      image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
```
Replace placeholder image tag with a concrete version.
The image tag my-tag is a placeholder and will cause ImagePullBackOff errors at deployment time. Use an actual image version.
```diff
-          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
+          image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:24.10
```

Verify the correct image tag with your registry or release notes and update both Frontend (line 58) and TrtllmWorker (line 95) containers.
Also applies to: 95-95
🤖 Prompt for AI Agents
In recipes/qwen3-235b-a22b/trtllm/agg/deploy.yaml around lines 58 and 95, the
container image tag "my-tag" is a placeholder and must be replaced with a
concrete, released image tag; update the image fields for the Frontend (line 58)
and TrtllmWorker (line 95) containers to use the verified registry tag (e.g., a
semver or digest) you confirmed from the NVIDIA registry or release notes,
ensuring both occurrences match the exact published version.
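A step further than a released tag is pinning by digest, which makes the manifest immune to a tag being re-pushed with different contents. A hedged sketch; the digest placeholder below must be resolved from the registry, it is not a real published value:

```yaml
    mainContainer:
      # Pin by immutable digest so the deployed image cannot change under a mutable tag.
      # The sha256 value is a placeholder; resolve the real digest from the NVIDIA registry.
      image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime@sha256:<digest-from-registry>
```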
```yaml
      resources:
        limits:
          gpu: "4"
        requests:
          gpu: "4"
```
Fix trailing whitespace detected by pre-commit.
The pre-commit hook reported trailing whitespace that was auto-fixed. Ensure no trailing spaces remain on lines with GPU resource limits and other specs.
Re-run pre-commit locally to verify all trailing whitespace is removed:
```shell
pre-commit run --all-files
```

🤖 Prompt for AI Agents
In recipes/qwen3-235b-a22b/trtllm/agg/deploy.yaml around lines 110 to 114, there
are trailing whitespace characters on the GPU resource lines; remove any
trailing spaces at the ends of the affected lines (e.g., after "gpu: \"4\"" and
any other spec lines), save the file, and re-run pre-commit locally (pre-commit
run --all-files) to verify all trailing whitespace warnings are resolved.
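For reference, the trailing-whitespace hook in question is the standard one from the pre-commit-hooks project; a typical `.pre-commit-config.yaml` entry looks like the sketch below (the rev shown is illustrative, use whatever revision this repo actually pins):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0  # illustrative pin; match the repo's actual configuration
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
```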
Signed-off-by: Elnifio <[email protected]>
Force-pushed from 1095b8f to d289661
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit