Skip to content

Conversation

@tzulingk
Copy link
Contributor

@tzulingk tzulingk commented Nov 9, 2025

Overview:

Fix DeepEP package discovery issues in fault tolerance testing with local vLLM images.

This PR addresses the bug discovered in DIS-994 where Dynamo/vLLM couldn't find deep-ep packages but could find pplx. The updated Dockerfile includes comprehensive path fixing logic for editable Python packages.

Details:

  • Updated tests/fault_tolerance/deploy/container/Dockerfile.local_vllm with fixes for DeepEP package discovery
  • Root Cause: When copying vLLM's Python packages from a pre-built image, the .pth files and finder modules contained hardcoded paths (/vllm-workspace) that didn't match the new location in the Dynamo container (/opt/vllm)
  • Solution: Added comprehensive path fixing logic (lines 105-113) that:
    • Fixes ALL DeepEP .pth files using version-agnostic pattern matching (since version includes git hash)
    • Fixes DeepEP finder modules (__editable___deep_ep*_finder.py) that contain hardcoded MAPPING paths
    • Performs a catch-all fix for any remaining .pth files that reference the old path

This ensures that DeepEP packages are properly discoverable when using local/pre-built vLLM images with Dynamo.

Where should the reviewer start?

  • tests/fault_tolerance/deploy/container/Dockerfile.local_vllm - Lines 104-113: The critical path fixing logic that resolves DIS-994
    • Note the use of find with regex patterns to handle version-agnostic matching for DeepEP packages
    • The finder modules fix is essential as they contain the actual hardcoded MAPPING paths
  • Lines 96-102: Context on how vLLM packages are copied from the source image
  • Lines 116-122: LD_LIBRARY_PATH configuration ensuring libraries are found at runtime

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

DIS-994

Summary by CodeRabbit

  • Chores
    • Enhanced container deployment configuration with more robust path handling
    • Improved version-agnostic file management for better compatibility across deployments
    • Added documentation to deployment setup process

@github-actions github-actions bot added the fix label Nov 9, 2025
@tzulingk tzulingk enabled auto-merge (squash) November 9, 2025 05:16
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 9, 2025

Walkthrough

The Dockerfile for local VLLM deployment is updated to replace hardcoded, version-specific DeepEP path handling with version-agnostic pattern-based find-and-sed operations. The changes update workspace path references from /vllm-workspace to /opt/vllm across multiple file types using wildcard matching instead of explicit version strings.

Changes

Cohort / File(s) Change Summary
Dockerfile path normalization
tests/fault_tolerance/deploy/container/Dockerfile.local_vllm
Replaces targeted __editable__.deep_ep-0.0.1.pth file handling with version-agnostic find operations: updates all __editable__.deep_ep*.pth files and __editable___deep_ep*_finder.py modules to replace /vllm-workspace with /opt/vllm. Adds fallback find for remaining .pth files. Includes documentation comments.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify wildcard patterns (*) correctly match intended file naming conventions
  • Ensure sed substitution syntax is correct for shell execution
  • Confirm fallback find operation doesn't cause unintended overwrites

Poem

🐰 Paths once fixed with version's name,
Now patterns broad do stake their claim,
From workspace deep to vllm's grace,
A rabbit finds each hiding place!
Version-agnostic, wild and free,
The Dockerfile hops happily!

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix: Fix the wrong path of deep_ep kernel' accurately describes the main change—fixing an incorrect path for deep_ep packages.
Description check ✅ Passed The PR description fully complies with the template, including all required sections: Overview, Details, Where should the reviewer start, and Related Issues.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/fault_tolerance/deploy/container/Dockerfile.local_vllm (1)

109-109: Minor: Inconsistent sed quoting style between DeepEP .pth and finder module fixes.

Line 109 uses single quotes ('s|/vllm-workspace|/opt/vllm|g'), while line 111 uses double quotes ("s|'/vllm-workspace|'/opt/vllm|g"). The double quotes at line 111 are necessary because the pattern includes a literal single quote ('), but for consistency and clarity, consider standardizing the approach across both lines.

- find ${VIRTUAL_ENV}/lib/python${PYTHON_VERSION}/site-packages -name "__editable__.deep_ep*.pth" -exec sed -i 's|/vllm-workspace|/opt/vllm|g' {} \; && \
-    # Fix DeepEP finder modules (these contain the actual hardcoded MAPPING paths)
-    find ${VIRTUAL_ENV}/lib/python${PYTHON_VERSION}/site-packages -name "__editable___deep_ep*_finder.py" -exec sed -i "s|'/vllm-workspace|'/opt/vllm|g" {} \; && \
+ find ${VIRTUAL_ENV}/lib/python${PYTHON_VERSION}/site-packages -name "__editable__.deep_ep*.pth" -exec sed -i "s|/vllm-workspace|/opt/vllm|g" {} \; && \
+    # Fix DeepEP finder modules (these contain the actual hardcoded MAPPING paths)
+    find ${VIRTUAL_ENV}/lib/python${PYTHON_VERSION}/site-packages -name "__editable___deep_ep*_finder.py" -exec sed -i "s|'/vllm-workspace|'/opt/vllm|g" {} \; && \

Also applies to: 111-111

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09b26bf and 1e70735.

📒 Files selected for processing (1)
  • tests/fault_tolerance/deploy/container/Dockerfile.local_vllm (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 1444
File: tests/fault_tolerance/scenarios.py:57-57
Timestamp: 2025-07-01T15:39:56.789Z
Learning: The fault tolerance tests in tests/fault_tolerance/ are designed to run only in the mounted container environment, so hardcoded absolute paths with `/workspace/` prefix are intentional and should not be changed to relative paths.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
📚 Learning: 2025-07-01T15:39:56.789Z
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 1444
File: tests/fault_tolerance/scenarios.py:57-57
Timestamp: 2025-07-01T15:39:56.789Z
Learning: The fault tolerance tests in tests/fault_tolerance/ are designed to run only in the mounted container environment, so hardcoded absolute paths with `/workspace/` prefix are intentional and should not be changed to relative paths.

Applied to files:

  • tests/fault_tolerance/deploy/container/Dockerfile.local_vllm
📚 Learning: 2025-05-28T22:54:46.875Z
Learnt from: grahamking
Repo: ai-dynamo/dynamo PR: 1177
File: container/Dockerfile.vllm:102-105
Timestamp: 2025-05-28T22:54:46.875Z
Learning: In Dockerfiles, when appending to environment variables that may not exist in the base image, Docker validation will fail if you reference undefined variables with ${VARIABLE} syntax. In such cases, setting the environment variable directly (e.g., ENV CPATH=/usr/include) rather than appending is the appropriate approach.

Applied to files:

  • tests/fault_tolerance/deploy/container/Dockerfile.local_vllm
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: operator (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
tests/fault_tolerance/deploy/container/Dockerfile.local_vllm (1)

104-113: ✓ Path-fixing logic correctly handles version-agnostic patterns.

The transition from explicit version strings to wildcard-based find-and-sed operations is sound. Using __editable__.deep_ep*.pth instead of hardcoding the git-hash-containing version makes the logic robust and maintainable. The three-tier approach (DeepEP .pth files → finder modules → catch-all) is sensible.

The sed delimiter choice (|) is appropriate for filesystem paths. Error handling via implicit && chaining is acceptable here since find doesn't error on empty matches, and any actual failures will halt the build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants