feat: Update test_vllm_health_check_active to make it stable #4200
base: main
Conversation
Signed-off-by: [email protected] <[email protected]>
Walkthrough
A test for vLLM health checks was re-enabled and refactored to properly track the engine process lifecycle, verify error handling after engine termination, and add robust waiting mechanisms to confirm worker process shutdown.
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/fault_tolerance/test_vllm_health_check.py (1)
1-1: Fix formatting with black.
The pre-commit hook indicates that the file needs to be formatted with black. Run the following command to format the file before merging:
black tests/fault_tolerance/test_vllm_health_check.py
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/fault_tolerance/test_vllm_health_check.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 1890
File: examples/vllm/deploy/agg.yaml:63-70
Timestamp: 2025-07-14T23:01:16.218Z
Learning: In vLLM worker deployments, grep-based log checks for "VllmWorker.*has been initialized" are appropriate for readiness probes to verify worker startup, but should not be used for liveness probes which need to detect ongoing worker health.
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
🧬 Code graph analysis (1)
tests/fault_tolerance/test_vllm_health_check.py (2)
tests/kvbm_integration/test_cuda_graph.py (1)
send_completion_request (111-141)
tests/utils/managed_process.py (1)
is_running (549-553)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4200/merge) by tzulingk.
tests/fault_tolerance/test_vllm_health_check.py
[error] 1-1: Pre-commit hook 'black' reformatted the file. Hooks modified the file and the pre-commit run failed. Command: 'pre-commit run --show-diff-on-failure --color=always --all-files'.
🪛 Ruff (0.14.3)
tests/fault_tolerance/test_vllm_health_check.py
210-210: Do not catch blind exception: Exception
(BLE001)
211-211: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
tests/fault_tolerance/test_vllm_health_check.py (2)
214-226: LGTM: Correctly verifies failure behavior after engine termination.
The logic properly validates that the request fails after killing the vLLM engine by accepting either:
- An error status code (>= 400), or
- A RequestException
This is the correct approach for verifying the EngineDeadError condition.
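That accept-either-failure-mode check can be sketched as a small helper. This uses the stdlib `urllib` rather than the test's actual HTTP client, so the function name, URL shape, and payload here are illustrative, not the test's real code:

```python
import json
import urllib.error
import urllib.request


def engine_request_failed(url: str, payload: dict, timeout: float = 5.0) -> bool:
    """Return True if a completion request fails the way a dead engine should:
    either an HTTP error status (>= 400) or a connection-level exception."""
    data = json.dumps(payload).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status >= 400
    except urllib.error.HTTPError:
        return True  # urllib raises HTTPError for 4xx/5xx responses
    except OSError:
        return True  # URLError/refused/reset/timeout once the engine is gone
```

A plain 200 response would return False, which is exactly the case the test treats as a failure of the fault-tolerance path.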
227-241: LGTM: Robust polling loop with appropriate timeout.
The polling loop correctly uses the while-else pattern to wait for the worker to stop after the EngineDeadError condition. The 60-second timeout with 0.5-second intervals provides adequate time for graceful shutdown while maintaining quick feedback.
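The while-else polling pattern described above looks roughly like this (a generic sketch, not the test's exact code; `is_running` stands in for the managed-process check):

```python
import time


def wait_for_worker_stop(is_running, timeout: float = 60.0, interval: float = 0.5) -> None:
    """Poll is_running() until the worker stops, failing after `timeout` seconds.

    The else branch of a while loop runs only when the loop finishes without
    a break, i.e. when the deadline expired with the worker still running.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_running():
            break  # worker stopped within the timeout
        time.sleep(interval)
    else:
        raise AssertionError(f"worker still running after {timeout}s")
```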
```python
vllm_child = None
for child in children:
    cmdline = child.cmdline()
    if len(cmdline) > 0 and cmdline[0] == "VLLM::EngineCore":
        logger.warning(
            f"Killing vLLM engine process {{ pid: {child.pid}, cmdline: '{' '.join(cmdline)}' }}"
        )
        vllm_child = child
        child.kill()
        break
```
Verify that the vLLM engine process was found before proceeding.
If no child process with cmdline[0] == "VLLM::EngineCore" is found, vllm_child remains None and the test continues without actually killing the engine. This would make the test invalid since it wouldn't be testing the EngineDeadError condition.
Add a check after the loop to ensure the engine process was found:
```diff
 for child in children:
     cmdline = child.cmdline()
     if len(cmdline) > 0 and cmdline[0] == "VLLM::EngineCore":
         logger.warning(
             f"Killing vLLM engine process {{ pid: {child.pid}, cmdline: '{' '.join(cmdline)}' }}"
         )
         vllm_child = child
         child.kill()
         break
+
+if vllm_child is None:
+    pytest.fail("Could not find VLLM::EngineCore process to kill")
```
🤖 Prompt for AI Agents
In tests/fault_tolerance/test_vllm_health_check.py around lines 194 to 203, the
loop may leave vllm_child as None if no child process matches
"VLLM::EngineCore"; add a check immediately after the loop that fails the test
if vllm_child is still None (e.g., assert vllm_child is not None or call
pytest.fail with a clear message) so the test does not continue without actually
killing the engine and invalidating the EngineDeadError scenario.
```python
# Wait for the vLLM engine process to actually die
if vllm_child:
    try:
        vllm_child.wait(timeout=30)
        logger.info(f"vLLM engine process {vllm_child.pid} has been terminated")
    except Exception as e:
        logger.error(f"Failed to wait for vLLM engine process to die: {e}")
        pytest.fail(f"vLLM engine process did not terminate: {e}")
```
Improve exception handling per static analysis recommendations.
The exception handling has two issues:
- Catching blind Exception is too broad and can mask unexpected errors.
- Using logging.error instead of logging.exception loses the traceback information.
Apply this diff to address both issues:
```diff
 # Wait for the vLLM engine process to actually die
 if vllm_child:
     try:
         vllm_child.wait(timeout=30)
         logger.info(f"vLLM engine process {vllm_child.pid} has been terminated")
-    except Exception as e:
-        logger.error(f"Failed to wait for vLLM engine process to die: {e}")
+    except (psutil.TimeoutExpired, psutil.NoSuchProcess) as e:
+        logger.exception(f"Failed to wait for vLLM engine process to die: {e}")
         pytest.fail(f"vLLM engine process did not terminate: {e}")
```
Note: You'll need to ensure psutil is imported if not already available. Alternatively, if using psutil.Process.wait(), catching psutil.TimeoutExpired is appropriate.
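The same narrow-exception idea can be shown with only the stdlib. This sketch uses `subprocess.Popen` instead of `psutil.Process` (so the exception type is `subprocess.TimeoutExpired` rather than psutil's), and the `wait_for_exit` helper name is invented here for illustration:

```python
import subprocess
import sys


def wait_for_exit(proc: subprocess.Popen, timeout: float = 30.0) -> int:
    """Wait for a killed process to exit, surfacing a timeout as a hard
    failure instead of swallowing it under a blanket `except Exception`."""
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired as e:
        # Only the expected failure mode is caught; anything else propagates.
        raise AssertionError(f"process {proc.pid} did not terminate: {e}") from e


# Example: kill a long-running child, then confirm it really exited.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.kill()
code = wait_for_exit(child, timeout=10.0)
```

Catching only the timeout type means an unrelated bug (say, a typo raising `AttributeError`) fails loudly with its own traceback rather than being misreported as a termination problem.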
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.3)
210-210: Do not catch blind exception: Exception
(BLE001)
211-211: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In tests/fault_tolerance/test_vllm_health_check.py around lines 205 to 212, the
try/except is catching a broad Exception and logging without traceback; change
it to catch the specific timeout exception (use subprocess.TimeoutExpired or
psutil.TimeoutExpired depending on whether vllm_child is a subprocess.Popen or
psutil.Process) and handle other unexpected errors by re-raising; replace
logger.error with logger.exception when logging the timeout failure so the
traceback is captured; if you decide to use psutil.TimeoutExpired, ensure psutil
is imported at the top of the file.
Overview:
This PR fixes the flaky test_vllm_health_check_active test by adding proper synchronization and error handling, allowing it to be re-enabled.
Details:
The test_vllm_health_check_active test was previously disabled due to flakiness. This PR addresses the root causes of the flakiness:
- Removed the skip marker: Removed @pytest.mark.skip(reason="Flaky, temporarily disabled") to re-enable the test.
- Added proper process termination wait: After killing the vLLM engine process, the test now explicitly waits (up to 30 seconds) for the process to fully terminate before proceeding, eliminating race conditions.
- Improved error handling for the test request: Added proper exception handling and status code validation when sending the completion request after the engine is killed. The test now correctly handles both HTTP error responses and connection exceptions.
- Added polling for worker shutdown: Replaced the single check with a polling loop (60 seconds max, 0.5s intervals) to wait for the worker process to stop after the EngineDeadError condition. This eliminates timing-related failures where the worker had not yet stopped when checked.
These changes ensure the test reliably validates the worker fault tolerance behavior without race conditions or timing issues.
Where should the reviewer start?
tests/fault_tolerance/test_vllm_health_check.py
Related Issues:
DIS-993