feat: Update test_vllm_health_check_active to make it stable #4200

tzulingk · 2025-11-08T03:52:10Z

Overview:

This PR fixes the flaky test_vllm_health_check_active test by adding proper synchronization and error handling, allowing it to be re-enabled.

Details:

The test_vllm_health_check_active test was previously disabled due to flakiness. This PR addresses the root causes of the flakiness:

Removed the skip marker: Removed @pytest.mark.skip(reason="Flaky, temporarily disabled") to re-enable the test.
Added proper process termination wait: After killing the vLLM engine process, the test now explicitly waits for the process to fully terminate (up to 30 seconds) before proceeding, eliminating race conditions.
Improved error handling for the test request: Added proper exception handling and status code validation when sending the completion request after the engine is killed. The test now correctly handles both HTTP error responses and connection exceptions.
Added polling for worker shutdown: Replaced the single check with a polling loop (60 seconds max, 0.5s intervals) to wait for the worker process to stop after the EngineDeadError condition. This eliminates timing-related failures where the worker hadn't stopped yet when checked.

These changes ensure the test reliably validates the worker fault tolerance behavior without race conditions or timing issues.

Where should the reviewer start?

tests/fault_tolerance/test_vllm_health_check.py

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

DIS-993

Summary by CodeRabbit

Release Notes

Tests
- Re-enabled a previously skipped fault tolerance test
- Enhanced engine process tracking and termination verification
- Improved error validation to confirm proper failure handling after engine termination
- Added robust wait logic with timeout and polling for process cleanup

Signed-off-by: [email protected] <[email protected]>

coderabbitai · 2025-11-08T03:55:25Z

Walkthrough

A test for vLLM health checks was re-enabled and refactored to properly track engine process lifecycle, verify error handling after engine termination, and add robust waiting mechanisms to confirm worker process shutdown.

Changes

Cohort / File(s)	Change Summary
vLLM health check test updates `tests/fault_tolerance/test_vllm_health_check.py`	Removed skip marker to re-enable test; added vLLM engine process tracking; introduced robust waiting loop for worker process termination; reworked post-engine-kill request verification to explicitly assert error responses; enhanced error handling and logging for clearer diagnostics.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Verify the waiting loop logic for process termination and timeout handling
Confirm error response assertions correctly validate non-2xx status or request exceptions
Review process tracking and cleanup to ensure no resource leaks

Poem

🐰 A health check awakens, no longer in slumber,
We track the engine's pulse, digit by number,
When death comes to processes, errors run true,
A rabbit ensures the tests work anew! 🔧

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main objective of the PR: stabilizing a flaky test by updating its implementation.
Description check	✅ Passed	The description comprehensively covers all required template sections with detailed explanations of changes, clear problem/solution framing, and references the target file and related issue.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/fault_tolerance/test_vllm_health_check.py (1)
1-1: Fix formatting with black.

The pre-commit hook indicates that the file needs to be formatted with black. Please run black on this file to fix the formatting issues before merging.

Run the following command to format the file:
black tests/fault_tolerance/test_vllm_health_check.py

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51c4fe6 and 660f42e.

📒 Files selected for processing (1)

tests/fault_tolerance/test_vllm_health_check.py (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 1890
File: examples/vllm/deploy/agg.yaml:63-70
Timestamp: 2025-07-14T23:01:16.218Z
Learning: In vLLM worker deployments, grep-based log checks for "VllmWorker.*has been initialized" are appropriate for readiness probes to verify worker startup, but should not be used for liveness probes which need to detect ongoing worker health.

Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

🧬 Code graph analysis (1)

tests/fault_tolerance/test_vllm_health_check.py (2)

tests/kvbm_integration/test_cuda_graph.py (1)

send_completion_request (111-141)

tests/utils/managed_process.py (1)

is_running (549-553)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4200/merge) by tzulingk.

tests/fault_tolerance/test_vllm_health_check.py

[error] 1-1: Pre-commit hook 'black' reformatted the file. Hooks modified the file and the pre-commit run failed. Command: 'pre-commit run --show-diff-on-failure --color=always --all-files'.

🪛 Ruff (0.14.3)

tests/fault_tolerance/test_vllm_health_check.py

210-210: Do not catch blind exception: Exception

(BLE001)

211-211: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: sglang (amd64)
GitHub Check: sglang (arm64)
GitHub Check: operator (amd64)
GitHub Check: trtllm (amd64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (2)

tests/fault_tolerance/test_vllm_health_check.py (2)

214-226: LGTM: Correctly verifies failure behavior after engine termination.

The logic properly validates that the request fails after killing the vLLM engine by accepting either:

An error status code (>= 400), or

A RequestException

This is the correct approach for verifying the EngineDeadError condition.

227-241: LGTM: Robust polling loop with appropriate timeout.

The polling loop correctly uses the while-else pattern to wait for the worker to stop after the EngineDeadError condition. The 60-second timeout with 0.5-second intervals provides adequate time for graceful shutdown while maintaining quick feedback.

coderabbitai · 2025-11-08T03:55:28Z

tests/fault_tolerance/test_vllm_health_check.py

+            vllm_child = None
            for child in children:
                cmdline = child.cmdline()
                if len(cmdline) > 0 and cmdline[0] == "VLLM::EngineCore":
                    logger.warning(
                        f"Killing vLLM engine process {{ pid: {child.pid}, cmdline: '{' '.join(cmdline)}' }}"
                    )
+                    vllm_child = child
                    child.kill()
                    break


⚠️ Potential issue | 🟠 Major

Verify that the vLLM engine process was found before proceeding.

If no child process with cmdline[0] == "VLLM::EngineCore" is found, vllm_child remains None and the test continues without actually killing the engine. This would make the test invalid since it wouldn't be testing the EngineDeadError condition.

Add a check after the loop to ensure the engine process was found:

for child in children: cmdline = child.cmdline() if len(cmdline) > 0 and cmdline[0] == "VLLM::EngineCore": logger.warning( f"Killing vLLM engine process {{ pid: {child.pid}, cmdline: '{' '.join(cmdline)}' }}" ) vllm_child = child child.kill() break + + if vllm_child is None: + pytest.fail("Could not find VLLM::EngineCore process to kill")

🤖 Prompt for AI Agents

In tests/fault_tolerance/test_vllm_health_check.py around lines 194 to 203, the loop may leave vllm_child as None if no child process matches "VLLM::EngineCore"; add a check immediately after the loop that fails the test if vllm_child is still None (e.g., assert vllm_child is not None or call pytest.fail with a clear message) so the test does not continue without actually killing the engine and invalidating the EngineDeadError scenario.

coderabbitai · 2025-11-08T03:55:28Z

tests/fault_tolerance/test_vllm_health_check.py

+            # Wait for the vLLM engine process to actually die
+            if vllm_child:
+                try:
+                    vllm_child.wait(timeout=30)
+                    logger.info(f"vLLM engine process {vllm_child.pid} has been terminated")
+                except Exception as e:
+                    logger.error(f"Failed to wait for vLLM engine process to die: {e}")
+                    pytest.fail(f"vLLM engine process did not terminate: {e}")


⚠️ Potential issue | 🟡 Minor

Improve exception handling per static analysis recommendations.

The exception handling has two issues:

Catching blind Exception is too broad and can mask unexpected errors.

Using logging.error instead of logging.exception loses the traceback information.

Apply this diff to address both issues:

# Wait for the vLLM engine process to actually die if vllm_child: try: vllm_child.wait(timeout=30) logger.info(f"vLLM engine process {vllm_child.pid} has been terminated") - except Exception as e: - logger.error(f"Failed to wait for vLLM engine process to die: {e}") + except (psutil.TimeoutExpired, psutil.NoSuchProcess) as e: + logger.exception(f"Failed to wait for vLLM engine process to die: {e}") pytest.fail(f"vLLM engine process did not terminate: {e}")

Note: You'll need to ensure psutil is imported if not already available. Alternatively, if using psutil.Process.wait(), catching psutil.TimeoutExpired is appropriate.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.14.3)

210-210: Do not catch blind exception: Exception

(BLE001)

211-211: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents

In tests/fault_tolerance/test_vllm_health_check.py around lines 205 to 212, the try/except is catching a broad Exception and logging without traceback; change it to catch the specific timeout exception (use subprocess.TimeoutExpired or psutil.TimeoutExpired depending on whether vllm_child is a subprocess.Popen or psutil.Process) and handle other unexpected errors by re-raising; replace logger.error with logger.exception when logging the timeout failure so the traceback is captured; if you decide to use psutil.TimeoutExpired, ensure psutil is imported at the top of the file.

Update test_vllm_health_check_active to make it stable

660f42e

Signed-off-by: [email protected] <[email protected]>

tzulingk requested review from a team as code owners November 8, 2025 03:52

pull-request-size bot added the size/M label Nov 8, 2025

github-actions bot added the feat label Nov 8, 2025

tzulingk requested review from kthui, nnshah1 and tedzhouhk November 8, 2025 03:53

tzulingk enabled auto-merge (squash) November 8, 2025 03:53

black format

6150509

Signed-off-by: [email protected] <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB November 8, 2025 03:55 Inactive

coderabbitai bot reviewed Nov 8, 2025

View reviewed changes

copy-pr-bot bot temporarily deployed to GITLAB November 8, 2025 03:55 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Update test_vllm_health_check_active to make it stable #4200

feat: Update test_vllm_health_check_active to make it stable #4200

Uh oh!

tzulingk commented Nov 8, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 8, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Nov 8, 2025

Uh oh!

coderabbitai bot Nov 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: Update test_vllm_health_check_active to make it stable #4200

Are you sure you want to change the base?

feat: Update test_vllm_health_check_active to make it stable #4200

Uh oh!

Conversation

tzulingk commented Nov 8, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Nov 8, 2025

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 8, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 8, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tzulingk commented Nov 8, 2025 •

edited by coderabbitai bot

Loading