Skip to content

Conversation

@tzulingk
Copy link
Contributor

@tzulingk tzulingk commented Nov 8, 2025

Overview:

This PR fixes the flaky test_vllm_health_check_active test by adding proper synchronization and error handling, allowing it to be re-enabled.

Details:

The test_vllm_health_check_active test was previously disabled due to flakiness. This PR addresses the root causes of the flakiness:

  1. Removed the skip marker: Removed @pytest.mark.skip(reason="Flaky, temporarily disabled") to re-enable the test.

  2. Added proper process termination wait: After killing the vLLM engine process, the test now explicitly waits for the process to fully terminate (up to 30 seconds) before proceeding, eliminating race conditions.

  3. Improved error handling for the test request: Added proper exception handling and status code validation when sending the completion request after the engine is killed. The test now correctly handles both HTTP error responses and connection exceptions.

  4. Added polling for worker shutdown: Replaced the single check with a polling loop (60 seconds max, 0.5s intervals) to wait for the worker process to stop after the EngineDeadError condition. This eliminates timing-related failures where the worker hadn't stopped yet when checked.

These changes ensure the test reliably validates the worker fault tolerance behavior without race conditions or timing issues.

Where should the reviewer start?

tests/fault_tolerance/test_vllm_health_check.py

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

DIS-993

Summary by CodeRabbit

Release Notes

  • Tests
    • Re-enabled a previously skipped fault tolerance test
    • Enhanced engine process tracking and termination verification
    • Improved error validation to confirm proper failure handling after engine termination
    • Added robust wait logic with timeout and polling for process cleanup

@tzulingk tzulingk requested review from a team as code owners November 8, 2025 03:52
@github-actions github-actions bot added the feat label Nov 8, 2025
@tzulingk tzulingk enabled auto-merge (squash) November 8, 2025 03:53
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 8, 2025

Walkthrough

A test for vLLM health checks was re-enabled and refactored to properly track engine process lifecycle, verify error handling after engine termination, and add robust waiting mechanisms to confirm worker process shutdown.

Changes

Cohort / File(s) Change Summary
vLLM health check test updates
tests/fault_tolerance/test_vllm_health_check.py
Removed skip marker to re-enable test; added vLLM engine process tracking; introduced robust waiting loop for worker process termination; reworked post-engine-kill request verification to explicitly assert error responses; enhanced error handling and logging for clearer diagnostics.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Verify the waiting loop logic for process termination and timeout handling
  • Confirm error response assertions correctly validate non-2xx status or request exceptions
  • Review process tracking and cleanup to ensure no resource leaks

Poem

🐰 A health check awakens, no longer in slumber,
We track the engine's pulse, digit by number,
When death comes to processes, errors run true,
A rabbit ensures the tests work anew! 🔧

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main objective of the PR: stabilizing a flaky test by updating its implementation.
Description check ✅ Passed The description comprehensively covers all required template sections with detailed explanations of changes, clear problem/solution framing, and references the target file and related issue.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fault_tolerance/test_vllm_health_check.py (1)

1-1: Fix formatting with black.

The pre-commit hook indicates that the file needs to be formatted with black. Please run black on this file to fix the formatting issues before merging.

Run the following command to format the file:

black tests/fault_tolerance/test_vllm_health_check.py
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51c4fe6 and 660f42e.

📒 Files selected for processing (1)
  • tests/fault_tolerance/test_vllm_health_check.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 1890
File: examples/vllm/deploy/agg.yaml:63-70
Timestamp: 2025-07-14T23:01:16.218Z
Learning: In vLLM worker deployments, grep-based log checks for "VllmWorker.*has been initialized" are appropriate for readiness probes to verify worker startup, but should not be used for liveness probes which need to detect ongoing worker health.
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
🧬 Code graph analysis (1)
tests/fault_tolerance/test_vllm_health_check.py (2)
tests/kvbm_integration/test_cuda_graph.py (1)
  • send_completion_request (111-141)
tests/utils/managed_process.py (1)
  • is_running (549-553)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4200/merge) by tzulingk.
tests/fault_tolerance/test_vllm_health_check.py

[error] 1-1: Pre-commit hook 'black' reformatted the file. Hooks modified the file and the pre-commit run failed. Command: 'pre-commit run --show-diff-on-failure --color=always --all-files'.

🪛 Ruff (0.14.3)
tests/fault_tolerance/test_vllm_health_check.py

210-210: Do not catch blind exception: Exception

(BLE001)


211-211: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
tests/fault_tolerance/test_vllm_health_check.py (2)

214-226: LGTM: Correctly verifies failure behavior after engine termination.

The logic properly validates that the request fails after killing the vLLM engine by accepting either:

  • An error status code (>= 400), or
  • A RequestException

This is the correct approach for verifying the EngineDeadError condition.


227-241: LGTM: Robust polling loop with appropriate timeout.

The polling loop correctly uses the while-else pattern to wait for the worker to stop after the EngineDeadError condition. The 60-second timeout with 0.5-second intervals provides adequate time for graceful shutdown while maintaining quick feedback.

Comment on lines +194 to 203
vllm_child = None
for child in children:
cmdline = child.cmdline()
if len(cmdline) > 0 and cmdline[0] == "VLLM::EngineCore":
logger.warning(
f"Killing vLLM engine process {{ pid: {child.pid}, cmdline: '{' '.join(cmdline)}' }}"
)
vllm_child = child
child.kill()
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Verify that the vLLM engine process was found before proceeding.

If no child process with cmdline[0] == "VLLM::EngineCore" is found, vllm_child remains None and the test continues without actually killing the engine. This would make the test invalid since it wouldn't be testing the EngineDeadError condition.

Add a check after the loop to ensure the engine process was found:

             for child in children:
                 cmdline = child.cmdline()
                 if len(cmdline) > 0 and cmdline[0] == "VLLM::EngineCore":
                     logger.warning(
                         f"Killing vLLM engine process {{ pid: {child.pid}, cmdline: '{' '.join(cmdline)}' }}"
                     )
                     vllm_child = child
                     child.kill()
                     break
+
+            if vllm_child is None:
+                pytest.fail("Could not find VLLM::EngineCore process to kill")
🤖 Prompt for AI Agents
In tests/fault_tolerance/test_vllm_health_check.py around lines 194 to 203, the
loop may leave vllm_child as None if no child process matches
"VLLM::EngineCore"; add a check immediately after the loop that fails the test
if vllm_child is still None (e.g., assert vllm_child is not None or call
pytest.fail with a clear message) so the test does not continue without actually
killing the engine and invalidating the EngineDeadError scenario.

Comment on lines 205 to 212
# Wait for the vLLM engine process to actually die
if vllm_child:
try:
vllm_child.wait(timeout=30)
logger.info(f"vLLM engine process {vllm_child.pid} has been terminated")
except Exception as e:
logger.error(f"Failed to wait for vLLM engine process to die: {e}")
pytest.fail(f"vLLM engine process did not terminate: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Improve exception handling per static analysis recommendations.

The exception handling has two issues:

  1. Catching blind Exception is too broad and can mask unexpected errors.
  2. Using logging.error instead of logging.exception loses the traceback information.

Apply this diff to address both issues:

             # Wait for the vLLM engine process to actually die
             if vllm_child:
                 try:
                     vllm_child.wait(timeout=30)
                     logger.info(f"vLLM engine process {vllm_child.pid} has been terminated")
-                except Exception as e:
-                    logger.error(f"Failed to wait for vLLM engine process to die: {e}")
+                except (psutil.TimeoutExpired, psutil.NoSuchProcess) as e:
+                    logger.exception(f"Failed to wait for vLLM engine process to die: {e}")
                     pytest.fail(f"vLLM engine process did not terminate: {e}")

Note: You'll need to ensure psutil is imported if not already available. Alternatively, if using psutil.Process.wait(), catching psutil.TimeoutExpired is appropriate.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.3)

210-210: Do not catch blind exception: Exception

(BLE001)


211-211: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In tests/fault_tolerance/test_vllm_health_check.py around lines 205 to 212, the
try/except is catching a broad Exception and logging without traceback; change
it to catch the specific timeout exception (use subprocess.TimeoutExpired or
psutil.TimeoutExpired depending on whether vllm_child is a subprocess.Popen or
psutil.Process) and handle other unexpected errors by re-raising; replace
logger.error with logger.exception when logging the timeout failure so the
traceback is captured; if you decide to use psutil.TimeoutExpired, ensure psutil
is imported at the top of the file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants