Skip to content

Conversation

@njhill
Copy link
Member

@njhill njhill commented Aug 28, 2025

QOL improvement

We use a create_new_process_for_each_test decorator on many tests to avoid cleanup issues. However when such tests fail the pytest error is from the process failure not the actual test failure. This change propagates the actual test exception of interest so that it's reported by pytest as if the test was running inline.

Before:
image

After:
image

@mergify mergify bot added the ci/build label Aug 28, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a great quality-of-life improvement for debugging tests that run in forked processes. By serializing and re-raising exceptions from the child process, it ensures that the root-cause error is not lost, which will significantly speed up debugging.

I've found one critical bug related to operator precedence that prevents the exception from being correctly re-raised, and one high-severity issue concerning a resource leak where temporary files are not being cleaned up. Addressing these issues will make this a solid and robust enhancement.

@njhill njhill force-pushed the root-cause-failures branch from 5bc6669 to 00f167a Compare August 28, 2025 06:59
@vllm-project vllm-project deleted a comment from gemini-code-assist bot Aug 28, 2025
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 28, 2025
@njhill njhill linked an issue Aug 28, 2025 that may be closed by this pull request
1 task
@njhill njhill removed a link to an issue Aug 28, 2025
1 task
@DarkLight1337
Copy link
Member

Retrying async tests to see if the timeout is just flaky or caused by this PR.

@njhill
Copy link
Member Author

njhill commented Aug 29, 2025

Retrying async tests to see if the timeout is just flaky or caused by this PR.

Thanks @DarkLight1337. Looks like it hung again at the same place after ~10min. I'll cancel it so that it doesn't timeout after 3hrs again, will investigate tomorrow.

@njhill njhill force-pushed the root-cause-failures branch from 1510ff0 to a5b79e2 Compare September 5, 2025 21:50
@simon-mo simon-mo merged commit 4db4426 into vllm-project:main Sep 10, 2025
69 of 72 checks passed
@njhill njhill deleted the root-cause-failures branch September 10, 2025 21:43
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants