[RFC][Core] propagate the error message up to the frontend process #25722
Conversation
@842974287 has exported this pull request. If you are a Meta employee, you can view the originating diff in D83293058.
Code Review
This pull request aims to propagate error messages from the engine core process to the frontend, which is a valuable enhancement for debugging. The implementation correctly adds error handling to propagate exceptions during initialization. However, I've found some critical issues in the wait_for_engine_startup function in vllm/v1/engine/utils.py. The new polling logic introduces a potential NameError, an IndexError, and can cause the process to hang if a worker fails silently. These issues need to be addressed to ensure the error handling is robust.
```diff
  events = poller.poll(STARTUP_POLL_PERIOD_MS)
- if not events:
+ proc_manager_events = proc_manager_poll.poll(STARTUP_POLL_PERIOD_MS)
+ if not events and not proc_manager_events:
```
The revised loop structure for polling has several critical issues:
1. **IndexError**: If `proc_manager_events` is non-empty but `events` is empty, the check `if len(events) > 1 or events[0][0] != handshake_socket:` on line 782 will raise an `IndexError` when accessing `events[0]`. You should guard this access, for example with `if events:`.
2. **Process Hang**: In the same scenario (a worker dies silently, so `proc_manager_events` is non-empty and `events` is empty), if the `IndexError` is avoided, the code will proceed to `handshake_socket.recv_multipart()` on line 791. Since there is no message, this will block indefinitely, causing a hang. Process exit events should be handled before trying to receive messages.
3. **Incomplete Failure Detection**: The check `if len(proc_manager_events) > 1:` on line 868 is incorrect. If a single worker process fails silently, `len(proc_manager_events)` will be 1 and the failure will go undetected within the loop. This should likely be `if proc_manager_events:`.
These issues suggest that the loop logic needs to be restructured to correctly handle process exits from both coord_process and worker processes, especially when they fail without sending a ZMQ message.
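To make that concrete, here is a minimal sketch of one way such a loop could be restructured. It uses hypothetical names (`wait_for_startup_sketch`, `procs`, and an assumed value for `STARTUP_POLL_PERIOD_MS`) around the `poller`, `proc_manager_poll`, and `handshake_socket` identifiers quoted above; it is not the actual vLLM code.

```python
STARTUP_POLL_PERIOD_MS = 10_000  # assumed value, for illustration only


def wait_for_startup_sketch(handshake_socket, poller, proc_manager_poll,
                            procs):
    """Hypothetical loop: surface dead processes before any blocking recv."""
    while True:
        events = poller.poll(STARTUP_POLL_PERIOD_MS)
        proc_manager_events = proc_manager_poll.poll(STARTUP_POLL_PERIOD_MS)

        # Handle process exits first: a silently-dead worker never sends
        # a handshake message, so recv_multipart() would block forever.
        if proc_manager_events:
            finished = {p.name: p.exitcode
                        for p in procs if p.exitcode is not None}
            raise RuntimeError("Engine core initialization failed. "
                               f"Failed core proc(s): {finished}")

        if not events:
            continue  # nothing ready yet; keep polling

        # events is guaranteed non-empty here, so events[0] is safe.
        if len(events) > 1 or events[0][0] != handshake_socket:
            raise RuntimeError("Unexpected socket event during startup")

        return handshake_socket.recv_multipart()
```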
vllm/v1/engine/utils.py (Outdated)
```diff
  if coord_process is not None and coord_process.exitcode is not None:
-     finished[coord_process.name] = coord_process.exitcode
+     finished = {coord_process.name: coord_process.exitcode}
  raise RuntimeError("Engine core initialization failed. "
                     "See root cause above. "
                     f"Failed core proc(s): {finished}")
```
There's a potential `NameError` here. The `finished` variable is only defined inside the `if` block. If `coord_process` is `None` or `coord_process.exitcode` is `None`, `finished` will not be defined, but it is used in the `RuntimeError` f-string on line 788, which will cause a `NameError`. You should initialize `finished` before the `if` statement, for example with `finished = {}`.
Suggested change:

```diff
+ finished = {}
  if coord_process is not None and coord_process.exitcode is not None:
      finished = {coord_process.name: coord_process.exitcode}
  raise RuntimeError("Engine core initialization failed. "
                     "See root cause above. "
                     f"Failed core proc(s): {finished}")
```
njhill left a comment:
Thanks @842974287, this is something I had also been thinking we should do. I'll review properly next week.
It would be even better to propagate the pickled exception/traceback, as done in #23795; however, we'll want to ensure this is only done when the engine is running on the same node as the front-end process (where ipc rather than tcp zmq transport is used).
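As a rough sketch of that idea (hypothetical helper and `handshake_address` parameter, not code from #23795): the failure payload could carry the pickled exception only when the handshake address uses same-node ipc transport.

```python
import pickle
import traceback


def encode_failure_payload(exc, handshake_address):
    """Hypothetical: include the pickled exception only for local ipc
    transport; over tcp, fall back to a plain string message."""
    payload = {"status": "FAILED", "error_msg": str(exc)}
    if handshake_address.startswith("ipc://"):
        # Traceback objects aren't picklable, so ship a formatted
        # traceback string alongside the exception object.
        tb = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__))
        payload["error_pickle"] = pickle.dumps((exc, tb))
    return payload
```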
vllm/v1/engine/core.py (Outdated)
```python
except Exception as e:
    zmq_socket.send(
        msgspec.msgpack.encode({
            "status": "FAILED",
            "local": local_client and client_handshake_address is None,
            "headless": not local_client,
            "error_msg": str(e),
        }))
    raise
```
Can this be done inside the `_perform_handshakes` context manager instead?
Moved exception handling to `_perform_handshakes()`.
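For reference, a minimal sketch of what that can look like (hypothetical; the real `_perform_handshakes` has more responsibilities, and the payload fields follow the diff above):

```python
import contextlib

import msgspec


@contextlib.contextmanager
def perform_handshakes_sketch(zmq_socket, local_client,
                              client_handshake_address):
    """Hypothetical: report any startup failure over the handshake
    socket before re-raising it in the engine core process."""
    try:
        yield
    except Exception as e:
        zmq_socket.send(
            msgspec.msgpack.encode({
                "status": "FAILED",
                "local": local_client and client_handshake_address is None,
                "headless": not local_client,
                "error_msg": str(e),
            }))
        raise
```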
Summary:
Currently, when engine core proc initialization fails, the frontend process will always throw the following error:
```
Engine core initialization failed. See root cause above. Failed core proc(s): {}
```
In our use cases, we want the frontend process to receive the original error msg that caused the engine core proc to fail, for some additional handling/logging.
In this PR, the error msg is sent via the `ready_writer` pipe from the engine core proc to the multiproc executor process, and then the zmq handshake socket is used to send the error msg from the executor process to the frontend process (this also works for the uni executor).

Now the error thrown by the frontend process looks like the following:
```
Engine core initialization failed. See root cause above. Failed core proc(s): {}. Recived error message from failed engine 1: this is an error msg xxxxxxx
```
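A sketch of what the sending side of this path might look like (hypothetical names such as `run_engine_core_sketch` and `initialize_engine`; the actual pipe payload format in the PR may differ):

```python
def run_engine_core_sketch(ready_writer, initialize_engine):
    """Hypothetical: report init failures through the ready pipe
    instead of exiting silently."""
    try:
        engine = initialize_engine()
        ready_writer.send({"status": "READY"})
        return engine
    except Exception as e:
        # The receiving process can forward error_msg to the frontend
        # over the handshake zmq socket.
        ready_writer.send({"status": "FAILED", "error_msg": str(e)})
        raise
```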
Differential Revision: D83293058
Signed-off-by: Shiyan Deng <[email protected]>
@njhill I updated the PR following your code comment. I haven't changed it to propagate the error object via pickle though, as currently the error msg is enough for our use case. Let me know if you want me to do pickle instead. For pickle, since you mentioned we should do it only for ipc, instead of using the zmq socket we could use an `mp.Queue` to communicate between `EngineCoreProc` and the client proc.
Purpose
Currently, when engine core proc initialization fails, the frontend process will always throw the following error:
```
Engine core initialization failed. See root cause above. Failed core proc(s): {}
```
In our use cases, we want the frontend process to receive the original error msg that caused the engine core proc to fail, for some additional handling/logging.
Implementation
This is an RFC, so definitely let me know if I should do this in another way!
In this PR, the error msg is sent via the `ready_writer` pipe from the engine core proc to the multiproc executor process, and then the zmq handshake socket is used to send the error msg from the executor process to the frontend process (this also works for the uni executor).

Now the error thrown by the frontend process looks like the following:
```
Engine core initialization failed. See root cause above. Failed core proc(s): {}. Recived error message from failed engine 1: this is an error msg xxxxxxx
```
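On the receiving end, a minimal sketch of how the frontend could fold a received message into the error it raises (hypothetical function; the field names follow the handshake payload shown in the review thread above):

```python
import msgspec


def raise_startup_failure(handshake_socket, engine_index, finished):
    """Hypothetical: append any received error message to the
    RuntimeError the frontend already raises. Assumes the poller has
    already reported handshake_socket as readable."""
    reply = msgspec.msgpack.decode(handshake_socket.recv())
    msg = ("Engine core initialization failed. See root cause above. "
           f"Failed core proc(s): {finished}.")
    if reply.get("status") == "FAILED" and reply.get("error_msg"):
        msg += (f" Received error message from failed engine "
                f"{engine_index}: {reply['error_msg']}")
    raise RuntimeError(msg)
```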
Differential Revision: D83293058