
Conversation

@842974287 (Contributor) commented Sep 25, 2025

Purpose

Currently, when engine core proc initialization fails, the frontend process always throws the following error:

```
Engine core initialization failed. See root cause above. Failed core proc(s): {}
```

In our use cases, we want the frontend process to receive the original error message that caused the engine core proc to fail, so that we can do some additional handling/logging.

Implementation

This is an RFC, so definitely let me know if I should do this another way!

In this PR, the error message is sent via the `ready_writer` pipe from the engine core proc to the multiproc executor process, and then forwarded from the executor process to the frontend process over the zmq handshake socket (this also works for the uni executor).
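For illustration, here is a minimal sketch of this relay pattern under simplified assumptions (hypothetical names and plain multiprocessing primitives, not the actual vLLM classes): the failing child reports its error through the ready pipe before dying, and the parent surfaces the original message instead of a bare exit code.

```python
import multiprocessing as mp


def engine_core_main(ready_writer):
    try:
        raise RuntimeError("this is an error msg xxxxxxx")  # simulated init failure
    except Exception as e:
        # Report the failure reason over the pipe before dying, mirroring
        # how this PR reuses the ready_writer pipe.
        ready_writer.send({"status": "FAILED", "error_msg": str(e)})
        raise


if __name__ == "__main__":
    reader, writer = mp.Pipe(duplex=False)
    proc = mp.Process(target=engine_core_main, args=(writer,))
    proc.start()
    msg = reader.recv()  # parent (executor/frontend side) reads the status
    proc.join()
    if msg["status"] == "FAILED":
        raise RuntimeError("Engine core initialization failed. Received error "
                           f"message from failed engine: {msg['error_msg']}")
```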

Now the error thrown by the frontend process looks like the following:

```
Engine core initialization failed. See root cause above. Failed core proc(s): {}. Received error message from failed engine 1: this is an error msg xxxxxxx
```

Differential Revision: D83293058

@facebook-github-bot

@842974287 has exported this pull request. If you are a Meta employee, you can view the originating diff in D83293058.

@mergify (bot) added the v1 label Sep 25, 2025
@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request aims to propagate error messages from the engine core process to the frontend, which is a valuable enhancement for debugging. The implementation correctly adds error handling to propagate exceptions during initialization. However, I've found some critical issues in the `wait_for_engine_startup` function in `vllm/v1/engine/utils.py`. The new polling logic introduces a potential NameError, an IndexError, and can cause the process to hang if a worker fails silently. These issues need to be addressed to ensure the error handling is robust.

Comment on lines 769 to +816
```diff
 events = poller.poll(STARTUP_POLL_PERIOD_MS)
-if not events:
+proc_manager_events = proc_manager_poll.poll(STARTUP_POLL_PERIOD_MS)
+if not events and not proc_manager_events:
```
critical

The revised loop structure for polling has several critical issues:

  1. IndexError: If `proc_manager_events` is non-empty but `events` is empty, the check `if len(events) > 1 or events[0][0] != handshake_socket:` on line 782 will raise an IndexError when accessing `events[0]`. You should guard this access, for example with `if events:`.
  2. Process hang: In the same scenario (a worker dies silently, so `proc_manager_events` is non-empty and `events` is empty), if the IndexError is avoided, the code will proceed to `handshake_socket.recv_multipart()` on line 791. Since there is no message, this will block indefinitely, causing a hang. Process exit events should be handled before trying to receive messages.
  3. Incomplete failure detection: The check `if len(proc_manager_events) > 1:` on line 868 is incorrect. If a single worker process fails silently, `len(proc_manager_events)` will be 1, and the failure will go undetected within the loop. This should likely be `if proc_manager_events:`.

These issues suggest that the loop logic needs to be restructured to correctly handle process exits from both coord_process and worker processes, especially when they fail without sending a ZMQ message.
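One possible restructuring along these lines, as a sketch only (hypothetical helper names borrowed from the snippet above, not the actual vllm/v1/engine/utils.py code):

```python
import zmq

STARTUP_POLL_PERIOD_MS = 10_000


def wait_for_startup(handshake_socket: zmq.Socket, poller: zmq.Poller,
                     proc_manager_poll: zmq.Poller, check_finished_procs):
    while True:
        events = poller.poll(STARTUP_POLL_PERIOD_MS)
        proc_manager_events = proc_manager_poll.poll(STARTUP_POLL_PERIOD_MS)

        # Handle process exits first, so a worker that dies without sending
        # a ZMQ message is detected instead of hanging in recv below.
        if proc_manager_events:  # `if proc_manager_events:`, not `> 1`
            finished = check_finished_procs()
            raise RuntimeError("Engine core initialization failed. "
                               f"Failed core proc(s): {finished}")

        # Guard the indexing so an empty poll result cannot raise IndexError.
        if not events:
            continue
        if len(events) > 1 or events[0][0] != handshake_socket:
            raise RuntimeError("Unexpected socket ready during startup")

        # A handshake message is known to be pending, so recv will not block.
        return handshake_socket.recv_multipart()
```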

Comment on lines 784 to 836
```diff
 if coord_process is not None and coord_process.exitcode is not None:
-    finished[coord_process.name] = coord_process.exitcode
+    finished = {coord_process.name: coord_process.exitcode}
 raise RuntimeError("Engine core initialization failed. "
                    "See root cause above. "
                    f"Failed core proc(s): {finished}")
```
critical

There's a potential NameError here. The `finished` variable is only defined inside the `if` block. If `coord_process` is None or `coord_process.exitcode` is None, `finished` will not be defined, but it is used in the RuntimeError f-string on line 788, which will cause a NameError. You should initialize `finished` before the `if` statement, for example with `finished = {}`.

Suggested change:

```python
finished = {}
if coord_process is not None and coord_process.exitcode is not None:
    finished = {coord_process.name: coord_process.exitcode}
raise RuntimeError("Engine core initialization failed. "
                   "See root cause above. "
                   f"Failed core proc(s): {finished}")
```

@842974287 changed the title from "propagate the error message up to the frontend process" to "[RFC] propagate the error message up to the frontend process" Sep 25, 2025
@842974287 changed the title from "[RFC] propagate the error message up to the frontend process" to "[RFC][Core] propagate the error message up to the frontend process" Sep 26, 2025
@njhill (Member) left a comment

Thanks @842974287, this is something I had also been thinking we should do. I'll review properly next week.

It would be even better to propagate the pickled exception/traceback, as done in #23795; however, we'll want to ensure this is only done when the engine is running on the same node as the front-end process (where ipc rather than tcp zmq transport is used).
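For illustration only (this PR sends `str(e)`; the following assumes the third-party tblib package, which makes tracebacks picklable), propagating the full exception could look roughly like this:

```python
import pickle
import traceback

from tblib import pickling_support  # third-party; assumed available

pickling_support.install()  # patch pickling so __traceback__ survives


def encode_failure(e: BaseException) -> bytes:
    # Only safe over same-node ipc transport: unpickling bytes received
    # over tcp from a remote peer would be a security risk.
    return pickle.dumps(e)


def reraise_failure(payload: bytes) -> None:
    e = pickle.loads(payload)
    # The original traceback survives the round trip, so the frontend can
    # log the real root cause rather than just a message string.
    traceback.print_exception(type(e), e, e.__traceback__)
    raise e
```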

Comment on lines 532 to 544
```python
except Exception as e:
    zmq_socket.send(
        msgspec.msgpack.encode({
            "status": "FAILED",
            "local": local_client and client_handshake_address is None,
            "headless": not local_client,
            "error_msg": str(e),
        }))
    raise
```
@njhill (Member):
Can this be done inside the `_perform_handshakes` context manager instead?

@842974287 (Contributor Author):

Moved exception handling to `_perform_handshakes()`.
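The resulting shape is roughly the following sketch (the pattern only, not the exact `_perform_handshakes()` code from the PR):

```python
from contextlib import contextmanager

import msgspec
import zmq


@contextmanager
def perform_handshakes(zmq_socket: zmq.Socket, local_client: bool,
                       client_handshake_address):
    try:
        yield
    except Exception as e:
        # Any failure inside the handshake block is reported to the peer
        # before re-raising locally, so the frontend sees the root cause.
        zmq_socket.send(
            msgspec.msgpack.encode({
                "status": "FAILED",
                "local": local_client and client_handshake_address is None,
                "headless": not local_client,
                "error_msg": str(e),
            }))
        raise
```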

@842974287 (Contributor Author) commented
@njhill I updated the PR following your code comment. I haven't changed it to propagate the error object via pickle, though, as the error message is currently enough for our use case. Let me know if you'd like me to use pickle instead. Since you mentioned we should only do pickle for ipc, instead of using the zmq socket we could use an mp.Queue to communicate between EngineCoreProc and the client proc.
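A minimal sketch of that mp.Queue alternative (hypothetical names; mp.Queue pickles its payloads, so a same-node queue can carry the exception object directly):

```python
import multiprocessing as mp


def engine_core_main(error_queue):
    try:
        raise ValueError("init failed in EngineCoreProc")  # simulated failure
    except Exception as e:
        error_queue.put(e)  # the queue pickles the exception object for us
        raise


if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=engine_core_main, args=(q,))
    p.start()
    p.join()
    if not q.empty():
        raise q.get()  # the client proc re-raises the original exception
```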
