[RFC][Core] propagate the error message up to the frontend process #25722
Conversation
@842974287 has exported this pull request. If you are a Meta employee, you can view the originating diff in D83293058.
Code Review
This pull request aims to propagate error messages from the engine core process to the frontend, which is a valuable enhancement for debugging. The implementation correctly adds error handling to propagate exceptions during initialization. However, I've found some critical issues in the wait_for_engine_startup function in vllm/v1/engine/utils.py. The new polling logic introduces a potential NameError, an IndexError, and can cause the process to hang if a worker fails silently. These issues need to be addressed to ensure the error handling is robust.
```diff
  events = poller.poll(STARTUP_POLL_PERIOD_MS)
- if not events:
+ proc_manager_events = proc_manager_poll.poll(STARTUP_POLL_PERIOD_MS)
+ if not events and not proc_manager_events:
```
The revised loop structure for polling has several critical issues:
1. **IndexError**: If `proc_manager_events` is non-empty but `events` is empty, the check `if len(events) > 1 or events[0][0] != handshake_socket:` on line 782 will raise an `IndexError` when accessing `events[0]`. You should guard this access, for example with `if events:`.
2. **Process Hang**: In the same scenario (a worker dies silently, so `proc_manager_events` is non-empty and `events` is empty), if the `IndexError` is avoided, the code will proceed to `handshake_socket.recv_multipart()` on line 791. Since there is no message, this will block indefinitely, causing a hang. Process exit events should be handled before trying to receive messages.
3. **Incomplete Failure Detection**: The check `if len(proc_manager_events) > 1:` on line 868 is incorrect. If a single worker process fails silently, `len(proc_manager_events)` will be 1 and the failure will go undetected within the loop. This should likely be `if proc_manager_events:`.
These issues suggest that the loop logic needs to be restructured to correctly handle process exits from both coord_process and worker processes, especially when they fail without sending a ZMQ message.
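To make that concrete, here is a minimal sketch of one way such a loop could be restructured. It uses hypothetical names (`wait_for_startup_sketch`, `procs`, and an assumed value for `STARTUP_POLL_PERIOD_MS`) around the `poller`, `proc_manager_poll`, and `handshake_socket` identifiers quoted above; it is not the actual vLLM code.

```python
STARTUP_POLL_PERIOD_MS = 10_000  # assumed value, for illustration only


def wait_for_startup_sketch(handshake_socket, poller, proc_manager_poll,
                            procs):
    """Hypothetical loop: surface dead processes before any blocking recv."""
    while True:
        events = poller.poll(STARTUP_POLL_PERIOD_MS)
        proc_manager_events = proc_manager_poll.poll(STARTUP_POLL_PERIOD_MS)

        # Handle process exits first: a silently-dead worker never sends
        # a handshake message, so recv_multipart() would block forever.
        if proc_manager_events:
            finished = {p.name: p.exitcode
                        for p in procs if p.exitcode is not None}
            raise RuntimeError("Engine core initialization failed. "
                               f"Failed core proc(s): {finished}")

        if not events:
            continue  # nothing ready yet; keep polling

        # events is guaranteed non-empty here, so events[0] is safe.
        if len(events) > 1 or events[0][0] != handshake_socket:
            raise RuntimeError("Unexpected socket event during startup")

        return handshake_socket.recv_multipart()
```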
vllm/v1/engine/utils.py (Outdated)
```diff
  if coord_process is not None and coord_process.exitcode is not None:
-     finished[coord_process.name] = coord_process.exitcode
+     finished = {coord_process.name: coord_process.exitcode}
  raise RuntimeError("Engine core initialization failed. "
                     "See root cause above. "
                     f"Failed core proc(s): {finished}")
```
There's a potential `NameError` here. The `finished` variable is only defined inside the `if` block. If `coord_process` is `None` or `coord_process.exitcode` is `None`, `finished` will not be defined, but it is used in the `RuntimeError` f-string on line 788, which will cause a `NameError`. You should initialize `finished` before the `if` statement, for example with `finished = {}`.
Suggested change:

```diff
+ finished = {}
  if coord_process is not None and coord_process.exitcode is not None:
      finished = {coord_process.name: coord_process.exitcode}
  raise RuntimeError("Engine core initialization failed. "
                     "See root cause above. "
                     f"Failed core proc(s): {finished}")
```
njhill left a comment:
Thanks @842974287, this is something I had also been thinking we should do. I'll review properly next week.
It would be even better to propagate the pickled exception/traceback, as done in #23795; however, we'll want to ensure this is only done when the engine is running on the same node as the front-end process (where ipc rather than tcp zmq transport is used).
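As a rough sketch of that idea (hypothetical helper and `handshake_address` parameter, not code from #23795): the failure payload could carry the pickled exception only when the handshake address uses same-node ipc transport.

```python
import pickle
import traceback


def encode_failure_payload(exc, handshake_address):
    """Hypothetical: include the pickled exception only for local ipc
    transport; over tcp, fall back to a plain string message."""
    payload = {"status": "FAILED", "error_msg": str(exc)}
    if handshake_address.startswith("ipc://"):
        # Traceback objects aren't picklable, so ship a formatted
        # traceback string alongside the exception object.
        tb = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__))
        payload["error_pickle"] = pickle.dumps((exc, tb))
    return payload
```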
vllm/v1/engine/core.py (Outdated)
```python
except Exception as e:
    zmq_socket.send(
        msgspec.msgpack.encode({
            "status": "FAILED",
            "local": local_client and client_handshake_address is None,
            "headless": not local_client,
            "error_msg": str(e),
        }))
    raise
```
Can this be done inside the `_perform_handshakes` context manager instead?
Moved exception handling to `_perform_handshakes()`.
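For reference, a minimal sketch of what that can look like (hypothetical; the real `_perform_handshakes` has more responsibilities, and the payload fields follow the diff above):

```python
import contextlib

import msgspec


@contextlib.contextmanager
def perform_handshakes_sketch(zmq_socket, local_client,
                              client_handshake_address):
    """Hypothetical: report any startup failure over the handshake
    socket before re-raising it in the engine core process."""
    try:
        yield
    except Exception as e:
        zmq_socket.send(
            msgspec.msgpack.encode({
                "status": "FAILED",
                "local": local_client and client_handshake_address is None,
                "headless": not local_client,
                "error_msg": str(e),
            }))
        raise
```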
Summary:
Currently, when engine core proc initialization fails, the frontend process will always throw the following error:
```
Engine core initialization failed. See root cause above. Failed core proc(s): {}
```
In our use cases, we want the frontend process to receive the original error msg that caused the engine core proc to fail, for some additional handling/logging.
In this PR, the error msg is sent via the `ready_writer` pipe from the engine core proc to the multiproc executor process, and then the zmq handshake socket is used to send the error msg from the executor process to the frontend process (this also works for the uni executor).

Now the error thrown by the frontend process looks like the following:
```
Engine core initialization failed. See root cause above. Failed core proc(s): {}. Recived error message from failed engine 1: this is an error msg xxxxxxx
```
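A sketch of what the sending side of this path might look like (hypothetical names such as `run_engine_core_sketch` and `initialize_engine`; the actual pipe payload format in the PR may differ):

```python
def run_engine_core_sketch(ready_writer, initialize_engine):
    """Hypothetical: report init failures through the ready pipe
    instead of exiting silently."""
    try:
        engine = initialize_engine()
        ready_writer.send({"status": "READY"})
        return engine
    except Exception as e:
        # The receiving process can forward error_msg to the frontend
        # over the handshake zmq socket.
        ready_writer.send({"status": "FAILED", "error_msg": str(e)})
        raise
```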
Differential Revision: D83293058
Signed-off-by: Shiyan Deng <[email protected]>
@njhill I updated the PR following your code comment. I haven't changed it to propagate the error object via pickle though, as currently the error msg is enough for our use case. Let me know if you want me to do pickle instead. For pickle, since you mentioned we should do it only for ipc, instead of using the zmq socket we could use an `mp.Queue` to communicate between `EngineCoreProc` and the client proc.
Purpose
Currently, when engine core proc initialization fails, the frontend process will always throw the following error:
```
Engine core initialization failed. See root cause above. Failed core proc(s): {}
```
In our use cases, we want the frontend process to receive the original error msg that caused the engine core proc to fail, for some additional handling/logging.
Implementation
This is an RFC, so definitely let me know if I should do this in another way!
In this PR, the error msg is sent via the `ready_writer` pipe from the engine core proc to the multiproc executor process, and then the zmq handshake socket is used to send the error msg from the executor process to the frontend process (this also works for the uni executor).

Now the error thrown by the frontend process looks like the following:
```
Engine core initialization failed. See root cause above. Failed core proc(s): {}. Recived error message from failed engine 1: this is an error msg xxxxxxx
```
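On the receiving end, a minimal sketch of how the frontend could fold a received message into the error it raises (hypothetical function; the field names follow the handshake payload shown in the review thread above):

```python
import msgspec


def raise_startup_failure(handshake_socket, engine_index, finished):
    """Hypothetical: append any received error message to the
    RuntimeError the frontend already raises. Assumes the poller has
    already reported handshake_socket as readable."""
    reply = msgspec.msgpack.decode(handshake_socket.recv())
    msg = ("Engine core initialization failed. See root cause above. "
           f"Failed core proc(s): {finished}.")
    if reply.get("status") == "FAILED" and reply.get("error_msg"):
        msg += (f" Received error message from failed engine "
                f"{engine_index}: {reply['error_msg']}")
    raise RuntimeError(msg)
```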
Differential Revision: D83293058