[BugFix][Core] Fix error when enable async-scheduling in multi-node env #25887
Code Review
This pull request aims to fix a launch failure for async scheduling in a multi-node environment by adjusting when the distributed executor backend is configured. The change correctly removes the premature default setting of the backend to mp. However, the new validation logic for supported backends with async scheduling seems to have some inconsistencies. I've added a comment with a suggestion to clarify this logic and make it consistent with the information provided in the pull request description.
@WoosukKwon @benchislett Hello, could you please review this MR?
This pull request has merge conflicts that must be resolved before it can be
@WoosukKwon @benchislett Hi, could you take a look at this PR?
benchislett left a comment
LGTM, one grammar nit
…he default selection. Signed-off-by: Lehua Ding <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]> Signed-off-by: Lehua Ding <[email protected]>
Signed-off-by: Lehua Ding <[email protected]>
…nv (vllm-project#25887) Signed-off-by: Lehua Ding <[email protected]> Signed-off-by: Lehua Ding <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]>
…nv (vllm-project#25887) Signed-off-by: Lehua Ding <[email protected]> Signed-off-by: Lehua Ding <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]> Signed-off-by: Alberto Perdomo <[email protected]>
…nv (vllm-project#25887) Signed-off-by: Lehua Ding <[email protected]> Signed-off-by: Lehua Ding <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…nv (vllm-project#25887) Signed-off-by: Lehua Ding <[email protected]> Signed-off-by: Lehua Ding <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]> Signed-off-by: 0xrushi <[email protected]>
When launching in a multi-node environment (e.g., TP16), the ParallelConfig automatically selects ray as the distributed_executor_backend. However, when async scheduling is enabled, the default value of distributed_executor_backend is prematurely set to mp, causing a launch failure like the one below. This fix moves the check to after the backend is auto-selected.
Currently, async scheduling (primarily the full-overlap feature) does not support Ray as a backend (see the error below). Support for this can be added in a future PR.
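The ordering described above can be illustrated with a minimal sketch. This is not vLLM's actual code: the class SketchParallelConfig, its fields world_size and gpus_per_node, and the method resolve_backend are hypothetical names chosen for illustration, and the multi-node test (world size larger than the GPUs on one node) is an assumption. It only shows the idea of auto-selecting the backend first and validating async scheduling afterwards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchParallelConfig:
    """Hypothetical stand-in for a parallel config; NOT vLLM's real class."""

    world_size: int
    gpus_per_node: int  # assumed field, used only to detect multi-node runs
    async_scheduling: bool = False
    distributed_executor_backend: Optional[str] = None

    def resolve_backend(self) -> str:
        # Step 1: auto-select a backend only if the user did not pick one.
        # Assumption: multi-node means more ranks than GPUs on one node.
        if self.distributed_executor_backend is None:
            multi_node = self.world_size > self.gpus_per_node
            self.distributed_executor_backend = "ray" if multi_node else "mp"

        # Step 2: validate async scheduling AFTER the backend is known,
        # instead of forcing "mp" before auto-selection has run.
        if self.async_scheduling and self.distributed_executor_backend == "ray":
            raise ValueError(
                "async scheduling does not support the ray backend "
                "(illustrative message, not vLLM's actual error text)"
            )
        return self.distributed_executor_backend


if __name__ == "__main__":
    single = SketchParallelConfig(world_size=8, gpus_per_node=8,
                                  async_scheduling=True)
    print(single.resolve_backend())

    multi = SketchParallelConfig(world_size=16, gpus_per_node=8)
    print(multi.resolve_backend())
```

With this ordering, a single-node run with async scheduling resolves to mp, a multi-node run without async scheduling resolves to ray, and a multi-node run with async scheduling fails with an explicit message about the unsupported combination rather than at launch.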