[Core] Support async scheduling with uniproc executor #24219
Conversation
Force-pushed 34b91a0 to db4f6c4
In RL training scenarios, inference groups are typically managed externally, so the 'external launcher' is needed.
We implement async_scheduling in the uniproc executor not because it is faster than multiproc, but because RL training needs to use uniproc (the external launcher method).
@Ronald1995 @weijinqian0 I pushed another commit to also support the external launcher executor, maybe you could try it out?
@njhill thanks for the implementation of the external launcher method. I have tested it in my local environment based on your branch and made a small bugfix; both the performance and precision are validated. Would you please cherry-pick my bugfix commit? A minimal sketch of the setup being discussed follows below.
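For context, here is a minimal, hypothetical sketch of the RL-style setup being discussed: one vLLM engine per rank, launched under torchrun so the training framework owns the process group. The exact flag names (`distributed_executor_backend="external_launcher"`, `async_scheduling=True`) are assumptions based on this PR's context and may differ across vLLM versions.

```python
# external_launcher_sketch.py -- hypothetical illustration, not an official example.
# Run with: torchrun --nproc-per-node=4 external_launcher_sketch.py
from vllm import LLM, SamplingParams

# The training framework (not vLLM) launches and manages the processes, so each
# rank constructs its own engine with the external launcher backend. Flag names
# below are assumptions based on the discussion in this PR.
llm = LLM(
    model="Qwen/Qwen3-32B",
    tensor_parallel_size=4,
    distributed_executor_backend="external_launcher",
    async_scheduling=True,  # the capability this PR adds for uniproc/external launcher
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16, temperature=0))
print(outputs[0].outputs[0].text)
```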
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Force-pushed 9b5e75a to cca2fab
Thanks @Ronald1995. From your commit:

    if isinstance(outputs, Exception):
        logger.error("EngineCore step failed with error: %s", outputs)
        raise outputs

I don't think step_fn will ever return an exception. Do you agree, or could you show me why you think this is needed?
Signed-off-by: Ronald1995 <[email protected]> Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
I just saw that the get_output method in SyncMPClient handles the exception, but I think you are right: execute_model_with_error_logging will catch the exception and just raise it, so the output will not be an exception.
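To make that conclusion concrete, here is a minimal self-contained sketch (not vLLM's actual code) of the error-logging wrapper pattern: it logs and re-raises, so the caller either gets real outputs or sees the exception propagate, and a guard like `isinstance(outputs, Exception)` on the return value can never fire.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def execute_with_error_logging(step_fn):
    """Log and re-raise any failure; never return the exception as a value."""
    try:
        return step_fn()
    except Exception as err:
        logger.error("EngineCore step failed with error: %s", err)
        raise

def failing_step():
    raise RuntimeError("boom")

try:
    outputs = execute_with_error_logging(failing_step)
except RuntimeError:
    # The exception propagates to the caller; outputs is never an Exception
    # object, so the isinstance(outputs, Exception) guard is redundant.
    pass
```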
Signed-off-by: Nick Hill <[email protected]>
@Ronald1995 any chance you could try out the latest version of this PR with your use case? I think this is ready apart from extra CI test coverage. |
OK, I will test the latest version in my local environment. Because the GPU resources are occupied by my colleague, I can't test it right now; I will complete this test by tomorrow morning and report the results to you.
@njhill I have tested the latest version in the external_launcher scenario; both performance and precision meet expectations.
This PR is based on top of [#23569](vllm-project/vllm#23569) and [#24219](vllm-project/vllm#24219).

### What this PR does / why we need it?
This PR allows the model runner to function asynchronously when using async scheduling. This allows full overlap of the CPU operations (including prepare_inputs) and the model forward pass. This diff is functional and does not support speculative decoding, PP, or guided decoding. Expected speedup is 5-10% over the current async scheduling.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
server
```
python -m vllm.entrypoints.openai.api_server --model=Qwen3-32B \
    --trust-remote-code --enforce-eager \
    --distributed-executor-backend=mp \
    -tp=4 \
    --port 8006 \
    --max-model-len 32000 \
    --block-size 128 \
    --gpu-memory-utilization 0.99
```
client
```
python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \
    --dataset-name random --random-input-len 2048 --random-output-len 2048 \
    --ignore-eos \
    --num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \
    --metric-percentiles 90 --base-url http://localhost:8006 --save-result \
    --result-dir $PROFILER_DIR
```

Benchmark TPOT result based on Qwen3-32B:

| | forward async | scheduler async | sync |
|-|-|-|-|
| avg | 41.73 | 41.86 | 44.20 |
| improve0 | 0.3% | 0 | 0 |
| improve1 | 5.58% | 0 | 0 |

Benchmark TPOT result based on Qwen2___5-VL-7B-Instruct:

| | forward async | sync |
|-|-|-|
| avg | 23.22 | 29.16 |
| improve | 20.3% | 0 |

- vLLM version: main
- vLLM main: vllm-project/vllm@e93f4cc

Signed-off-by: jiangpeng36 <[email protected]>
Signed-off-by: Ronald1995 <[email protected]>
Co-authored-by: jiangpeng36 <[email protected]>
Co-authored-by: Ronald1995 <[email protected]>
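As an illustration of the overlap described above, here is a small hypothetical sketch (plain Python, not the vllm-ascend implementation): while the previous step's forward pass runs on a worker thread, the CPU prepares the next step's inputs, so the two phases overlap.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def prepare_inputs(step: int) -> str:
    time.sleep(0.01)               # stand-in for CPU-side prepare_inputs
    return f"batch-{step}"

def forward(batch: str) -> str:
    time.sleep(0.05)               # stand-in for the model forward pass
    return f"output-for-{batch}"

with ThreadPoolExecutor(max_workers=1) as worker:
    future = worker.submit(forward, prepare_inputs(0))
    for step in range(1, 4):
        next_batch = prepare_inputs(step)   # overlaps with the in-flight forward pass
        print(future.result())              # join the previous step
        future = worker.submit(forward, next_batch)
    print(future.result())
```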
# Conflicts:
#	vllm/executor/uniproc_executor.py
WoosukKwon left a comment:
LGTM!
…4219)
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Ronald1995 <[email protected]>
Co-authored-by: Ronald1995 <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
…pc() (#3934)

### What does this PR do?
This PR fixes a `TypeError` that occurs when newer versions of vLLM (v0.11+) attempt to call `ExternalZeroMQDistributedExecutor.collective_rpc`. The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument `non_block` to the `Executor.collective_rpc` interface. Since the `verl` implementation of `collective_rpc` did not define this parameter, calling it with `non_block=True` resulted in the error: `TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'`. By using `**extra_kwargs` in the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic.

### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Co-authored-by: weikaiwen <[email protected]>
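A minimal sketch of the fix described above; the class body is a simplified stand-in (the real verl executor dispatches the call over ZeroMQ), and only the `**extra_kwargs` addition to the signature is the point.

```python
from typing import Any, Callable, Optional, Union

class ExternalZeroMQDistributedExecutor:  # simplified stand-in, not verl's full class
    def collective_rpc(
        self,
        method: Union[str, Callable],
        timeout: Optional[float] = None,
        args: tuple = (),
        kwargs: Optional[dict] = None,
        **extra_kwargs: Any,  # absorbs non_block=True from vLLM v0.11+ (and future kwargs)
    ) -> list[Any]:
        # Legacy vLLM passes nothing extra; newer vLLM passes non_block=True.
        # Either way, the existing ZeroMQ dispatch (elided here) is unchanged.
        return []
```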
Follow-on from #23569.
This provides most of the speedup of that PR, around +22% rather than +25%. We could experiment with a slightly more complicated version where the worker runs in a separate thread, but this seems like a good first implementation.
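For readers following along, a hedged sketch of what a non-blocking `collective_rpc` on a uniproc-style executor could look like (illustrative only, not the PR's exact code): submitting the call to a single worker thread keeps execution ordered while returning a Future that the scheduler can overlap with other work.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Union

class UniprocLikeExecutor:
    """Toy single-process executor; names and signature are illustrative assumptions."""

    def __init__(self, worker: Any) -> None:
        self.worker = worker
        self._pool = ThreadPoolExecutor(max_workers=1)  # one thread keeps steps ordered

    def collective_rpc(
        self,
        method: Union[str, Callable],
        args: tuple = (),
        non_block: bool = False,
    ) -> Union[list[Any], "Future[list[Any]]"]:
        fn = getattr(self.worker, method) if isinstance(method, str) else method
        if not non_block:
            return [fn(*args)]                         # synchronous path, as before
        return self._pool.submit(lambda: [fn(*args)])  # non-blocking path, returns a Future
```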