[Bugfix] Free requests to avoid a KV Cache exhaustion during VLLM_NIXL_ABORT_REQUEST_TIMEOUT #29906
Conversation
…L_ABORT_REQUEST_TIMEOUT Signed-off-by: Mathis Felardos <[email protected]>
Code Review
This pull request addresses a bug where requests are not freed correctly upon timeout, potentially leading to KV cache exhaustion. The fix introduces logic to forcefully abort requests that are already finished but might be subject to a delayed free by the KV connector. A new test case is added to validate this behavior. My review focuses on improving the robustness of this fix. I've suggested expanding the logic to cover all relevant 'properly finished' states, not just FINISHED_LENGTH_CAPPED, to prevent similar bugs with other completion statuses. I've also recommended parameterizing the new test to ensure comprehensive coverage for the enhanced logic.
elif request.is_finished():
    if (
        should_force_abort
        and request.status == RequestStatus.FINISHED_LENGTH_CAPPED
The current logic to force an abort on a request that might be subject to delayed free only considers requests with the status FINISHED_LENGTH_CAPPED. However, a request that is FINISHED_STOPPED could also be considered 'finished properly' and be eligible for delayed free by a KV connector. To make the fix more robust and cover all such cases, it would be better to handle both FINISHED_LENGTH_CAPPED and FINISHED_STOPPED statuses. This would also require updating the new test case to cover this additional status.
-        and request.status == RequestStatus.FINISHED_LENGTH_CAPPED
+        and request.status in (RequestStatus.FINISHED_LENGTH_CAPPED, RequestStatus.FINISHED_STOPPED)
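The reviewer's suggestion can be generalized by collecting the "properly finished" statuses in one place, so a membership test replaces the single-status equality check. A minimal, self-contained sketch (the `RequestStatus` enum below is a stand-in for vLLM's real class, with illustrative members only):

```python
from enum import Enum, auto


class RequestStatus(Enum):
    # stand-in for vLLM's RequestStatus; members are illustrative only
    RUNNING = auto()
    FINISHED_STOPPED = auto()
    FINISHED_LENGTH_CAPPED = auto()
    FINISHED_ABORTED = auto()


# statuses eligible for a delayed free by a KV connector
PROPERLY_FINISHED = frozenset(
    {RequestStatus.FINISHED_LENGTH_CAPPED, RequestStatus.FINISHED_STOPPED}
)


def should_force_abort_finished(status: RequestStatus, force: bool) -> bool:
    """Membership test replacing `status == FINISHED_LENGTH_CAPPED`."""
    return force and status in PROPERLY_FINISHED
```

Keeping the set in one constant means a new "properly finished" status only has to be added in one place rather than at every call site.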
We talked offline about this; it's not really a case for P (it outputs just one token).
def test_abort_request_with_kv_connector():
    # `use_kv_connector=True` will expose a kv_connector to the scheduler, but
    # we will need to mimic the delayed free since the default kv_connector is
    # too simple
    scheduler = create_scheduler(use_kv_connector=True)
    requests = create_requests(num_requests=10)
    for request in requests:
        scheduler.add_request(request)

    with patch.object(
        scheduler,
        "_connector_finished",
        side_effect=lambda req: (
            req.status == RequestStatus.FINISHED_LENGTH_CAPPED,
            {"fake_kv_params": False},
        ),
    ):
        for i, request in enumerate(requests):
            scheduler.finish_requests(
                request.request_id, RequestStatus.FINISHED_LENGTH_CAPPED
            )
            assert request.request_id in scheduler.requests  # since delayed
            assert len(scheduler.waiting) == 9 - i

        assert not scheduler.waiting and not scheduler.running
        assert len(scheduler.requests) == 10

        for i, request in enumerate(requests):
            scheduler.finish_requests(
                request.request_id, RequestStatus.FINISHED_ABORTED
            )
            assert request.request_id not in scheduler.requests  # since aborted

        assert not scheduler.waiting and not scheduler.running
        assert not scheduler.requests
This is a great test for the new force-abort logic. To align with the suggestion to handle all 'properly finished' statuses in scheduler.py, I recommend parameterizing this test to run for both RequestStatus.FINISHED_LENGTH_CAPPED and RequestStatus.FINISHED_STOPPED. This will ensure the fix is robust and covers all scenarios where a request might be subject to delayed free.
Here's an example of how you could structure the parameterized test:
@pytest.mark.parametrize(
    "finish_status",
    [RequestStatus.FINISHED_LENGTH_CAPPED, RequestStatus.FINISHED_STOPPED],
)
def test_abort_request_with_kv_connector(finish_status):
    # ... setup ...
    with patch.object(
        scheduler,
        "_connector_finished",
        side_effect=lambda req: (
            req.status == finish_status,
            {"fake_kv_params": False},
        ),
    ):
        # ... first loop ...
        scheduler.finish_requests(
            request.request_id, finish_status
        )
        # ... assertions ...
    # ... second loop ...
# this is only required if we have a kv connector
should_force_abort = (
    finished_status == RequestStatus.FINISHED_ABORTED
    and self.get_kv_connector() is not None
)
forced_aborted_requests = []
Have you tried setting the timeout to a much lower value? If I understand correctly, this fix frees requests as soon as possible, which should have a similar effect. Not sure this is optimal in the case where D is hogged as well.
Another todo we have is to propagate the deadline from P to D to make smaller timeouts "safe". Otherwise D could pull invalid blocks in this case. This should be a pretty simple change
Thanks for reporting @hasB4K. I think you are hitting the issue described in #26400: if request cancellations happen during a forward pass, any requests that complete normally following that forward pass won't be handled as aborts (which will be all requests in the disagg prefill case, apart from chunks). I can see the change in this PR is attempting to address this, but I don't think it will actually work, since the problem is in the scheduler. I have in mind what's needed to fix this, and will aim to open a PR for it today.
Here you go, not yet tested :) #29987
Closing in favor of Nick's PR.
Purpose
We encountered a "bug" with P/D disaggregation in our pre-prod env a while ago. Here is what we think happened:
Some requests were canceled on Prefill by our router/proxy because of a timeout (let's say 20s here). Usually a timeout stops/aborts a request entirely, which is what we want. But with P/D disagg you can be unlucky: if many requests take around 20s to prefill at the same time, requests aborted right around completion are only marked for a delayed free by the KV connector, and their blocks can pile up and exhaust the KV cache.
This patch forces the removal of a delayed request when an abort happens.
(cc @NickLucche)
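Conceptually, the fix amounts to the following toy model (a much-simplified, hypothetical stand-in for vLLM's scheduler, not its actual code): a finished request whose free was deferred for the KV connector is still dropped immediately when an abort arrives, so its KV cache blocks cannot be held until the timeout.

```python
class ToyScheduler:
    """Hypothetical sketch of the delayed-free / force-abort interaction."""

    def __init__(self) -> None:
        self.requests: dict[str, str] = {}   # request_id -> status
        self.delayed_free: set[str] = set()  # waiting on the KV connector

    def finish(self, req_id: str, connector_delays: bool) -> None:
        self.requests[req_id] = "finished"
        if connector_delays:
            # the KV connector still owns the blocks; defer the free
            self.delayed_free.add(req_id)
        else:
            del self.requests[req_id]

    def abort(self, req_id: str) -> None:
        # the fix: an abort frees the request even if a delayed free was
        # pending, instead of waiting for VLLM_NIXL_ABORT_REQUEST_TIMEOUT
        self.delayed_free.discard(req_id)
        self.requests.pop(req_id, None)
```

Without the `abort` path clearing `delayed_free`, aborted-but-delayed requests would accumulate until the timeout fired, which is the exhaustion scenario described above.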