
Conversation

@hasB4K (Contributor) commented Dec 2, 2025

Purpose

We encountered a "bug" with P/D disaggregation in our pre-prod env a while ago. Here is what we think happened:

Some requests were canceled on the Prefill instance by our router/proxy because of a timeout (let's say 20s here). Normally a timeout stops/aborts a request entirely (which is what we want). But with P/D disagg you can be unlucky if a lot of requests take roughly 20s to prefill:

  • At (almost) the same time:
    • the request finishes its prefill in ~20s,
    • the request hits the 20s timeout and an abort is triggered.
  • But (on main) the request cannot be freed, because the NIXL KVConnector considers that requests with FINISHED_LENGTH_CAPPED should have their free delayed (which is what we want in most cases, but not when the request is aborted).
  • => the request now sits in an un-freeable state for 5 minutes (i.e. the default VLLM_NIXL_ABORT_REQUEST_TIMEOUT).
  • The request is retried on another instance and creates the same issue there.
  • With enough such requests you exhaust the available KV cache and start rejecting new requests, which amounts to a temporary denial of service 😕

This patch forces the removal of a delayed request when an abort happens.
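
Roughly, the fix works like this (a simplified sketch of the scheduler change, not the exact diff quoted later in this thread; `_free_request` is assumed to be the scheduler's existing cleanup path): when `finish_requests` is called with FINISHED_ABORTED and a KV connector is configured, a request that already finished with FINISHED_LENGTH_CAPPED and whose free was delayed is freed immediately instead of waiting out VLLM_NIXL_ABORT_REQUEST_TIMEOUT.

def finish_requests(self, request_ids, finished_status):
    # Sketch only: forcing the abort is only relevant when a KV connector
    # may have delayed the free of an already-finished request.
    # RequestStatus is the request-status enum used elsewhere in this thread.
    should_force_abort = (
        finished_status == RequestStatus.FINISHED_ABORTED
        and self.get_kv_connector() is not None
    )
    for req_id in request_ids:
        request = self.requests.get(req_id)
        if request is None:
            continue
        if not request.is_finished():
            # Normal finish/abort handling of still-running requests elided.
            continue
        if (
            should_force_abort
            and request.status == RequestStatus.FINISHED_LENGTH_CAPPED
        ):
            # The prefill completed, but its KV blocks are being held for the
            # decode side; the abort means nobody will pull them, so free the
            # request now instead of waiting out the timeout.
            request.status = RequestStatus.FINISHED_ABORTED
            self._free_request(request)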

(cc @NickLucche)

@gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug where requests are not freed correctly upon timeout, potentially leading to KV cache exhaustion. The fix introduces logic to forcefully abort requests that are already finished but might be subject to a delayed free by the KV connector. A new test case is added to validate this behavior. My review focuses on improving the robustness of this fix. I've suggested expanding the logic to cover all relevant 'properly finished' states, not just FINISHED_LENGTH_CAPPED, to prevent similar bugs with other completion statuses. I've also recommended parameterizing the new test to ensure comprehensive coverage for the enhanced logic.

elif request.is_finished():
    if (
        should_force_abort
        and request.status == RequestStatus.FINISHED_LENGTH_CAPPED
@gemini-code-assist bot commented (severity: high)

The current logic to force an abort on a request that might be subject to delayed free only considers requests with the status FINISHED_LENGTH_CAPPED. However, a request that is FINISHED_STOPPED could also be considered 'finished properly' and be eligible for delayed free by a KV connector. To make the fix more robust and cover all such cases, it would be better to handle both FINISHED_LENGTH_CAPPED and FINISHED_STOPPED statuses. This would also require updating the new test case to cover this additional status.

Suggested change:
-        and request.status == RequestStatus.FINISHED_LENGTH_CAPPED
+        and request.status in (RequestStatus.FINISHED_LENGTH_CAPPED, RequestStatus.FINISHED_STOPPED)

A collaborator commented

We talked offline about this; it's not really a case for P (prefill outputs just one token).

Comment on lines +65 to +100
def test_abort_request_with_kv_connector():
    # `use_kv_connector=True` will expose a kv_connector to the scheduler, but
    # we need to mimic the delayed free ourselves since the default
    # kv_connector is too simple
    scheduler = create_scheduler(use_kv_connector=True)
    requests = create_requests(num_requests=10)
    for request in requests:
        scheduler.add_request(request)

    with patch.object(
        scheduler,
        "_connector_finished",
        side_effect=lambda req: (
            req.status == RequestStatus.FINISHED_LENGTH_CAPPED,
            {"fake_kv_params": False},
        ),
    ):
        for i, request in enumerate(requests):
            scheduler.finish_requests(
                request.request_id, RequestStatus.FINISHED_LENGTH_CAPPED
            )
            assert request.request_id in scheduler.requests  # since delayed
            assert len(scheduler.waiting) == 9 - i

        assert not scheduler.waiting and not scheduler.running
        assert len(scheduler.requests) == 10

        for i, request in enumerate(requests):
            scheduler.finish_requests(
                request.request_id, RequestStatus.FINISHED_ABORTED
            )
            assert request.request_id not in scheduler.requests  # since aborted

        assert not scheduler.waiting and not scheduler.running
        assert not scheduler.requests

@gemini-code-assist bot commented (severity: high)

This is a great test for the new force-abort logic. To align with the suggestion to handle all 'properly finished' statuses in scheduler.py, I recommend parameterizing this test to run for both RequestStatus.FINISHED_LENGTH_CAPPED and RequestStatus.FINISHED_STOPPED. This will ensure the fix is robust and covers all scenarios where a request might be subject to delayed free.

Here's an example of how you could structure the parameterized test:

@pytest.mark.parametrize(
    "finish_status",
    [RequestStatus.FINISHED_LENGTH_CAPPED, RequestStatus.FINISHED_STOPPED],
)
def test_abort_request_with_kv_connector(finish_status):
    # ... setup ...
    with patch.object(
        scheduler,
        "_connector_finished",
        side_effect=lambda req: (
            req.status == finish_status,
            {"fake_kv_params": False},
        ),
    ):
        # ... first loop ...
        scheduler.finish_requests(
            request.request_id, finish_status
        )
        # ... assertions ...

        # ... second loop ...

@NickLucche (Collaborator) left a comment

Looking into more structural changes right now. Related to the "passive" solution in #26400 or even an "active" one such as #26635.

The former in particular would explain why this case is happening more frequently than anticipated.

Comment on lines +1329 to +1334
# this is only required if we have a kv connector
should_force_abort = (
    finished_status == RequestStatus.FINISHED_ABORTED
    and self.get_kv_connector() is not None
)
forced_aborted_requests = []
A collaborator commented

Have you tried setting the timeout to a much lower value? If I understand correctly, this fix frees requests ASAP, which should be similar. Not sure this is optimal in the case where D is hogged as well.
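
For reference, that timeout is the VLLM_NIXL_ABORT_REQUEST_TIMEOUT environment variable mentioned in the description. A minimal sketch of lowering it on the prefill instance, assuming the value is in seconds and is read when the engine starts:

import os

# Must be set before the vLLM engine (and its NIXL connector) is created.
# The "30" here is purely illustrative, not a recommended value.
os.environ["VLLM_NIXL_ABORT_REQUEST_TIMEOUT"] = "30"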

A member commented

Another TODO we have is to propagate the deadline from P to D to make smaller timeouts "safe"; otherwise D could pull invalid blocks in this case. This should be a pretty simple change.
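
One way the deadline idea could look (purely illustrative; the "deadline_ts" key and the helpers below are hypothetical, a real version would live in the NIXL connector and would have to handle clock skew between instances): P attaches an absolute deadline to the kv transfer params it returns, and D skips pulling blocks whose deadline has already passed, since P may have freed them.

import time

# Hypothetical sketch of "propagate the deadline from P to D"; assumes the
# P and D instances have roughly synchronized wall clocks.
def attach_deadline(kv_transfer_params: dict, timeout_s: float) -> dict:
    kv_transfer_params["deadline_ts"] = time.time() + timeout_s
    return kv_transfer_params

def deadline_expired(kv_transfer_params: dict) -> bool:
    deadline = kv_transfer_params.get("deadline_ts")
    return deadline is not None and time.time() > deadline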

@njhill (Member) commented Dec 3, 2025

Thanks for reporting @hasB4K. I think you are hitting the issue described in #26400: if request cancellations happen during a forward pass, any requests that complete normally following that forward pass won't be handled as aborts (which, in the disagg prefill case, will be all requests apart from chunks).

I can see the change in this PR is attempting to address this but I don't think it will actually work since the problem is that the scheduler finish_requests method will be called too late (only after the model output from the in-progress step has already been processed). I'm curious whether you were able to verify that this change actually fixes the issue (would be very surprised if it does).

I have in mind what's needed to fix this, will aim to open a PR for it today.

@njhill (Member) commented Dec 3, 2025

Here you go, not yet tested :) #29987

@robertgshaw2-redhat (Collaborator) commented

Closing in favor of Nick's PR.
