[Bugfix] Free requests to avoid a KV Cache exhaustion during VLLM_NIXL_ABORT_REQUEST_TIMEOUT #29906
Conversation
…L_ABORT_REQUEST_TIMEOUT Signed-off-by: Mathis Felardos <[email protected]>
Code Review
This pull request addresses a bug where requests are not freed correctly upon timeout, potentially leading to KV cache exhaustion. The fix introduces logic to forcefully abort requests that are already finished but might be subject to a delayed free by the KV connector. A new test case is added to validate this behavior. My review focuses on improving the robustness of this fix. I've suggested expanding the logic to cover all relevant 'properly finished' states, not just FINISHED_LENGTH_CAPPED, to prevent similar bugs with other completion statuses. I've also recommended parameterizing the new test to ensure comprehensive coverage for the enhanced logic.
elif request.is_finished():
    if (
        should_force_abort
        and request.status == RequestStatus.FINISHED_LENGTH_CAPPED
The current logic to force an abort on a request that might be subject to delayed free only considers requests with the status FINISHED_LENGTH_CAPPED. However, a request that is FINISHED_STOPPED could also be considered 'finished properly' and be eligible for delayed free by a KV connector. To make the fix more robust and cover all such cases, it would be better to handle both FINISHED_LENGTH_CAPPED and FINISHED_STOPPED statuses. This would also require updating the new test case to cover this additional status.
-        and request.status == RequestStatus.FINISHED_LENGTH_CAPPED
+        and request.status in (RequestStatus.FINISHED_LENGTH_CAPPED, RequestStatus.FINISHED_STOPPED)
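The reviewer's suggestion can be generalized by collecting the "properly finished" statuses in one place, so a membership test replaces the single-status equality check. A minimal, self-contained sketch (the `RequestStatus` enum below is a stand-in for vLLM's real class, with illustrative members only):

```python
from enum import Enum, auto


class RequestStatus(Enum):
    # stand-in for vLLM's RequestStatus; members are illustrative only
    RUNNING = auto()
    FINISHED_STOPPED = auto()
    FINISHED_LENGTH_CAPPED = auto()
    FINISHED_ABORTED = auto()


# statuses eligible for a delayed free by a KV connector
PROPERLY_FINISHED = frozenset(
    {RequestStatus.FINISHED_LENGTH_CAPPED, RequestStatus.FINISHED_STOPPED}
)


def should_force_abort_finished(status: RequestStatus, force: bool) -> bool:
    """Membership test replacing `status == FINISHED_LENGTH_CAPPED`."""
    return force and status in PROPERLY_FINISHED
```

Keeping the set in one constant means a new "properly finished" status only has to be added in one place rather than at every call site.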
We talked offline about this; it's not really a case for P (it outputs just one token).
def test_abort_request_with_kv_connector():
    # `use_kv_connector=True` will expose a kv_connector to the scheduler, but
    # we will need to mimic the delayed free since the default kv_connector is
    # too simple
    scheduler = create_scheduler(use_kv_connector=True)
    requests = create_requests(num_requests=10)
    for request in requests:
        scheduler.add_request(request)

    with patch.object(
        scheduler,
        "_connector_finished",
        side_effect=lambda req: (
            req.status == RequestStatus.FINISHED_LENGTH_CAPPED,
            {"fake_kv_params": False},
        ),
    ):
        for i, request in enumerate(requests):
            scheduler.finish_requests(
                request.request_id, RequestStatus.FINISHED_LENGTH_CAPPED
            )
            assert request.request_id in scheduler.requests  # since delayed
            assert len(scheduler.waiting) == 9 - i

        assert not scheduler.waiting and not scheduler.running
        assert len(scheduler.requests) == 10

        for i, request in enumerate(requests):
            scheduler.finish_requests(
                request.request_id, RequestStatus.FINISHED_ABORTED
            )
            assert request.request_id not in scheduler.requests  # since aborted

        assert not scheduler.waiting and not scheduler.running
        assert not scheduler.requests
This is a great test for the new force-abort logic. To align with the suggestion to handle all 'properly finished' statuses in scheduler.py, I recommend parameterizing this test to run for both RequestStatus.FINISHED_LENGTH_CAPPED and RequestStatus.FINISHED_STOPPED. This will ensure the fix is robust and covers all scenarios where a request might be subject to delayed free.
Here's an example of how you could structure the parameterized test:
@pytest.mark.parametrize(
    "finish_status",
    [RequestStatus.FINISHED_LENGTH_CAPPED, RequestStatus.FINISHED_STOPPED],
)
def test_abort_request_with_kv_connector(finish_status):
    # ... setup ...
    with patch.object(
        scheduler,
        "_connector_finished",
        side_effect=lambda req: (
            req.status == finish_status,
            {"fake_kv_params": False},
        ),
    ):
        # ... first loop ...
        scheduler.finish_requests(
            request.request_id, finish_status
        )
        # ... assertions ...
    # ... second loop ...
# this is only required if we have a kv connector
should_force_abort = (
    finished_status == RequestStatus.FINISHED_ABORTED
    and self.get_kv_connector() is not None
)
forced_aborted_requests = []
Have you tried setting the timeout to a much lower value? If I understand correctly, this fix frees requests as soon as possible, which should have a similar effect. Not sure this is optimal in the case where D is hogged as well.
Another todo we have is to propagate the deadline from P to D to make smaller timeouts "safe". Otherwise D could pull invalid blocks in this case. This should be a pretty simple change
Thanks for reporting @hasB4K. I think you are hitting the issue described in #26400: if request cancellations happen during a forward pass, any requests that complete normally following that forward pass won't be handled as aborts (which will be all requests in the disagg prefill case, apart from chunks). I can see the change in this PR is attempting to address this, but I don't think it will actually work, since the problem is in the scheduler. I have in mind what's needed to fix this, and will aim to open a PR for it today.
Here you go, not yet tested :) #29987
Closing in favor of Nick's PR.
Purpose
We encountered a "bug" with P/D disaggregation in our pre-prod env a while ago. Here is what we think happened:
Some requests were canceled on Prefill by our router/proxy because of a timeout (let's say 20s here). Usually a timeout stops/aborts a request entirely, which is what we want. But with P/D disagg you can be unlucky: if many requests take around 20s to prefill at the same time, requests aborted right around completion are only marked for a delayed free by the KV connector, and their blocks can pile up and exhaust the KV cache.
This patch forces the removal of a delayed request when an abort happens.
(cc @NickLucche)
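Conceptually, the fix amounts to the following toy model (a much-simplified, hypothetical stand-in for vLLM's scheduler, not its actual code): a finished request whose free was deferred for the KV connector is still dropped immediately when an abort arrives, so its KV cache blocks cannot be held until the timeout.

```python
class ToyScheduler:
    """Hypothetical sketch of the delayed-free / force-abort interaction."""

    def __init__(self) -> None:
        self.requests: dict[str, str] = {}   # request_id -> status
        self.delayed_free: set[str] = set()  # waiting on the KV connector

    def finish(self, req_id: str, connector_delays: bool) -> None:
        self.requests[req_id] = "finished"
        if connector_delays:
            # the KV connector still owns the blocks; defer the free
            self.delayed_free.add(req_id)
        else:
            del self.requests[req_id]

    def abort(self, req_id: str) -> None:
        # the fix: an abort frees the request even if a delayed free was
        # pending, instead of waiting for VLLM_NIXL_ABORT_REQUEST_TIMEOUT
        self.delayed_free.discard(req_id)
        self.requests.pop(req_id, None)
```

Without the `abort` path clearing `delayed_free`, aborted-but-delayed requests would accumulate until the timeout fired, which is the exhaustion scenario described above.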