-
-
Notifications
You must be signed in to change notification settings - Fork 11.9k
[Bugfix] Free requests to avoid a KV Cache exhaustion during VLLM_NIXL_ABORT_REQUEST_TIMEOUT #29906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -1326,11 +1326,31 @@ def finish_requests( | |||||
| waiting_requests_to_remove = [] | ||||||
| valid_requests = [] | ||||||
|
|
||||||
| # this is only required only if we have a kv connector | ||||||
| should_force_abort = ( | ||||||
| finished_status == RequestStatus.FINISHED_ABORTED | ||||||
| and self.get_kv_connector() is not None | ||||||
| ) | ||||||
| forced_aborted_requests = [] | ||||||
|
Comment on lines
+1329
to
+1334
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. have you tried to set the timeout to a much lower value? if I understand correctly, this fix is freeing requests asap, which should be similar. Not sure if this is optimal in the case D is hogged as well
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another todo we have is to propagate the deadline from P to D to make smaller timeouts "safe". Otherwise D could pull invalid blocks in this case. This should be a pretty simple change |
||||||
|
|
||||||
| # First pass: collect requests to remove from queues | ||||||
| for req_id in request_ids: | ||||||
| request = self.requests.get(req_id) | ||||||
| if request is None or request.is_finished(): | ||||||
| # Invalid request ID. | ||||||
| if request is None: | ||||||
| continue # Invalid request ID. | ||||||
| elif request.is_finished(): | ||||||
| if ( | ||||||
| should_force_abort | ||||||
| and request.status == RequestStatus.FINISHED_LENGTH_CAPPED | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The current logic to force an abort on a request that might be subject to delayed free only considers requests with the status
Suggested change
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. talked offline about this, not really a case for P (it outputs just one token) |
||||||
| ): | ||||||
| # we need to force the status to FINISHED_ABORTED to avoid | ||||||
| # the request being delayed freed. The kv_connector will | ||||||
| # delay the free if it the status is FINISHED_LENGTH_CAPPED | ||||||
| logger.info( | ||||||
| "Request %s is finished but will get forced aborted.", | ||||||
| req_id, | ||||||
| ) | ||||||
| forced_aborted_requests.append(request) | ||||||
| continue | ||||||
|
|
||||||
| valid_requests.append(request) | ||||||
|
|
@@ -1350,6 +1370,11 @@ def finish_requests( | |||||
| request.status = finished_status | ||||||
| self._free_request(request) | ||||||
|
|
||||||
| # Free the requests that are being delayed | ||||||
| for request in forced_aborted_requests: | ||||||
| request.status = finished_status | ||||||
| self._free_request(request) | ||||||
|
|
||||||
| def _free_request(self, request: Request) -> dict[str, Any] | None: | ||||||
| assert request.is_finished() | ||||||
|
|
||||||
|
|
||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great test for the new force-abort logic. To align with the suggestion to handle all 'properly finished' statuses in
scheduler.py, I recommend parameterizing this test to run for bothRequestStatus.FINISHED_LENGTH_CAPPEDandRequestStatus.FINISHED_STOPPED. This will ensure the fix is robust and covers all scenarios where a request might be subject to delayed free.Here's an example of how you could structure the parameterized test: