[BugFix] Eagerly abort cancelled final-step requests #29987
Currently, when a request is cancelled while its final step is executing, completion of the request is subsequently handled via normal stop processing (e.g. a length limit or stop token), so the abort effectively has no effect.
This is usually harmless, since the final output would be discarded in this case anyway. When a KV connector is involved, however, the connector believes the request completed successfully rather than being aborted.
This has proven problematic for disaggregated prefill, which frees the KV cache blocks if a request was aborted but not if it believes the request completed successfully. Since the top-level request was cancelled, it is never sent to the decode side, so its KV cache blocks remain pinned unnecessarily until the fallback timeout expires.
The problem is exacerbated when a large number of requests are cancelled at once and/or when large prefills make the forward pass slow, since the window for this race is wider.
This PR fixes the problem by explicitly processing any pending aborts immediately before processing the model output on each step. Only the aborts are processed, not new requests, since for latency reasons it is still preferable to process model outputs before new incoming requests.
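The ordering change can be sketched as follows. This is a minimal illustrative model of the scheduler step, not vLLM's actual implementation; the `Scheduler`, `_process_aborts`, and `step` names and the request-dict shape are all hypothetical, chosen only to show aborts being drained before the model output is interpreted.

```python
class Scheduler:
    """Toy scheduler illustrating eager abort processing (hypothetical API)."""

    def __init__(self):
        self.running = {}          # req_id -> mutable request state
        self.pending_aborts = []   # req_ids cancelled by clients

    def abort(self, req_id):
        # Called asynchronously when a client cancels a request; the
        # abort is queued and drained at the start of the next step.
        self.pending_aborts.append(req_id)

    def _process_aborts(self):
        # Eagerly drop cancelled requests *before* interpreting the model
        # output, so a request cancelled during its final step is recorded
        # as aborted (letting a KV connector free its blocks) rather than
        # falling through to normal stop processing as "completed".
        for req_id in self.pending_aborts:
            req = self.running.pop(req_id, None)
            if req is not None:
                req["status"] = "aborted"
        self.pending_aborts.clear()

    def step(self, model_output):
        # Fixed ordering: pending aborts first, then model output.
        # New incoming requests would still be admitted only afterwards,
        # keeping output-processing latency low.
        self._process_aborts()
        finished = []
        for req_id, token in model_output.items():
            req = self.running.get(req_id)
            if req is None:
                continue  # aborted above; its final output is discarded
            req["tokens"].append(token)
            if token == "<eos>":  # stand-in for normal stop processing
                req["status"] = "completed"
                finished.append(self.running.pop(req_id))
        return finished
```

With this ordering, a request cancelled while its final forward pass is in flight ends the step with status `"aborted"` instead of `"completed"`, which is the observable difference a KV connector relies on.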
Fixes #26400.