Member

@njhill njhill commented Dec 3, 2025

Currently, when a request is cancelled while its final step is executing, "completion" of that request is still handled by normal stop processing (e.g. length or stop token), so the abort has essentially no effect.

This is typically not a problem, since the final output would be ignored/discarded in this case anyway. When a KV connector is involved, however, it means that the connector will think the request completed successfully rather than being aborted.

This has turned out to be problematic for disaggregated prefill, which frees the KV cache blocks if the request was aborted but not if it thinks the request completed successfully. Since the top-level request was cancelled, it will never be sent to the decode side, so the KV cache blocks remain pinned unnecessarily until the fallback timeout expires.

The problem is exacerbated when a large number of requests are cancelled and/or there are large prefills whose forward pass takes a long time (since the window for this to occur is bigger).


This PR fixes the problem by explicitly processing any pending aborts immediately before processing the model output in each step. We process only the aborts and not new requests, since for latency reasons it's still preferable to process the model outputs before new incoming requests.
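To make the race and the fix concrete, here is a minimal toy sketch. It is not the actual engine core code: `ToyEngineCore`, `ToyConnector`, `eager_aborts`, and `aborts_during_forward` are hypothetical names invented for illustration, and request IDs stand in for the pinned KV cache blocks.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum, auto


class FinishReason(Enum):
    LENGTH = auto()
    ABORT = auto()


@dataclass
class Request:
    req_id: str
    max_tokens: int
    num_generated: int = 0


class ToyConnector:
    """Hypothetical stand-in for a prefill-side KV connector."""

    def __init__(self):
        # Request IDs whose blocks are held for a decode worker to fetch.
        self.pinned = set()

    def request_finished(self, req_id, reason):
        if reason is FinishReason.ABORT:
            self.pinned.discard(req_id)  # nobody will fetch them: free now
        else:
            self.pinned.add(req_id)  # held until fetched or a timeout fires


class ToyEngineCore:
    def __init__(self, eager_aborts):
        self.eager_aborts = eager_aborts  # True models the behaviour in this PR
        self.running = {}
        self.pending_aborts = deque()
        self.connector = ToyConnector()

    def add_request(self, req):
        self.running[req.req_id] = req

    def _process_aborts(self):
        while self.pending_aborts:
            req_id = self.pending_aborts.popleft()
            # An abort for a request that already finished is a no-op.
            if self.running.pop(req_id, None) is not None:
                self.connector.request_finished(req_id, FinishReason.ABORT)

    def step(self, aborts_during_forward=()):
        self._process_aborts()  # normal input handling at the start of a step
        scheduled = list(self.running.values())
        # The forward pass runs here; cancellations may arrive meanwhile.
        self.pending_aborts.extend(aborts_during_forward)
        if self.eager_aborts:
            self._process_aborts()  # aborts only, not new requests
        for req in scheduled:
            if req.req_id not in self.running:
                continue  # aborted: its output is discarded
            req.num_generated += 1
            if req.num_generated >= req.max_tokens:
                del self.running[req.req_id]
                self.connector.request_finished(req.req_id, FinishReason.LENGTH)


def run(eager_aborts):
    core = ToyEngineCore(eager_aborts)
    core.add_request(Request("r1", max_tokens=1))
    core.step(aborts_during_forward=["r1"])  # cancelled during its final step
    core.step()  # the late abort finds nothing left to abort
    return core.connector.pinned
```

In this toy model, `run(False)` leaves `"r1"` pinned because the request finishes with a length stop before the queued abort is seen, while `run(True)` reports the abort to the connector first and nothing stays pinned.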

Fixes #26400.

@robertgshaw2-redhat
Collaborator

Could you provide a more detailed explanation of what was happening before and why this fixes it?

This is pretty complicated logic, so I think it will be valuable for posterity.

Signed-off-by: Nick Hill <[email protected]>
@njhill
Member Author

njhill commented Dec 3, 2025

@robertgshaw2-redhat I've now added some explanations.

Signed-off-by: Nick Hill <[email protected]>
@njhill njhill marked this pull request as ready for review December 3, 2025 20:02
@njhill njhill changed the title [BugFix] Eagerly abort final-step requests [BugFix] Eagerly abort cancelled final-step requests Dec 3, 2025
Development

Successfully merging this pull request may close these issues.

[Engine Core] Process pending requests in-between model execution and update_from_outputs
