
Conversation

tianmu-li
Contributor

Cherry-pick of #359


github-actions bot commented Oct 8, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

@tianmu-li changed the title from "[WIP] Fix issue with async_scheduling when dealing with chunked input" to "Fix issue with async_scheduling when dealing with chunked input" on Oct 8, 2025
@tianmu-li marked this pull request as ready for review on October 8, 2025 18:44
@xuechendi closed this on Oct 8, 2025
@xuechendi reopened this on Oct 8, 2025
Signed-off-by: Tianmu Li <[email protected]>
@tianmu-li force-pushed the async_scheduling_chunk_fix_main branch from e2cc7ce to f075944 on October 8, 2025 19:34
```python
if structured_output or self.use_async_scheduling:
    logits_append = torch.tensor([torch.sum(prompt_len) - 1],
                                 device=token_ids.device,
                                 dtype=torch.int32)
```
Collaborator

I didn't get this part. Why torch.sum(prompt_len) - 1 instead of something like len(req_id) - logits_indices.shape[0]?

Collaborator

Why is logits_indices shorter? Because we skipped num_decodes, right? And why is the padding appended after rather than before?

Contributor (Author)

This is for a chunked prompt, which shouldn't generate a new token yet. In gpu_model_runner, a new token still gets generated for the incomplete prompt and is then discarded. This change aligns with that behavior.

Contributor (Author)

This depends on the fact that there can only be one incomplete prompt, and that prompt is always the last one if it exists.
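
A minimal illustration of the index computed above (a sketch only, not the actual runner code; the batch layout and the example values are assumptions based on this thread, where the only incomplete prompt is always the last one):

```python
import torch

# Assumed layout for this step: the prompt tokens of all prefill requests are
# flattened into one sequence, and the only incomplete (chunked) prompt,
# if any, is the last one.
prompt_len = torch.tensor([4, 3, 5], dtype=torch.int32)  # tokens scheduled per prompt
token_ids = torch.arange(int(prompt_len.sum()))          # flattened prompt tokens

# Append one logits index for the trailing chunked prompt so a token is still
# produced (and later discarded), mirroring gpu_model_runner's behavior.
logits_append = torch.tensor([torch.sum(prompt_len) - 1],
                             device=token_ids.device,
                             dtype=torch.int32)

print(logits_append)  # tensor([11]): position of the last scheduled token
```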

```python
if self.use_async_scheduling:
    # Discard partial prefill logit for async scheduling
    # Depends on 1 decode token/batch
    invalid_req_indices.append(num_decodes + idx)
```
Collaborator

Maybe do something like this, so it will be easier to understand?

```python
prefill_start_idx = num_decodes
invalid_req_indices.append(prefill_start_idx + idx)
```

Contributor (Author)

Added clarification.
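
To make the discussion concrete, here is a hypothetical sketch of the discard path described in this thread (only num_decodes and invalid_req_indices come from the snippet above; sampled_token_ids, prefill_is_partial, and the loop are illustrative assumptions, not the actual runner code):

```python
import torch

num_decodes = 2                            # decode requests come first in the batch
prefill_is_partial = [False, False, True]  # at most one partial prefill, always last

invalid_req_indices = []
for idx, partial in enumerate(prefill_is_partial):
    if partial:
        # Batch index of this prefill = num_decodes (start of the prefills) + idx
        invalid_req_indices.append(num_decodes + idx)

# One token is sampled per request in this step (2 decodes + 3 prefills).
num_reqs = num_decodes + len(prefill_is_partial)
sampled_token_ids = torch.randint(0, 32000, (num_reqs,))

# Drop the token sampled for the chunked prompt; a real token will be sampled
# once the remaining chunks of that prompt have been prefilled.
keep = [i for i in range(num_reqs) if i not in invalid_req_indices]
valid_sampled = sampled_token_ids[keep]

print(invalid_req_indices)  # [4] -> the last request, i.e. the partial prefill
```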

@michalkuligowski
Collaborator

/run-gaudi-tests

@michalkuligowski
Collaborator

/run-gaudi-tests


✅ CI Passed

All checks passed successfully against the following vllm commit:
e39dc46f8fe61803032a5f51ba76f8fa03ba0b41
