Fix issue with async_scheduling when dealing with chunked input #360
Conversation
Signed-off-by: Tianmu Li <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Signed-off-by: Tianmu Li <[email protected]>
Force-pushed from e2cc7ce to f075944.
if structured_output or self.use_async_scheduling:
    logits_append = torch.tensor([torch.sum(prompt_len) - 1],
                                 device=token_ids.device,
                                 dtype=torch.int32)
I didn't get this part: why torch.sum(prompt_len) - 1 instead of something like len(req_id) - logits_indices.shape[0]?
Why is logits_indices shorter? Because we skipped num_decodes, right? And why is the padding appended after rather than before?
This is for a chunked prompt, which shouldn't generate a new token yet. In gpu_model_runner, a new token still gets generated for the incomplete prompt and then gets discarded. This is to align with that behavior.
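A minimal sketch of that generate-then-discard flow, assuming a packed prompt layout; the names sampled_tokens and invalid_req_indices here are illustrative, not the plugin's actual variables:

import torch

# Hypothetical prefill batch: two complete prompts plus one chunked prompt at the end.
prompt_len = torch.tensor([4, 3, 5], dtype=torch.int32)  # last prompt is incomplete
sampled_tokens = torch.tensor([101, 102, 103])            # one token sampled per prompt

# The chunked prompt still gets a token sampled (mirroring gpu_model_runner),
# but its batch index is recorded so the token can be discarded afterwards.
invalid_req_indices = [prompt_len.shape[0] - 1]

valid_tokens = [tok for i, tok in enumerate(sampled_tokens.tolist())
                if i not in invalid_req_indices]
# valid_tokens == [101, 102]; the chunked prompt's token was dropped.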
This depends on the fact that there can only be one incomplete prompt, and that prompt is always the last one if it exists.
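As a rough illustration of the index math, a sketch assuming prompt tokens are packed contiguously and prompt_len holds the per-prompt token counts, so the incomplete prompt (if any) sits at the end:

import torch

prompt_len = torch.tensor([4, 3, 5], dtype=torch.int32)  # last entry is the chunked prompt
# Flat position of the last token currently present for the last prompt:
last_token_pos = torch.sum(prompt_len) - 1                # 4 + 3 + 5 - 1 == 11
# Appending this single index yields one extra logit for the chunked prompt,
# matching the token that gpu_model_runner would generate and later discard.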
if self.use_async_scheduling:
    # Discard partial prefill logit for async scheduling
    # Depends on 1 decode token/batch
    invalid_req_indices.append(num_decodes + idx)
Maybe do something like this, so it will be easier to understand?
prefill_start_idx = num_decodes
invalid_req_indices.append(prefill_start_idx + idx)
Added clarification.
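A short sketch of how the clarified indexing could read in context (hypothetical variable names; assumes decode requests precede prefill requests in the batch, with one token per decode, per the code comment):

num_decodes = 2                            # decode requests come first in the batch
prefill_is_partial = [False, False, True]  # third prefill is a chunked prompt

invalid_req_indices = []
for idx, is_partial in enumerate(prefill_is_partial):
    if is_partial:
        # Prefill requests start right after the decode requests,
        # so the partial prefill's batch index is num_decodes + idx.
        prefill_start_idx = num_decodes
        invalid_req_indices.append(prefill_start_idx + idx)
# invalid_req_indices == [4]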
Signed-off-by: Tianmu Li <[email protected]>
/run-gaudi-tests
/run-gaudi-tests
✅ CI Passed: All checks passed successfully against the following vllm commit:
Cherry-pick of #359