
Conversation

joerunde (Collaborator) commented Oct 3, 2025

Description

Adds a (failing) test that shows full-context requests blocking scheduling.

It's unclear whether this is correct behavior forced by the homogeneous TKV constraints or not.

github-actions bot commented Oct 3, 2025

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀


# Add a full batch of requests to the engine
# Requests must be full-length: prompt_len + max_tokens = max_model_len
vllm_sampling_params = SamplingParams(max_tokens=168,
Collaborator commented:

Suggested change:
- vllm_sampling_params = SamplingParams(max_tokens=168,
+ vllm_sampling_params = SamplingParams(max_tokens=192,

max_tokens needs to be a multiple of the block size (64) for the test to work!

What is currently happening:

  • max_tokens = 168
  • prompt length = 256 - 168 = 88
  • the prompt gets padded from 88 up to 128, the next multiple of the block size
  • so the total sequence length is 128 + 168 = 296 > max model length (256)

That is why the 2nd through 4th requests cannot be scheduled together with the 1st, and likewise the 3rd and 4th cannot be scheduled with the 2nd.
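
To make the arithmetic concrete, here is a minimal standalone sketch, assuming the block size of 64 and max model length of 256 from this test (the helper name padded_seq_len is illustrative, not from the codebase):

import math

BLOCK_SIZE = 64      # block size assumed from the discussion above
MAX_MODEL_LEN = 256  # max model length assumed from the discussion above

def padded_seq_len(max_tokens: int) -> int:
    # The test builds full-length requests: prompt_len + max_tokens = max_model_len.
    prompt_len = MAX_MODEL_LEN - max_tokens
    # Prompts are padded up to the next multiple of the block size.
    padded_prompt = math.ceil(prompt_len / BLOCK_SIZE) * BLOCK_SIZE
    return padded_prompt + max_tokens

print(padded_seq_len(168))  # 296 > 256: the padded request overflows the context window
print(padded_seq_len(192))  # 256 == 256: fits exactly, so the full batch can schedule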

Side note: by calling engine_core.add_request directly, we never go through platform.validate_request(), so the too-long requests don't get rejected in the first place. Instead they get scheduled one by one: we always schedule the head of the waiting queue when the batch is empty, because every request is assumed to have already passed validate_request() and therefore to be valid.
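
For illustration, a minimal sketch of the kind of padded-length check that validation would need to enforce; the function name, signature, and error message below are hypothetical, not the actual platform.validate_request() implementation:

def validate_padded_length(prompt_len: int, max_tokens: int,
                           block_size: int = 64,
                           max_model_len: int = 256) -> None:
    # Hypothetical check: reject any request whose *padded* prompt plus
    # max_tokens would overflow the model's context window.
    padded_prompt = ((prompt_len + block_size - 1) // block_size) * block_size
    if padded_prompt + max_tokens > max_model_len:
        raise ValueError(
            f"padded prompt ({padded_prompt}) + max_tokens ({max_tokens}) "
            f"exceeds max_model_len ({max_model_len})")

# With max_tokens=168 the 296-token padded request would be rejected up front
# instead of silently blocking the rest of the batch.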

joerunde (Collaborator, PR author) replied:

Oh, super interesting! I was only running this on CPU, so I also wasn't hitting any compiler errors about running past the end of the sequence.

Are you seeing that if you set max_tokens=192 here then the full batch schedules properly?

Collaborator replied:

yes!
