
Conversation

joerunde (Collaborator) commented Oct 3, 2025

Description

Adds a (failing) test that shows full-context requests blocking scheduling.

It's unclear whether this is correct behavior forced by the homogeneous TKV constraints or not.

github-actions bot commented Oct 3, 2025

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀


# Add a full batch of requests to the engine
# Requests must be full-length: prompt_len + max_tokens = max_model_len
vllm_sampling_params = SamplingParams(max_tokens=168,
Collaborator commented:

Suggested change:
- vllm_sampling_params = SamplingParams(max_tokens=168,
+ vllm_sampling_params = SamplingParams(max_tokens=192,

max_tokens needs to be a multiple of the block size (64) for the test to work!

What is currently happening:

  • max_tokens = 168
  • prompt length = 256 - 168 = 88
  • the prompt gets padded from 88 up to 128, the next multiple of the block size
  • so the total sequence length is 128 + 168 = 296 > max model length (256)

That is why the 2nd through 4th requests cannot be scheduled together with the 1st, and likewise the 3rd and 4th cannot be scheduled with the 2nd.
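
To make the arithmetic concrete, here is a minimal standalone sketch, assuming the block size of 64 and max model length of 256 from this test (the helper name padded_seq_len is illustrative, not from the codebase):

import math

BLOCK_SIZE = 64      # block size assumed from the discussion above
MAX_MODEL_LEN = 256  # max model length assumed from the discussion above

def padded_seq_len(max_tokens: int) -> int:
    # The test builds full-length requests: prompt_len + max_tokens = max_model_len.
    prompt_len = MAX_MODEL_LEN - max_tokens
    # Prompts are padded up to the next multiple of the block size.
    padded_prompt = math.ceil(prompt_len / BLOCK_SIZE) * BLOCK_SIZE
    return padded_prompt + max_tokens

print(padded_seq_len(168))  # 296 > 256: the padded request overflows the context window
print(padded_seq_len(192))  # 256 == 256: fits exactly, so the full batch can schedule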

Side note: by calling engine_core.add_request directly, we never go through platform.validate_request(), so the too-long requests don't get rejected in the first place. Instead they get scheduled one by one: we always schedule the head of the waiting queue when the batch is empty, because every request is assumed to have already passed validate_request() and therefore to be valid.
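
For illustration, a minimal sketch of the kind of padded-length check that validation would need to enforce; the function name, signature, and error message below are hypothetical, not the actual platform.validate_request() implementation:

def validate_padded_length(prompt_len: int, max_tokens: int,
                           block_size: int = 64,
                           max_model_len: int = 256) -> None:
    # Hypothetical check: reject any request whose *padded* prompt plus
    # max_tokens would overflow the model's context window.
    padded_prompt = ((prompt_len + block_size - 1) // block_size) * block_size
    if padded_prompt + max_tokens > max_model_len:
        raise ValueError(
            f"padded prompt ({padded_prompt}) + max_tokens ({max_tokens}) "
            f"exceeds max_model_len ({max_model_len})")

# With max_tokens=168 the 296-token padded request would be rejected up front
# instead of silently blocking the rest of the batch.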

joerunde (Collaborator, PR author) replied:

Oh, super interesting! I was only running this on CPU, so I also wasn't hitting any compiler errors about running past the end of the sequence.

Are you seeing that if you set max_tokens=192 here then the full batch schedules properly?

Collaborator replied:

yes!
