Conversation

wallashss
Collaborator

Description

This PR is a follow-up to #478.

It updates the remaining tests that use the function check_output_against_hf, replacing it with the new validate_vllm_vs_hf_output, which does "everything": it generates output for both vLLM and HF, uses the HF results as the baseline for golden token injection, and then compares the results of the two systems.
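Roughly, the flow of the new helper looks like this (the sketch below is illustrative; the callable names and signature are assumptions, not the actual implementation in the test utilities):

def validate_vllm_vs_hf_output(prompts, max_new_tokens,
                               hf_generate, vllm_generate, compare,
                               abs_tol=0.1):
    # Illustrative sketch only -- the real helper lives in the shared test utils.
    # 1. HF output is both the correctness baseline and the source of the
    #    "golden" tokens injected into the vLLM run.
    hf_results = hf_generate(prompts, max_new_tokens)
    # 2. Generate with vLLM on Spyre, injecting the HF tokens as golden tokens.
    vllm_results = vllm_generate(prompts, max_new_tokens, golden_tokens=hf_results)
    # 3. Compare the outputs of the two systems within the configured tolerance.
    compare(hf_results, vllm_results, abs_tol=abs_tol)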

I also included these environment variables, to get more flexibility when setting up the tests (a sketch of how they might be consumed follows the list):

  • VLLM_SPYRE_TEST_DISABLE_GOLDEN_TOKEN - disables the golden token injection
  • VLLM_SPYRE_TEST_ABS_TOL - overrides the constant ISCLOSE_ABS_TOL
  • VLLM_SPYRE_TEST_QUANTIZED_ABS_TOL - overrides the constant ISCLOSE_ABS_TOL_QUANTIZATION
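A minimal sketch of how these variables might be consumed by the test utilities (the default values shown are placeholders, not the real constants):

import os

DISABLE_GOLDEN_TOKEN = os.getenv("VLLM_SPYRE_TEST_DISABLE_GOLDEN_TOKEN", "0") == "1"
ISCLOSE_ABS_TOL = float(os.getenv("VLLM_SPYRE_TEST_ABS_TOL", "0.1"))  # placeholder default
ISCLOSE_ABS_TOL_QUANTIZATION = float(
    os.getenv("VLLM_SPYRE_TEST_QUANTIZED_ABS_TOL", "0.2"))  # placeholder default

So a run could, for example, loosen the quantized tolerance with something like VLLM_SPYRE_TEST_QUANTIZED_ABS_TOL=0.3 python -m pytest ....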


github-actions bot commented Oct 7, 2025

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@wallashss
Collaborator Author

bot:test

Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
seqs_max_tokens = [10, 13, 2]
prompts_lengths = [49, 41, 5]
seqs_max_tokens = [33, 36, 2]
prompts_lengths = [49, 41, 32]
Collaborator
Why do these values need to be updated? I think we had tried to keep these pretty small, because when running on CPU it takes quite a while to generate tokens. The CB tests used to take almost 20 minutes on GitHub Actions.

Collaborator Author

oops... that's a good question.

Because of this optimization: #262

Specifically this line:

n_fully_padded_blocks = math.floor(

We have to use longer prompts and move the tkv forward to match the expected number of blocks. That's the only way I found to update this test.
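Very roughly, blocks that consist purely of padding are skipped, so the block counts the test expects depend on the prompt length and the tkv. A toy illustration (assumed block size and a simplified stand-in for the expression in #262, not the exact formula):

import math

# Toy numbers only, to show why prompt length changes the block count.
block_size = 64
tkv = 70                                                    # illustrative decode position
short_prompt_blocks = math.floor((tkv - 5) / block_size)    # prompt length 5  -> 1 fully padded block
long_prompt_blocks = math.floor((tkv - 32) / block_size)    # prompt length 32 -> 0 fully padded blocks

With the short prompt a whole block of padding gets skipped, which shifts the per-step block counts these tests assert on; the longer prompts keep the padding below a full block.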

Collaborator Author

But... maybe I can revert the test changes in this file. We won't run with the full model anyway. What do you think?

Collaborator Author

I did a quick benchmark to verify the impact on the tests, as suggested by @maxdebayser:

I ran this in a dev pod:

python -m pytest -vs --disable-warnings --forked tests/e2e/test_spyre_cb_scheduler_steps.py -m cpu 

Original

========== 34 passed, 68 deselected, 72 warnings in 720.32s (0:12:00) ==========
========== 34 passed, 68 deselected, 72 warnings in 584.42s (0:09:44) ==========
========== 34 passed, 68 deselected, 72 warnings in 516.96s (0:08:36) ==========

Updated

========== 34 passed, 68 deselected, 72 warnings in 538.94s (0:08:58) ==========
========== 34 passed, 68 deselected, 72 warnings in 541.26s (0:09:01) ==========
========== 34 passed, 68 deselected, 72 warnings in 544.73s (0:09:04) ==========

Interesting that one of the original runs took longer than expected, but it looks like that was an outlier.

Collaborator

I'm still very confused 😕

Are you saying that

  1. these tests need to be updated to properly test that optimization, or
  2. that they need to be updated in order to add the golden token injection?

If (1) then we should separate that change into a different PR, if (2) then I don't get why adding GTI would cause the smaller sequences to fail

Collaborator

ah, nice benchmark! For -m cpu we should run without --forked, but even then the cpus on GHA are way slower than the ones on our clusters so we won't really see comparable results

Collaborator

Oh but these tests aren't even using golden token injection? Because we're not using validate_vllm_vs_hf_output?

Collaborator Author

I'm still very confused 😕

Are you saying that

  1. these tests need to be updated to properly test that optimization, or
  2. that they need to be updated in order to add the golden token injection?

If (1) then we should separate that change into a different PR, if (2) then I don't get why adding GTI would cause the smaller sequences to fail

Sorry about that. I think it is more like (1): I needed to update them to have more context so that we get smaller errors with GTI when using the quantized full model.

Oh but these tests aren't even using golden token injection? Because we're not using validate_vllm_vs_hf_output?

They don't use this new method, but they already use GTI (it is already on main). Since they generate using the engine directly to get information step by step, they won't be compatible with validate_vllm_vs_hf_output, which uses the LLM class to generate.

Since there are a lot of caveats with these tests, I am thinking of reverting these changes (the tests already have GTI), and we can plan better later how to actually get more benefit from GTI.
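For context, here is a rough sketch of the two generation paths being contrasted; this is generic vLLM API usage for illustration only, not the actual vllm-spyre test code:

from vllm import LLM, LLMEngine, EngineArgs, SamplingParams

params = SamplingParams(max_tokens=8)

# Path used by validate_vllm_vs_hf_output: the high-level LLM class, which
# only exposes the final outputs.
llm = LLM(model="some/model")  # placeholder model name
outputs = llm.generate(["Hello"], params)

# Path used by the scheduler-step tests: drive an engine directly, so that
# scheduler state (tkv, block usage, waiting/running queues) can be checked
# after every step.
engine = LLMEngine.from_engine_args(EngineArgs(model="some/model"))
engine.add_request("req-0", "Hello", params)
while engine.has_unfinished_requests():
    step_outputs = engine.step()
    # the real tests assert on per-step scheduler state here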

Collaborator

ah, gotcha. Thanks for the detailed explanation!

@wallashss
Collaborator Author

bot:test

wallashss requested a review from joerunde on October 10, 2025 at 00:21
Signed-off-by: Wallas Santos <[email protected]>