test: update tests to use golden token injection #510
base: main
Conversation
Signed-off-by: Wallas Santos <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Or this can be done with bot:test
Now you are good to go 🚀
Signed-off-by: Wallas Santos <[email protected]>
…-gti-tests Signed-off-by: Wallas Santos <[email protected]>
Force-pushed from 58dc800 to 20eb7cc
- seqs_max_tokens = [10, 13, 2]
- prompts_lengths = [49, 41, 5]
+ seqs_max_tokens = [33, 36, 2]
+ prompts_lengths = [49, 41, 32]
Why do these values need to be updated? I think we had tried to keep these pretty small, because when running on CPU it takes quite a while to generate tokens. The CB tests used to take almost 20 minutes on GitHub Actions.
oops... that's a good question.
Because of this optimization: #262
Specifically this line:
n_fully_padded_blocks = math.floor(
We have to use longer prompts and advance the tkv to match the expected number of blocks. That's the only way I found to update this test.
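For context, here is a rough sketch of the kind of computation behind that line; the block size, function names, and reserved-block formula below are assumptions for illustration, not the actual vllm-spyre implementation:

```python
import math

BLOCK_SIZE = 64  # assumed block size for this example

def n_fully_padded_blocks(prompt_len: int, tkv: int) -> int:
    # Left padding fills the gap between the prompt and the current tkv;
    # blocks made up entirely of padding don't need to be reserved.
    left_padding = max(tkv - prompt_len, 0)
    return math.floor(left_padding / BLOCK_SIZE)

def n_reserved_blocks(prompt_len: int, tkv: int) -> int:
    # Blocks spanned by tkv, minus the ones that are pure padding.
    return math.ceil(tkv / BLOCK_SIZE) - n_fully_padded_blocks(prompt_len, tkv)

# A very short prompt leaves whole blocks of pure padding, so fewer blocks get
# reserved than the original test values expected; longer prompts keep the
# expected block counts stable.
print(n_reserved_blocks(prompt_len=5, tkv=128))   # 1
print(n_reserved_blocks(prompt_len=70, tkv=128))  # 2
```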
But... maybe I can revert the tests in this file. We won't run with the full model anyway. What do you think?
I did a quick benchmark to verify the impact on the tests, as suggested by @maxdebayser:
I ran in a dev pod:
python -m pytest -vs --disable-warnings --forked tests/e2e/test_spyre_cb_scheduler_steps.py -m cpu
Original
========== 34 passed, 68 deselected, 72 warnings in 720.32s (0:12:00) ==========
========== 34 passed, 68 deselected, 72 warnings in 584.42s (0:09:44) ==========
========== 34 passed, 68 deselected, 72 warnings in 516.96s (0:08:36) ==========
Updated
========== 34 passed, 68 deselected, 72 warnings in 538.94s (0:08:58) ==========
========== 34 passed, 68 deselected, 72 warnings in 541.26s (0:09:01) ==========
========== 34 passed, 68 deselected, 72 warnings in 544.73s (0:09:04) ==========
Interesting that one run took longer than expected, but it looks like it was an outlier.
I'm still very confused 😕
Are you saying that
1. these tests need to be updated to properly test that optimization, or
2. they need to be updated in order to add the golden token injection?
If (1), then we should separate that change into a different PR; if (2), then I don't get why adding GTI would cause the smaller sequences to fail.
Ah, nice benchmark! For -m cpu we should run without --forked, but even then the CPUs on GHA are way slower than the ones on our clusters, so we won't really see comparable results.
Oh, but these tests aren't even using golden token injection? Because we're not using validate_vllm_vs_hf_output?
I'm still very confused 😕
Are you saying that
1. these tests need to be updated to properly test that optimization, or
2. they need to be updated in order to add the golden token injection?
If (1), then we should separate that change into a different PR; if (2), then I don't get why adding GTI would cause the smaller sequences to fail.
Sorry for that, I think it is more (1). I needed to update them to have more context so we could have fewer errors with GTI when using the quantized full model.
Oh but these tests aren't even using golden token injection? Because we're not using validate_vllm_vs_hf_output?
They don't use this new method, but they are already using GTI (it is already on main). Since they generate using the engine directly to get information step by step, they won't be compatible with validate_vllm_vs_hf_output, which uses the LLM class to generate.
Since there are a lot of caveats in this class, I am thinking of reverting it (it already has GTI), and we can plan better later to actually get more benefit from GTI.
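For readers following along, a rough sketch of the two generation paths being contrasted; the model name and helper below are illustrative, not the repository's test code:

```python
from vllm import LLM, SamplingParams

def generate_via_llm_class(prompts: list[str]) -> list[str]:
    # Path that validate_vllm_vs_hf_output relies on: generation happens in a
    # single high-level call, with no hook into individual scheduler steps.
    llm = LLM(model="facebook/opt-125m")  # placeholder model for illustration
    outputs = llm.generate(prompts, SamplingParams(max_tokens=8))
    return [out.outputs[0].text for out in outputs]

# The scheduler-step tests instead drive the engine directly and assert on
# intermediate state (tkv, reserved blocks, running requests) after each step,
# so the HF baseline tokens are injected up front rather than through a
# wrapper like validate_vllm_vs_hf_output.
```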
ah, gotcha. Thanks for the detailed explanation!
Signed-off-by: Wallas Santos <[email protected]>
bot:test
Signed-off-by: Wallas Santos <[email protected]>
Description
This PR is a follow-up to #478.
It updates the remaining tests that use the function check_output_against_hf, replacing it with the new validate_vllm_vs_hf_output, which does "everything": it generates outputs for both vLLM and HF, uses the HF results as the baseline for golden token injection, and then compares the results of both systems.
I also included these environment variables to make the test setup more flexible:
- VLLM_SPYRE_TEST_DISABLE_GOLDEN_TOKEN - disables the golden token injection
- VLLM_SPYRE_TEST_ABS_TOL - overrides the constant ISCLOSE_ABS_TOL
- VLLM_SPYRE_TEST_QUANTIZED_ABS_TOL - overrides the constant ISCLOSE_ABS_TOL_QUANTIZATION
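As a rough illustration of how such overrides are typically consumed, here is a hypothetical sketch; the env var and constant names come from the description above, but the default values and function names are assumptions, not the PR's exact code:

```python
import os

ISCLOSE_ABS_TOL = 0.001               # assumed default tolerance
ISCLOSE_ABS_TOL_QUANTIZATION = 0.01   # assumed default for quantized models

def golden_token_injection_enabled() -> bool:
    # VLLM_SPYRE_TEST_DISABLE_GOLDEN_TOKEN turns GTI off when set.
    return os.getenv("VLLM_SPYRE_TEST_DISABLE_GOLDEN_TOKEN", "0").lower() not in ("1", "true")

def abs_tolerance(quantized: bool) -> float:
    # The *_ABS_TOL variables override the corresponding constants.
    if quantized:
        return float(os.getenv("VLLM_SPYRE_TEST_QUANTIZED_ABS_TOL",
                               ISCLOSE_ABS_TOL_QUANTIZATION))
    return float(os.getenv("VLLM_SPYRE_TEST_ABS_TOL", ISCLOSE_ABS_TOL))
```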