[Model][Qwen3VL] Add torch.compile support for Qwen3VL
#27741
Conversation
Signed-off-by: Lukas Geiger <[email protected]>
Force-pushed from 7576d4f to 29fb61c
I ran some benchmarks on an L40s and it looks like this change would increase memory usage. Previously I was able to run …; with this PR it seems like the maximum model length would decrease to … @Lucaskabela Have you seen a similar behaviour for Qwen2.5 VL?

Performance-wise it also looks like throughput is worse:

main: …

torch compiled: …
Hm, I didn't observe the model-length issues in my previous PR, as memory usage shouldn't increase during runtime (just at compile time, unless we are doing some tricks here). The throughput decrease also seems odd to me, since time per output token and ITL are both improving; it seems the TTFT is getting worse here. I wonder if there is some dimension we need to mark dynamic? If we are recompiling, this could explain the higher TTFT and the memory increase.
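(As a reference for the "mark dynamic" idea: a minimal, self-contained sketch of the underlying PyTorch mechanism, not code from this PR; in vLLM this is normally handled by the `support_torch_compile` decorator rather than direct calls like these.)

```python
import torch

def toy_forward(hidden_states: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(hidden_states)

compiled = torch.compile(toy_forward)

# Mark the sequence dimension (dim 0 here) as dynamic so that varying
# sequence lengths reuse one compiled graph instead of recompiling.
x = torch.randn(128, 4096)
torch._dynamo.mark_dynamic(x, 0)
compiled(x)                       # compiles once with a symbolic dim 0
compiled(torch.randn(256, 4096))  # should reuse the same graph
```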
One way we can check is running tlparse and looking at the logs. Can you try prefixing your command with …?
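(The elided prefix is presumably the `TORCH_TRACE` environment variable whose output directory `tlparse` then consumes, e.g. `TORCH_TRACE=/tmp/trace <command>` followed by `tlparse /tmp/trace`; that exact invocation is an assumption on my part. A quicker in-process check for recompiles, as a sketch:)

```python
import torch

# Assumed quick check, not from this thread: have Dynamo log every
# recompile together with the guard failure that caused it, so
# recompiles show up directly in the server output.
torch._logging.set_logs(recompiles=True)
```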
I will try to look at this tomorrow, but I'm also trying to get some vLLM changes into the PyTorch 2.9.1 release, so I may not be able to get to it; I'll update after investigating on my end.
All good. I'm just documenting it here. I'll also have a look when I have time later this week or next.
I also wonder if the FP8 extension could be contributing to this overhead? I haven't looked much into how this quantization interplays with torch.compile.
Running a warmed-up model (run the benchmark twice, take the second run), I got:

… vs …

I think this supports my idea that the current integration may have some recompilation happening first. I didn't observe the same size issues, but I couldn't run the command you provided on main, so I had to reduce my seq_len to fit on my machine. Will investigate recompiles with tlparse.
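(For reference, a sketch of the "run twice, take the second" methodology in plain PyTorch; this is illustrative only and just shows why the first run absorbs the one-time compile cost.)

```python
import time
import torch

def bench(fn, x, iters=10):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

model = torch.nn.Linear(4096, 4096).cuda()
compiled = torch.compile(model)
x = torch.randn(64, 4096, device="cuda")

cold = bench(compiled, x)  # first pass: includes one-time compilation
warm = bench(compiled, x)  # steady state: the number worth reporting
print(f"cold: {cold:.4f}s/iter, warm: {warm:.4f}s/iter")
```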
So I tried running …
Purpose
This is a follow-up to #23207 and adds `torch.compile` support to Qwen3VL. I'm keeping it as a draft PR until I've had time to run some benchmarks and correctness tests later this week.

/cc @Lucaskabela
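(A hedged sketch of what adding `torch.compile` support typically looks like for a vLLM model, following the pattern of #23207; the class name and dimension values below are illustrative assumptions, not the actual diff.)

```python
import torch
from torch import nn

from vllm.compilation.decorators import support_torch_compile

# Illustrative only: decorating the language-model backbone lets vLLM
# compile its forward pass; dynamic_arg_dims tells Dynamo which input
# dimensions vary across calls (sequence length) so shape changes do
# not trigger recompiles.
@support_torch_compile(dynamic_arg_dims={"input_ids": 0, "positions": -1})
class Qwen3VLTextModel(nn.Module):  # hypothetical class name
    def forward(self, input_ids: torch.Tensor, positions: torch.Tensor):
        ...
```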
Test Plan
Test Result