
Conversation


@mxinO mxinO commented Oct 30, 2025

What does this PR do?

Type of change: Bug fix

Overview:
Fixes and improves the vLLM PTQ flow.

  1. Now supports Ray and can run on multiple nodes.
  2. Fixes an MoE typo and improves weight folding for large MoE layers.
  3. Adds the SharedFusedMoE layer.
  4. Supports vLLM > 0.11 (not released yet).
  5. Adds an OS environment variable to specify quant configs (see the sketch under Usage below).

Usage
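A minimal sketch of selecting a quant config from an environment variable. The variable name and default below are assumptions for illustration, not necessarily the names this PR introduces:

```python
import os

import modelopt.torch.quantization as mtq

# Hypothetical env var name; the PR's actual variable may differ.
cfg_name = os.environ.get("QUANT_CFG", "FP8_DEFAULT_CFG")
quant_cfg = getattr(mtq, cfg_name)  # e.g. mtq.FP8_DEFAULT_CFG or mtq.NVFP4_DEFAULT_CFG
```

With this pattern, something like `QUANT_CFG=NVFP4_DEFAULT_CFG` switches the quantization format without code changes.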

Testing

Tested with the latest vLLM.

Additional Information

vLLM > 0.11.0 changed the low-level API significantly. Some of these changes should be removed once vLLM <= 0.11.0 is no longer supported.

@mxinO mxinO self-assigned this Oct 30, 2025

copy-pr-bot bot commented Oct 30, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@mxinO mxinO changed the title from "[Draft] Fix/Improve vllm PTQ" to "Fix/Improve vllm PTQ, and support latest vllm" on Nov 4, 2025
@mxinO mxinO changed the title from "Fix/Improve vllm PTQ, and support latest vllm" to "Fix/Improve vllm PTQ and Support multi-node with ray" on Nov 4, 2025
@mxinO mxinO marked this pull request as ready for review November 4, 2025 06:07
@mxinO mxinO requested review from a team as code owners November 4, 2025 06:07
Contributor

@mxinO does this maintain support for non-Ray + vLLM?

Author

Sure, it still works for non-ray.

Comment on lines +186 to +206
```python
model.load_state_dict(current_state_dict)
torch.distributed.barrier()

if amax_file_path is None:
    # Sync amax across TP can be done here if needed
    pass
    # for name, buffer in model.named_buffers():
    #     if name.endswith("_amax"):
    #         print("syncing amax across TP for", name)
    #         torch.distributed.all_reduce(
    #             buffer, op=torch.distributed.ReduceOp.MAX, group=get_tp_group().device_group
    #         )
    # torch.distributed.barrier()

if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    mtq.print_quant_summary(model)

mtq.fold_weight(model)
for name, module in model.named_modules():
    if name.endswith("weight_quantizer"):
        assert not module.is_enabled, f"quantizer {name} is still enabled"
```
Contributor

Do we need to do this under the disable_compilation context?

Author

@mxinO mxinO Nov 11, 2025

I didn't find any issue here without disable_compilation.

mxinO and others added 17 commits November 10, 2025 22:15
Signed-off-by: mxin <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: Kinjal Patel <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: noeyy-mino <[email protected]>
Signed-off-by: mxin <[email protected]>
kevalmorabia97 and others added 4 commits November 10, 2025 22:15
- Allow manual wheel build and release without depending on test status
(sometimes nmm-sandbox tests fail because of unavailable Slurm machines)

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: mxin <[email protected]>
…VILA (#525)

## What does this PR do?

**Type of change:** Bug fix

**Overview:** Prompt the user to manually install the correct transformers version for VILA.
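A minimal sketch of such a prompt; the pinned version string below is a hypothetical placeholder, not necessarily the version VILA actually requires:

```python
import transformers

REQUIRED_VERSION = "4.36.2"  # hypothetical pin, for illustration only
if transformers.__version__ != REQUIRED_VERSION:
    raise ImportError(
        f"VILA requires transformers=={REQUIRED_VERSION}, "
        f"found {transformers.__version__}. "
        f"Please run: pip install transformers=={REQUIRED_VERSION}"
    )
```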

## Usage
<!-- You can potentially add a usage example below. -->

```python
# Add a code snippet demonstrating how to use this
```

## Testing
<!-- Mention how have you tested your change if applicable. -->
```
CUDA_VISIBLE_DEVICES=0 bash -e scripts/huggingface_example.sh --model /models/VILA1.5-3b --quant fp8 --tp 1 --pp 1 --trust_remote_code --kv_cache_free_gpu_memory_fraction 0.5
```

## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->

- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes/No <!--- If No, explain
why. -->
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update
[Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes/No <!--- Only for new features, API changes, critical bug fixes or
bw breaking changes. -->

## Additional Information
<!-- E.g. related issue. -->

Signed-off-by: Yue <[email protected]>
Signed-off-by: mxin <[email protected]>
## What does this PR do?

**Type of change:** Bug fix

**Overview:** Ensure nodes are topologically sorted in ONNX graph.
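
For illustration, one way to topologically sort an ONNX graph with onnx_graphsurgeon; this is a sketch under assumed file names, not necessarily the code path used by this fix:

```python
import onnx
import onnx_graphsurgeon as gs

# "model.onnx" is a placeholder path for illustration.
graph = gs.import_onnx(onnx.load("model.onnx"))
graph.toposort()  # reorder nodes so every producer appears before its consumers
onnx.save(gs.export_onnx(graph), "model_toposorted.onnx")
```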

## Usage

```bash
python -m modelopt.onnx.quantization --onnx_path=$MODEL_NAME.onnx
```

## Testing
See bug 5591945 (model 4) and 5589019@13.

## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->

- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update
[Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**:
No

---------

Signed-off-by: gcunhase <[email protected]>
Signed-off-by: mxin <[email protected]>
## What does this PR do?

**Type of change:** Improve existing feature

**Overview:** The GPT-OSS model has Yarn RoPE, which adds additional
nn.Embedding modules that need to be enabled in DynamicModule for
Minitron pruning.
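
Purely as an illustration in plain PyTorch (not modelopt's DynamicModule API): the Yarn RoPE table shows up as an ordinary nn.Embedding submodule, so pruning logic that enumerates embeddings has to account for it.

```python
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a transformer block; the rotary table below is hypothetical."""
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(64, 64)
        self.rotary_emb = nn.Embedding(4096, 64)  # extra embedding introduced by Yarn RoPE

block = Block()
embedding_names = [n for n, m in block.named_modules() if isinstance(m, nn.Embedding)]
print(embedding_names)  # ['rotary_emb'] -- must be enabled/handled during pruning
```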

## Testing
<!-- Mention how have you tested your change if applicable. -->

- gpt-oss-20b pruned using M-LM pruning example and conf scripts.

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: mxin <[email protected]>
Author

mxinO commented Nov 11, 2025

Sorry, messed up the sign-offs


codecov bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 70.41420% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.37%. Comparing base (f2eb794) to head (dff4960).
⚠️ Report is 37 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/onnx/quantization/ort_patching.py | 0.00% | 24 Missing ⚠️ |
| modelopt/torch/_deploy/utils/torch_onnx.py | 25.00% | 9 Missing ⚠️ |
| modelopt/torch/quantization/utils.py | 25.00% | 6 Missing ⚠️ |
| modelopt/onnx/quantization/qdq_utils.py | 80.00% | 4 Missing ⚠️ |
| modelopt/onnx/autocast/graphsanitizer.py | 72.72% | 3 Missing ⚠️ |
| modelopt/onnx/quantization/fp8.py | 0.00% | 2 Missing ⚠️ |
| modelopt/onnx/trt_utils.py | 71.42% | 2 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #484      +/-   ##
==========================================
+ Coverage   73.39%   74.37%   +0.98%     
==========================================
  Files         180      182       +2     
  Lines       18138    18219      +81     
==========================================
+ Hits        13312    13550     +238     
+ Misses       4826     4669     -157     
```

