
Conversation


@mxinO mxinO commented Oct 30, 2025

What does this PR do?

Type of change: Bug fix

Overview:
Fixes and improves the vLLM PTQ flow.

  1. Now supports Ray and can run on multiple nodes.
  2. Fixes an MoE typo and improves weight folding for large MoE layers.
  3. Adds the SharedFusedMoE layer.
  4. Supports vLLM > 0.11 (not released yet).
  5. Adds an OS environment variable to specify quant configs (see the sketch under Usage below).

Usage
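A minimal sketch of selecting a quant config from an environment variable. The variable name and default below are assumptions for illustration, not necessarily the names this PR introduces:

```python
import os

import modelopt.torch.quantization as mtq

# Hypothetical env var name; the PR's actual variable may differ.
cfg_name = os.environ.get("QUANT_CFG", "FP8_DEFAULT_CFG")
quant_cfg = getattr(mtq, cfg_name)  # e.g. mtq.FP8_DEFAULT_CFG or mtq.NVFP4_DEFAULT_CFG
```

With this pattern, something like `QUANT_CFG=NVFP4_DEFAULT_CFG` switches the quantization format without code changes.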

Testing

Tested with the latest vLLM.

Additional Information

vLLM > 0.11.0 changed the low-level API significantly. Some of these changes should be removed once vLLM <= 0.11.0 is no longer supported.

@mxinO mxinO self-assigned this Oct 30, 2025

copy-pr-bot bot commented Oct 30, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@mxinO mxinO changed the title from "[Draft] Fix/Improve vllm PTQ" to "Fix/Improve vllm PTQ, and support latest vllm" on Nov 4, 2025
@mxinO mxinO changed the title from "Fix/Improve vllm PTQ, and support latest vllm" to "Fix/Improve vllm PTQ and Support multi-node with ray" on Nov 4, 2025
@mxinO mxinO marked this pull request as ready for review November 4, 2025 06:07
@mxinO mxinO requested review from a team as code owners November 4, 2025 06:07
Contributor

@mxinO does this maintain support for non-Ray + vLLM?

Author

Sure, it still works for non-ray.

Comment on lines +186 to +206
```python
model.load_state_dict(current_state_dict)
torch.distributed.barrier()

if amax_file_path is None:
    # Sync amax across TP can be done here if needed
    pass
    # for name, buffer in model.named_buffers():
    #     if name.endswith("_amax"):
    #         print("syncing amax across TP for", name)
    #         torch.distributed.all_reduce(
    #             buffer, op=torch.distributed.ReduceOp.MAX, group=get_tp_group().device_group
    #         )
    # torch.distributed.barrier()

if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    mtq.print_quant_summary(model)

mtq.fold_weight(model)
for name, module in model.named_modules():
    if name.endswith("weight_quantizer"):
        assert not module.is_enabled, f"quantizer {name} is still enabled"
```
Contributor

Do we need to do this under the disable_compilation context?

Author

@mxinO mxinO Nov 11, 2025

I didn't find any issue here without disable_compilation.

mxinO and others added 17 commits November 10, 2025 22:15
Signed-off-by: mxin <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: Kinjal Patel <[email protected]>
Signed-off-by: mxin <[email protected]>
Signed-off-by: noeyy-mino <[email protected]>
Signed-off-by: mxin <[email protected]>
kevalmorabia97 and others added 4 commits November 10, 2025 22:15
- Allow manual wheel build and release without depending on test status
(sometimes nmm-sandbox tests fail because of unavailable Slurm machines)

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: mxin <[email protected]>
…VILA (#525)

## What does this PR do?

**Type of change:** Bug fix

**Overview:** Prompt the user to manually install the correct transformers version for VILA.
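A minimal sketch of such a prompt; the pinned version string below is a hypothetical placeholder, not necessarily the version VILA actually requires:

```python
import transformers

REQUIRED_VERSION = "4.36.2"  # hypothetical pin, for illustration only
if transformers.__version__ != REQUIRED_VERSION:
    raise ImportError(
        f"VILA requires transformers=={REQUIRED_VERSION}, "
        f"found {transformers.__version__}. "
        f"Please run: pip install transformers=={REQUIRED_VERSION}"
    )
```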

## Usage
<!-- You can potentially add a usage example below. -->

```python
# Add a code snippet demonstrating how to use this
```

## Testing
<!-- Mention how have you tested your change if applicable. -->
```
CUDA_VISIBLE_DEVICES=0 bash -e scripts/huggingface_example.sh --model /models/VILA1.5-3b --quant fp8 --tp 1 --pp 1 --trust_remote_code --kv_cache_free_gpu_memory_fraction 0.5
```

## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->

- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes/No <!--- If No, explain
why. -->
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update
[Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes/No <!--- Only for new features, API changes, critical bug fixes or
bw breaking changes. -->

## Additional Information
<!-- E.g. related issue. -->

Signed-off-by: Yue <[email protected]>
Signed-off-by: mxin <[email protected]>
## What does this PR do?

**Type of change:** Bug fix

**Overview:** Ensure nodes are topologically sorted in ONNX graph.
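
For illustration, one way to topologically sort an ONNX graph with onnx_graphsurgeon; this is a sketch under assumed file names, not necessarily the code path used by this fix:

```python
import onnx
import onnx_graphsurgeon as gs

# "model.onnx" is a placeholder path for illustration.
graph = gs.import_onnx(onnx.load("model.onnx"))
graph.toposort()  # reorder nodes so every producer appears before its consumers
onnx.save(gs.export_onnx(graph), "model_toposorted.onnx")
```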

## Usage

```bash
python -m modelopt.onnx.quantization --onnx_path=$MODEL_NAME.onnx
```

## Testing
See bug 5591945 (model 4) and 5589019@13.

## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->

- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update
[Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**:
No

---------

Signed-off-by: gcunhase <[email protected]>
Signed-off-by: mxin <[email protected]>
## What does this PR do?

**Type of change:** Improve existing feature

**Overview:** The GPT-OSS model has Yarn RoPE, which adds additional
nn.Embedding modules that need to be enabled in DynamicModule for
Minitron pruning.
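
Purely as an illustration in plain PyTorch (not modelopt's DynamicModule API): the Yarn RoPE table shows up as an ordinary nn.Embedding submodule, so pruning logic that enumerates embeddings has to account for it.

```python
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a transformer block; the rotary table below is hypothetical."""
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(64, 64)
        self.rotary_emb = nn.Embedding(4096, 64)  # extra embedding introduced by Yarn RoPE

block = Block()
embedding_names = [n for n, m in block.named_modules() if isinstance(m, nn.Embedding)]
print(embedding_names)  # ['rotary_emb'] -- must be enabled/handled during pruning
```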

## Testing
<!-- Mention how have you tested your change if applicable. -->

- gpt-oss-20b pruned using M-LM pruning example and conf scripts.

Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: mxin <[email protected]>
Author

mxinO commented Nov 11, 2025

Sorry, messed up the sign-offs


codecov bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 70.41420% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.37%. Comparing base (f2eb794) to head (dff4960).
⚠️ Report is 37 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/onnx/quantization/ort_patching.py | 0.00% | 24 Missing ⚠️ |
| modelopt/torch/_deploy/utils/torch_onnx.py | 25.00% | 9 Missing ⚠️ |
| modelopt/torch/quantization/utils.py | 25.00% | 6 Missing ⚠️ |
| modelopt/onnx/quantization/qdq_utils.py | 80.00% | 4 Missing ⚠️ |
| modelopt/onnx/autocast/graphsanitizer.py | 72.72% | 3 Missing ⚠️ |
| modelopt/onnx/quantization/fp8.py | 0.00% | 2 Missing ⚠️ |
| modelopt/onnx/trt_utils.py | 71.42% | 2 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #484      +/-   ##
==========================================
+ Coverage   73.39%   74.37%   +0.98%     
==========================================
  Files         180      182       +2     
  Lines       18138    18219      +81     
==========================================
+ Hits        13312    13550     +238     
+ Misses       4826     4669     -157     
```

