Add MoE (e.g. Qwen3-30B-A3B, Mamba hybrid) pruning support in Minitron (#467)
**Type of change:** New feature
- Support pruning `num_moe_experts`, `moe_ffn_hidden_size`, and
`moe_shared_expert_intermediate_size` in `mcore_minitron` pruning
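
As a rough sketch of what this enables, the new dimensions can be requested alongside the existing ones in the Minitron export config. The snippet below is illustrative only: the `mtp.prune` call shape mirrors existing `mcore_minitron` usage, while the target values, `model`, and `calib_dataloader` are placeholders the user must supply and should be checked against the current Model Optimizer docs.

```python
import modelopt.torch.prune as mtp


def forward_loop(model):
    # Calibration forward passes used to collect activation magnitudes
    # for importance ranking (calib_dataloader is user-provided).
    for batch in calib_dataloader:
        model(**batch)


export_config = {
    # Existing width-pruning dimensions (example values).
    "ffn_hidden_size": 9216,
    "hidden_size": 3072,
    # New MoE dimensions added by this change (example values).
    "num_moe_experts": 64,
    "moe_ffn_hidden_size": 512,
    "moe_shared_expert_intermediate_size": 4096,
}

# `model` is a Megatron Core GPT/Mamba/MoE model loaded beforehand.
model, _ = mtp.prune(
    model,
    mode="mcore_minitron",
    constraints={"export_config": export_config},
    dummy_input=None,  # not used by mcore_minitron
    config={"forward_loop": forward_loop},
)
```

The MoE keys can be combined freely with the other width and depth dimensions, subject to the guidelines in `examples/pruning/README.md`.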
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: Yes
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes
**Release Notes** (auto-generated by coderabbit.ai):
* **New Features**
* Added Mixture of Experts (MoE) pruning support with new configurable
dimensions for expert count and intermediate sizes
* Extended NAS architecture search capabilities to include MoE model
parameters
* **Documentation**
* Updated support matrix and pruning documentation for MoE-compatible
models
* Clarified available pruning dimensions and parameters for MoE
architectures
---------
Signed-off-by: Keval Morabia <[email protected]>
Co-authored-by: Keval Morabia <[email protected]>
Signed-off-by: Keval Morabia <[email protected]>
**CHANGELOG.rst** (1 addition, 0 deletions)
@@ -10,6 +10,7 @@ Model Optimizer Changelog (Linux)

 **New Features**

+- Add MoE (e.g. Qwen3-30B-A3B, gpt-oss-20b) pruning support for ``num_moe_experts``, ``moe_ffn_hidden_size`` and ``moe_shared_expert_intermediate_size`` parameters in Minitron pruning (``mcore_minitron``).
 - Add ``specdec_bench`` example to benchmark speculative decoding performance. See `examples/specdec_bench/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/specdec_bench#speculative-decoding-benchmark>`_ for more details.
 Checkout pruning [getting started section](../pruning/README.md#getting-started) and [guidelines](../pruning/README.md#pruning-guidelines) for configuring pruning parameters in the pruning README.

-Pruning is supported for GPT and Mamba models in Pipeline Parallel mode. Available pruning options are:
+Pruning is supported for GPT and Mamba models in Pipeline Parallel mode. Available pruning dimensions are:

 - `TARGET_FFN_HIDDEN_SIZE`
 - `TARGET_HIDDEN_SIZE`
 - `TARGET_NUM_ATTENTION_HEADS`
 - `TARGET_NUM_QUERY_GROUPS`
 - `TARGET_MAMBA_NUM_HEADS`
 - `TARGET_MAMBA_HEAD_DIM`
+- `TARGET_NUM_MOE_EXPERTS`
+- `TARGET_MOE_FFN_HIDDEN_SIZE`
+- `TARGET_MOE_SHARED_EXPERT_INTERMEDIATE_SIZE`
 - `TARGET_NUM_LAYERS`
 - `LAYERS_TO_DROP` (comma separated, 1-indexed list of layer numbers to directly drop)
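
For illustration, the sketch below shows one hypothetical way these `TARGET_*` knobs could be gathered into the export config consumed by `mcore_minitron` pruning; the environment-variable-to-parameter mapping and the surrounding launch flow are assumptions, not part of this change.

```python
# Hypothetical helper (illustration only): map the TARGET_* variables listed
# above to export_config keys, keeping only the ones that are set.
import os

TARGET_TO_PARAM = {
    "TARGET_FFN_HIDDEN_SIZE": "ffn_hidden_size",
    "TARGET_HIDDEN_SIZE": "hidden_size",
    "TARGET_NUM_ATTENTION_HEADS": "num_attention_heads",
    "TARGET_NUM_QUERY_GROUPS": "num_query_groups",
    "TARGET_MAMBA_NUM_HEADS": "mamba_num_heads",
    "TARGET_MAMBA_HEAD_DIM": "mamba_head_dim",
    "TARGET_NUM_MOE_EXPERTS": "num_moe_experts",  # new in this PR
    "TARGET_MOE_FFN_HIDDEN_SIZE": "moe_ffn_hidden_size",  # new in this PR
    "TARGET_MOE_SHARED_EXPERT_INTERMEDIATE_SIZE": "moe_shared_expert_intermediate_size",  # new in this PR
    "TARGET_NUM_LAYERS": "num_layers",
}

export_config = {
    param: int(os.environ[var])
    for var, param in TARGET_TO_PARAM.items()
    if os.environ.get(var)
}
print(export_config)  # e.g. {'num_moe_experts': 64, 'moe_ffn_hidden_size': 512}
```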
**examples/pruning/README.md** (4 additions, 4 deletions)
@@ -6,7 +6,7 @@ Pruning can involve removal (prune) of Linear and Conv layers, and Transformer a

 This section focuses on applying Model Optimizer's state-of-the-art complementary pruning modes to enable you to search for the best subnet architecture from your provided base model:

-1. [Minitron](https://arxiv.org/pdf/2408.11796): A pruning method developed by NVIDIA Research for pruning GPT, Mamba and Hybrid Transformer Mamba models in NVIDIA NeMo or Megatron-LM framework. It uses the activation magnitudes to prune the embedding hidden size, mlp ffn hidden size, transformer attention heads, GQA query groups, mamba heads and head dimension, and number of layers of the model.
+1. [Minitron](https://arxiv.org/pdf/2408.11796): A pruning method developed by NVIDIA Research for pruning GPT (and later extended to Mamba, MoE, and Hybrid Transformer Mamba) models in NVIDIA Megatron-LM or NeMo framework. It uses the activation magnitudes to prune the embedding hidden size; mlp ffn hidden size; transformer attention heads and GQA query groups; mamba heads and head dimension; MoE number of experts, ffn hidden size, and shared expert intermediate size; and number of layers of the model.
 1. FastNAS: A pruning method recommended for Computer Vision models. Given a pretrained model, FastNAS finds the subnet which maximizes the score function while meeting the given constraints.
 1. GradNAS: A light-weight pruning method recommended for language models like Hugging Face BERT, GPT-J. It uses the gradient information to prune the model's linear layers and attention heads to meet the given constraints.
@@ -89,11 +89,11 @@ If your model parameters are already sorted, you can skip the sorting step by se

-> *<sup>1.</sup>Only Pipeline Parallel models are supported. Hugging Face models can be converted to NeMo format and used subsequently.*
+> *<sup>1.</sup>Only Pipeline Parallel models are supported. Hugging Face models can be converted to Megatron-LM/NeMo format and used subsequently.*

 ## Pruning Guidelines
@@ -122,7 +122,7 @@ Depth pruning reduces the number of layers (`num_layers`) in the model.

 #### Width Pruning

-Width pruning reduces model dimensions per layer such as `hidden_size`, `ffn_hidden_size`, `num_attention_heads`, `num_query_groups`, `mamba_num_heads`, and `mamba_head_dim`.
+Width pruning reduces model dimensions per layer such as `hidden_size`, `ffn_hidden_size`, `num_attention_heads`, `num_query_groups`, `mamba_num_heads`, `mamba_head_dim`, `num_moe_experts`, `moe_ffn_hidden_size`, and `moe_shared_expert_intermediate_size`.