[Model] Add MoE support for NemotronH #25863
Conversation
…edReLu activation - adapt the FusedMoE object to support is_act_and_mul=False Signed-off-by: Tomer Asida <[email protected]>
…s an attribute in FusedMoE Signed-off-by: Tomer Asida <[email protected]>
Code Review
This pull request adds support for a non-gated Squared ReLU MoE module in the NemotronH architecture, which is a valuable enhancement. The changes are mostly well-implemented across the fused MoE layers and model definition. However, I've identified a critical bug in the forward pass of the new NemotronHMoE module related to incorrect floating-point computation and a potential UnboundLocalError. I've provided a detailed comment with a suggested fix for this issue. Addressing this is crucial for the correctness of the model's output.
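For readers who want to see the failure mode being flagged, here is a generic illustration (hypothetical names and structure, not the actual NemotronHMoE code) of how a variable that is only assigned on one branch of a forward pass produces an UnboundLocalError:

import torch

def forward(hidden_states: torch.Tensor, use_shared_expert: bool) -> torch.Tensor:
    # Hypothetical sketch only; not taken from this PR.
    if use_shared_expert:
        shared_out = hidden_states * 2.0  # bound only on this branch
    routed_out = hidden_states + 1.0
    # If use_shared_expert is False, `shared_out` was never bound, so this
    # line raises UnboundLocalError instead of returning routed_out alone.
    return routed_out + shared_out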
To the reviewer(s)
NemotronHForCausalLM now optionally has an MoE block. I was wondering if it should implement the MixtureOfExperts interface or not. Do you have any guidance?
We might need to do something similar to this PR #25311 (comment), where is_mixture_of_experts depends on an attribute of the model. I don't know all the cases where this is used, though.
…xperts Signed-off-by: Tomer Asida <[email protected]>
Force-pushed from bf2285e to 7fff9a8
This pull request has merge conflicts that must be resolved before it can be merged.
if not self.moe_config.is_act_and_mul:
    # Avoid circular import
    from vllm.model_executor.layers.quantization.modelopt import (
        ModelOptFp8MoEMethod,
    )

    if not isinstance(
        quant_method, (UnquantizedFusedMoEMethod, ModelOptFp8MoEMethod)
    ):
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for unquantized "
            "and ModelOpt FP8 moe for now"
        )
    if not current_platform.is_cuda():
        raise NotImplementedError(
            "is_act_and_mul=False is supported only for CUDA for now"
        )
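For context, a minimal sketch of how a model definition might opt into the new behaviour; the keyword names below (in particular activation and is_act_and_mul) are assumptions inferred from the commit messages rather than a verified signature:

from vllm.model_executor.layers.fused_moe import FusedMoE

# Sketch only: the argument names for the new behaviour are assumptions based
# on this PR's commit messages, not a confirmed API.
experts = FusedMoE(
    num_experts=8,
    top_k=2,
    hidden_size=4096,
    intermediate_size=11008,
    renormalize=False,
    activation="relu2",    # non-gated Squared ReLU (assumed identifier)
    is_act_and_mul=False,  # the flag this PR introduces
)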
What are the blockers for supporting is_act_and_mul=False more generally?
Creating the relevant kernels :) We plan to follow up with that
if (
    envs.VLLM_USE_FLASHINFER_MOE_FP8
    and has_flashinfer_moe()
    and self.moe.is_act_and_mul
):
For NemotronH, self.flashinfer_moe_backend will end up being None. What implementation ends up getting used in this case?
Triton kernels. This is currently the only code path available with is_act_and_mul=False.
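To spell the dispatch out, here is a simplified, hypothetical sketch of the selection logic (not the actual vLLM code path), mirroring the condition quoted above:

def select_moe_backend(use_flashinfer_fp8: bool, has_flashinfer: bool,
                       is_act_and_mul: bool) -> str:
    # Hypothetical helper for illustration: with is_act_and_mul=False the
    # FlashInfer FP8 backend is never chosen, so execution falls back to the
    # Triton fused-MoE kernels, currently the only non-gated path.
    if use_flashinfer_fp8 and has_flashinfer and is_act_and_mul:
        return "flashinfer_fp8"
    return "triton"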
I suspect this is going to be very complicated to add to all the quant and kernel backends
        num_redundant_experts=self.n_redundant_experts,
    )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
For DP+TP cases, we should use the sequence parallel trick like in #24982 to avoid duplicate work in the expert layers
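For readers unfamiliar with the suggestion, a rough sketch of the sequence-parallel idea (illustrative only; not the #24982 implementation, and the padding scheme is an assumption):

import torch
import torch.distributed as dist
import torch.nn.functional as F

def sequence_parallel_moe(hidden_states: torch.Tensor, experts, tp_group) -> torch.Tensor:
    # Each TP rank runs the experts only on its slice of the tokens, then the
    # slices are gathered back, so DP+TP ranks do not repeat the same expert work.
    tp_size = dist.get_world_size(tp_group)
    tp_rank = dist.get_rank(tp_group)
    num_tokens = hidden_states.shape[0]
    pad = (-num_tokens) % tp_size  # pad so tokens split evenly across ranks
    if pad:
        hidden_states = F.pad(hidden_states, (0, 0, 0, pad))
    local_out = experts(hidden_states.chunk(tp_size, dim=0)[tp_rank])
    gathered = [torch.empty_like(local_out) for _ in range(tp_size)]
    dist.all_gather(gathered, local_out, group=tp_group)
    return torch.cat(gathered, dim=0)[:num_tokens]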
Purpose
Add support for an MoE module in the NemotronH architecture.
This MoE module is relatively unique (to the best of my knowledge, comparable only to nomic-ai/nomic-embed-text-v2-moe), as it uses a non-gated Squared ReLU activation function.
In this PR:
- Add a NemotronHMoE module to the NemotronH modeling file
- Add support for non-gated MoE (is_act_and_mul=False) through the FusedMoE class (in addition to by calling the fused_moe function directly)
- Add support for non-gated MoE in the ModelOptFp8MoEMethod quant_method, currently only in the triton path
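For readers unfamiliar with the distinction, here is a minimal PyTorch sketch (illustrative, not taken from this PR's code) of a non-gated Squared ReLU expert next to the common gated SwiGLU-style expert:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExpert(nn.Module):
    # Common gated expert MLP: down(silu(gate(x)) * up(x)) -- "act and mul".
    def __init__(self, hidden: int, inter: int):
        super().__init__()
        self.gate = nn.Linear(hidden, inter, bias=False)
        self.up = nn.Linear(hidden, inter, bias=False)
        self.down = nn.Linear(inter, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class SquaredReLUExpert(nn.Module):
    # Non-gated expert MLP: down(relu(up(x)) ** 2) -- no gate projection, no multiply.
    def __init__(self, hidden: int, inter: int):
        super().__init__()
        self.up = nn.Linear(hidden, inter, bias=False)
        self.down = nn.Linear(inter, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)) ** 2)

Because there is no gate projection to multiply with, the fused kernels cannot assume the usual activation-and-multiply weight layout, which is what the is_act_and_mul=False flag communicates to FusedMoE.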