Commit 700bb1a

[ort_fusion] Support fp16 in rms_norm fusion (#2491)
RMSNorm involves a compute_type and a target_type: the computation runs in compute_type, and the result is cast back to target_type afterward. A typical example is the RMSNorm class found in LLMs, e.g. in GPT-OSS: https://github.com/huggingface/transformers/blob/52c6c1bb6e27ca87c4faede34a4c2a7404c17c4d/src/transformers/models/gpt_oss/modeling_gpt_oss.py#L54 Therefore, the fusion pattern must take op.Cast into account.
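The compute/target dtype split described above can be sketched with a minimal numpy example (an illustration of the idea, not the onnxscript or transformers code; the `rms_norm` function and its signature are assumptions):

```python
import numpy as np

def rms_norm(x, scale, eps=1e-6):
    """RMSNorm with separate compute and target dtypes (illustrative sketch)."""
    target_dtype = x.dtype            # e.g. float16
    h = x.astype(np.float32)          # compute_type: upcast for numerical stability
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    normalized = h / rms
    # Cast back to the target dtype before applying the scale, mirroring the
    # Cast node that the fusion pattern must now match.
    return scale * normalized.astype(target_dtype)
```

Whether the scale multiplication happens in fp16 or after an explicit Cast of the scale to the compute dtype varies between model implementations, which is why the pattern matches both branches.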
1 parent 7407431 commit 700bb1a

File tree: 1 file changed (+2, -0 lines)

onnxscript/rewriter/ort_fusions/rms_normalization.py

Lines changed: 2 additions & 0 deletions
@@ -40,6 +40,8 @@ def pattern(self, op, x, scale, epsilon, compute_dtype, target_dtype):
         reciprocal_rms = op.Reciprocal(rms)
         normalized = op.Mul(x, reciprocal_rms)
         normalized = pattern.OrValue([op.Cast(normalized, to=target_dtype), normalized])
+        # To support float16, the scale may or may not be cast to the compute dtype.
+        scale = pattern.OrValue([op.Cast(scale, to=compute_dtype), scale])
         return op.Mul(scale, normalized)

     def check(
