Conversation

danielvegamyhre (Contributor) commented Oct 7, 2025

Stacked PRs:

  • [mxfp8 moe training] integrate mxfp8 dim0 triton kernel

Test plan

  • pytest test/prototype/moe_training/test_scaled_grouped_mm.py -k dq -s
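The `-k dq` filter selects the dequantization round-trip tests. As a hedged illustration of what an mxfp8 round trip looks like conceptually (this is not the triton kernel from this PR; the function names are hypothetical, and the 32-element block size, power-of-two scaling, and omission of E8M0 exponent-range clamping are assumptions for the sketch):

```python
import torch

BLOCK = 32        # assumed MX scaling-block size: 32 elements share one scale
F8_MAX = 448.0    # max magnitude representable in torch.float8_e4m3fn

def mxfp8_quantize_ref(x: torch.Tensor):
    """Reference round trip in plain PyTorch (not this PR's triton kernel).

    Quantizes contiguous 32-element blocks to fp8 with a shared power-of-two
    scale per block, mimicking the mxfp8 layout. Assumes x.numel() is a
    multiple of BLOCK.
    """
    shape = x.shape
    blocks = x.reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True)
    # Nearest power-of-two scale that maps the block's amax into fp8 range.
    exp = torch.ceil(torch.log2(amax.clamp(min=1e-30) / F8_MAX))
    scale = torch.exp2(exp)
    data = (blocks / scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn)
    return data.reshape(shape), scale.squeeze(-1)

def mxfp8_dequantize_ref(data: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Inverse of the sketch above: rescale fp8 blocks back to float32."""
    return (data.float().reshape(-1, BLOCK) * scale.unsqueeze(-1)).reshape(data.shape)

# Round-trip check, analogous in spirit to what the dq tests verify:
x = torch.randn(128, 64)
data, scale = mxfp8_quantize_ref(x)
err = (x - mxfp8_dequantize_ref(data, scale)).abs().max()
print(err)  # small, bounded by fp8 quantization error
```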

Benchmarks

```
M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(16384, 8192, 5120, 1)   MoEScalingType.MXFP8           4239.74              2978.88   1.423x                         1229.87           758.8    1.621x
(16384, 8192, 5120, 2)   MoEScalingType.MXFP8           4192.19              3381.7    1.24x                          1229.5           1079.33   1.139x
(16384, 8192, 5120, 4)   MoEScalingType.MXFP8           3920.91              3419.1    1.147x                         1093.42           820.416  1.333x
(16384, 8192, 5120, 8)   MoEScalingType.MXFP8           4309.06              3633.54   1.186x                         1093.73           932.128  1.173x
(128000, 8192, 5120, 1)  MoEScalingType.MXFP8          50533.1              23270.9    2.172x                        12149.6           6208.59   1.957x
(128000, 8192, 5120, 2)  MoEScalingType.MXFP8          57250.8              23629.6    2.423x                        10176.2           6408.26   1.588x
(128000, 8192, 5120, 4)  MoEScalingType.MXFP8          35872.8              25179.2    1.425x                        10041.7           5813.94   1.727x
(128000, 8192, 5120, 8)  MoEScalingType.MXFP8          50138.8              23592      2.125x                        18598.9           6110.18   3.044x
(16384, 1536, 5120, 1)   MoEScalingType.MXFP8            808                  987.136  0.819x                          246.816          261.12   0.945x
(16384, 1536, 5120, 2)   MoEScalingType.MXFP8            855.072              914.56   0.935x                          224.496          263.2    0.853x
(16384, 1536, 5120, 4)   MoEScalingType.MXFP8            824.4               1034.11   0.797x                          287.744          273.28   1.053x
(16384, 1536, 5120, 8)   MoEScalingType.MXFP8            847.968             1033.44   0.821x                          220.384          283.712  0.777x
(128000, 1536, 5120, 1)  MoEScalingType.MXFP8           6480.8               7623.65   0.85x                          2100.29          2025.28   1.037x
(128000, 1536, 5120, 2)  MoEScalingType.MXFP8           6530.54              7277.41   0.897x                         2112.32          1929.28   1.095x
(128000, 1536, 5120, 4)  MoEScalingType.MXFP8           7770.05              6168.45   1.26x                          2020.35          1638.43   1.233x
(128000, 1536, 5120, 8)  MoEScalingType.MXFP8           7438.1               6244.24   1.191x                         1847.2           1786.94   1.034x
(16384, 2048, 7168, 1)   MoEScalingType.MXFP8           1739.78              1519.78   1.145x                          452.512          392.224  1.154x
(16384, 2048, 7168, 2)   MoEScalingType.MXFP8           1628.64              1522.7    1.07x                           468              402.432  1.163x
(16384, 2048, 7168, 4)   MoEScalingType.MXFP8           1564.16              1437.3    1.088x                          398.272          392.448  1.015x
(16384, 2048, 7168, 8)   MoEScalingType.MXFP8           1478.3               1647.55   0.897x                          416.8            420.032  0.992x
(128000, 2048, 7168, 1)  MoEScalingType.MXFP8          13811.2              11483.7    1.203x                         3793.09          3032.96   1.251x
(128000, 2048, 7168, 2)  MoEScalingType.MXFP8          12086.2              11340.2    1.066x                         3795.1           3009.82   1.261x
(128000, 2048, 7168, 4)  MoEScalingType.MXFP8          12410.9              10389.6    1.195x                         3529.3           2807.25   1.257x
(128000, 2048, 7168, 8)  MoEScalingType.MXFP8          14126                 9803.76   1.441x                         3377.52          2585.7    1.306x
```
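All times are in microseconds, as the `_us` column suffixes indicate. The speedup columns are simply the bf16 time divided by the scaled time; e.g. for the first row, 4239.74 µs / 2978.88 µs ≈ 1.423x for forward+backward, and 1229.87 µs / 758.8 µs ≈ 1.621x for forward only.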

stack-info: PR: #3129, branch: danielvegamyhre/stack/76
@danielvegamyhre force-pushed the danielvegamyhre/stack/76 branch from 51b9be2 to 168d4b7 on October 7, 2025 at 17:58
@meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors must sign the CLA before a PR can be reviewed) on Oct 7, 2025

pytorch-bot bot commented Oct 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3129

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 168d4b7 with merge base cd21d0e:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@danielvegamyhre changed the base branch from danielvegamyhre/stack/75 to main on October 7, 2025 at 20:36
@danielvegamyhre changed the base branch from main to danielvegamyhre/stack/75 on October 7, 2025 at 20:37