Open
Description
Hi team, I am investigating a performance regression in m_grouped_gemm_fp8_fp8_bf16_nt_masked. Benchmarking script to reproduce: https://gist.github.com/hj-mistral/d38801ce8e35860a7faba1e1688546cc.
Env
GPU: H200
CUDA: 12.9
Script output
On sha 79f48ee (current head)
Average time per iteration: 26.55 us
Bandwidth: 1030.01 GB/s
On sha ea9c5d9
Average time per iteration: 26.70 us
Bandwidth: 1024.26 GB/s
On sha 3254b75
Average time per iteration: 20.16 us
Bandwidth: 1356.20 GB/s
Can you confirm?
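To put a number on the gap, here is a small sketch that computes the relative change against sha 3254b75 from the figures reported above. It only does arithmetic on the posted output and does not run the kernel; the `results` dict and the `relative_change` helper are names made up for this illustration and are not part of the gist.

```python
# Figures copied from the script output above:
# sha -> (average time per iteration in us, bandwidth in GB/s).
results = {
    "79f48ee": (26.55, 1030.01),  # current head
    "ea9c5d9": (26.70, 1024.26),
    "3254b75": (20.16, 1356.20),  # fastest of the three, used as baseline
}


def relative_change(new, old):
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


base_time, base_bw = results["3254b75"]
summary = {}
for sha in ("79f48ee", "ea9c5d9"):
    time_us, bw = results[sha]
    summary[sha] = (
        relative_change(time_us, base_time),
        relative_change(bw, base_bw),
    )
    print(
        f"{sha} vs 3254b75: "
        f"time {summary[sha][0]:+.1f}%, bandwidth {summary[sha][1]:+.1f}%"
    )
```

With the numbers above, both slower shas come out roughly 32% slower per iteration and about 24% lower in bandwidth than 3254b75.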