Perf regression since sha f85ec6 #195

@hj-mistral

Hi team, I am investigating a performance regression in m_grouped_gemm_fp8_fp8_bf16_nt_masked. Benchmarking script to reproduce it: https://gist.github.com/hj-mistral/d38801ce8e35860a7faba1e1688546cc.

Env

GPU: H200
CUDA: 12.9

Script output

On sha 79f48ee (current head)

Average time per iteration: 26.55 us
Bandwidth: 1030.01 GB/s

On sha ea9c5d9

Average time per iteration: 26.70 us
Bandwidth: 1024.26 GB/s

On sha 3254b75

Average time per iteration: 20.16 us
Bandwidth: 1356.20 GB/s
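
To put a number on the gap, here is a minimal sketch that computes the relative change of each run against the fastest one. The timings and bandwidths are copied from the script output above. Treating 3254b75 as the baseline is an assumption made for illustration, and the helper name `relative_change` is hypothetical and not part of the benchmarking script.

```python
# Numbers copied from the script output above: (avg time in us, bandwidth in GB/s).
results = {
    "79f48ee": (26.55, 1030.01),
    "ea9c5d9": (26.70, 1024.26),
    "3254b75": (20.16, 1356.20),
}

# Assumption: the fastest run (3254b75) is used as the reference point.
BASELINE = "3254b75"


def relative_change(results, baseline):
    """Return {sha: (time change in %, bandwidth change in %)} vs. the baseline."""
    base_time, base_bw = results[baseline]
    return {
        sha: ((t / base_time - 1.0) * 100.0, (bw / base_bw - 1.0) * 100.0)
        for sha, (t, bw) in results.items()
    }


changes = relative_change(results, BASELINE)
for sha, (dt, dbw) in changes.items():
    print(f"{sha}: time {dt:+.1f}%, bandwidth {dbw:+.1f}%")
```

With these figures, 79f48ee and ea9c5d9 take roughly 32% longer per iteration than 3254b75, with about 24% lower bandwidth.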

Can you confirm?
