To reproduce:
Command: ./bin/ckProfiler gemm 2 1 1 2 0 1 32 512 7168 -1 -1 -1 3 100
GPU Type: MI300X
Searched Perf: Best Perf for datatype = bf16 ALayout = RowMajor BLayout = ColumnMajor M = 32 N = 512 K = 7168 StrideA = 7168 StrideB = 7168 StrideC = 512 : 0.0634091 ms, 3.70421 TFlops, 123.508 GB/s, DeviceGemm_Xdl_CShuffle<Default, 64, 32, 64, 32, 8, 8, 32, 32, 1, 2, 8, 8, 1, 1> LoopScheduler: Default, PipelineVersion: v2
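
As a sanity check on the figures above, the reported TFlops and GB/s can be recomputed from the problem size and the measured time. This is a minimal sketch, not the profiler's own code: it assumes the throughput is computed as 2*M*N*K floating-point operations, and the bandwidth as one pass over A, B, and C at 2 bytes per bf16 element. The function name `gemm_metrics` is hypothetical.

```python
def gemm_metrics(m, n, k, time_ms, bytes_per_elem=2):
    """Return (TFlops, GB/s) for one GEMM of size M x N x K.

    Assumptions (not confirmed by the profiler source):
      - flop count is 2*M*N*K (one multiply and one add per MAC)
      - memory traffic is A (M*K) + B (K*N) + C (M*N), each touched once
      - every element is bytes_per_elem wide (2 for bf16)
    """
    seconds = time_ms * 1e-3
    flops = 2 * m * n * k
    num_bytes = bytes_per_elem * (m * k + k * n + m * n)
    tflops = flops / seconds / 1e12
    gb_per_s = num_bytes / seconds / 1e9
    return tflops, gb_per_s


# Values taken from the "Best Perf" line above.
tflops, gb_per_s = gemm_metrics(m=32, n=512, k=7168, time_ms=0.0634091)
print(f"{tflops:.5f} TFlops, {gb_per_s:.3f} GB/s")
```

Under these assumptions the result should land very close to the reported 3.70421 TFlops and 123.508 GB/s, which suggests the log line is internally consistent and that the low throughput reflects the small M = 32 rather than a reporting error.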