B200(sm=100a) FP8 accumulator bits

We recently got access to a B200 and tested "tcgen05.mma.cta_group::1.kind::f8f6f4". We found that the accumulator keeps 25 mantissa bits, more than on H100 (13 mantissa bits).
1. Can you confirm whether our finding of 25 bits is reliable?
2. If more mantissa bits are kept, does DeepGEMM still accumulate a group of 128 in the tensor core and then move the accumulation to CUDA cores?
3. We also tested "tcgen05.mma.cta_group::1.kind::mxf4nvf4" and "tcgen05.mma.cta_group::1.kind::mxf4", but we are not sure how many mantissa bits the accumulator keeps; we tested 34, 35, 36, and 37 bits. Have you run this test yourselves, or do you have a reference?
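
To make the questions concrete, here is a minimal software model of this kind of probe. It is a sketch under stated assumptions, not a description of the hardware: it assumes the accumulator truncates (rounds toward zero) after every addition, and the names `truncate`, `simulated_dot`, `probe_fraction_bits`, and `promoted_dot` are hypothetical helpers. Python floats stand in for both the products and the higher-precision CUDA-core accumulator. On real hardware the small term would have to be built from operands representable in the input format.

```python
import math


def truncate(x: float, frac_bits: int) -> float:
    """Round x toward zero, keeping frac_bits fraction bits plus the leading 1."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (frac_bits + 1)    # frac_bits + 1 significant bits in total
    return math.ldexp(math.trunc(m * scale) / scale, e)


def simulated_dot(a, b, frac_bits):
    """Dot product whose accumulator is truncated after every add (assumption)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = truncate(acc + x * y, frac_bits)
    return acc


def probe_fraction_bits(frac_bits, max_k=50):
    """Largest k for which 1 + 2**-k survives in the simulated accumulator."""
    kept = 0
    for k in range(1, max_k + 1):
        if simulated_dot([1.0, 2.0 ** -k], [1.0, 1.0], frac_bits) != 1.0:
            kept = k
    return kept


def promoted_dot(a, b, frac_bits, group=128):
    """Two-level accumulation: truncated sums over `group` elements,
    then partial sums added in a wider accumulator (a Python float here)."""
    total = 0.0
    for start in range(0, len(a), group):
        total += simulated_dot(a[start:start + group],
                               b[start:start + group], frac_bits)
    return total
```

In this model, `probe_fraction_bits(13)` and `probe_fraction_bits(25)` recover the width they were given, which is the kind of signal the measurement above relies on. `promoted_dot` shows why grouping by 128 matters when the accumulator is narrow: small terms that a single long truncated sum would drop can survive in a later group's partial sum.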

<img width="868" height="832" alt="Image" src="https://github.com/user-attachments/assets/53f60260-d782-4957-bba6-25ebc8124f42" />

Looking forward to your reply and suggestions. Thanks a lot!

B200(sm=100a) FP8 accumulator bits #176
