First look at mfma / wmma unification #2704

cgmillette · 2025-08-18T20:02:22Z

Proposed changes

This is a first look at one way we could unify the mfma and wmma backends in ck_tile. The PR is in draft, so there is a lot left to do, and it likely won't compile yet because it is incomplete.

The idea here is to isolate code into single areas of responsibility and create a seamless interface for mma, regardless of the architecture you are compiling for. We'd like to remove as many #ifdefs as we can to clean up the code.

This should be a forum to analyze the design, and if we decide to go forward with it, then we can go ahead and iron out the compilation and complete the full set of wrappers for the mfma and wmma backends. Please feel free to discuss your thoughts. We can massage any part of this code to fit with different ideas and workflow.

This type of design isn't intended solely for mfma / wmma functionality; it can also be applied easily to other __builtins, such as the cross-lane operations in DPP, swizzle, permute, etc. Extending to new architectures becomes straightforward, as we need only implement the backend code and adjust the selectors that forward to the generic interface.

What I wanted to do was put the initial design out there, to show the hierarchy of code as follows:

include/core/arch/wmma contains the backend wrappers for wmma builtins, selectors, traits and transforms.
include/core/arch/mfma contains the backend wrappers for mfma builtins, selectors, traits and transforms.
include/core/arch/mma*.hpp contains the front-end wrappers for generic Mma decomposition.

Selectors

These are meant to provide some automation for __builtin backend selection. For example, BlockK may be fixed at 32 on some architectures and at 16 on others. The selector in the current submission recursively divides the BlockK value until it finds an instruction that supports it; if none does, it returns a pass-through. Selectors can be implemented to select a variety of parameters, including BlockMNK sizes, working down from a larger given fragment size.
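As a rough sketch of the recursive selection described above (every name here is hypothetical and not taken from this PR, and halving is only an assumed way of dividing BlockK), a selector could be written with a defaulted bool parameter that chooses between "use this op" and "divide and retry":

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical fallback returned when no instruction matches.
struct PassThroughOp {};

// Hypothetical backend op, parameterized only on BlockK for brevity.
template <std::uint32_t BlockK>
struct FakeMmaOp { static constexpr std::uint32_t kBlockK = BlockK; };

// Which BlockK values the (made-up) target supports. Only 16 here.
template <std::uint32_t BlockK>
struct IsSupported : std::false_type {};
template <>
struct IsSupported<16> : std::true_type {};

// Unsupported BlockK: halve it and try again (assumed division step).
template <std::uint32_t BlockK, bool Supported = IsSupported<BlockK>::value>
struct BlockKSelector { using type = typename BlockKSelector<BlockK / 2>::type; };

// Supported BlockK: select the matching backend op.
template <std::uint32_t BlockK>
struct BlockKSelector<BlockK, true> { using type = FakeMmaOp<BlockK>; };

// Recursion bottomed out without a match: return the pass-through.
template <>
struct BlockKSelector<0, false> { using type = PassThroughOp; };

static_assert(std::is_same<BlockKSelector<64>::type, FakeMmaOp<16>>::value, "64 -> 32 -> 16");
static_assert(std::is_same<BlockKSelector<8>::type, PassThroughOp>::value, "no match below 16");
```

A request for BlockK = 64 resolves to the 16-wide op after two halvings, while a request for 8 never finds a supported size and falls through to the pass-through.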

Traits

Traits classes take the Mfma/WmmaOp backend classes and provide their metadata. We can encode the template params, instruction traits and properties that are common to all Mma operations, and make them available anywhere at compile time. This also simplifies the unified interface: we pass only the Mfma/WmmaOp backend class types as template parameters and extract further metadata with traits classes as needed. This keeps the unified interfaces robust and simple to maintain.
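To illustrate the traits idea, here is a minimal sketch with hypothetical names (ExampleMfmaOp, ExampleWmmaOp, MmaOpTraits and MacsPerBlock are placeholders, not the classes in this PR). Generic code receives only the op type and recovers everything else through the traits class:

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical family tags.
struct MfmaFamily {};
struct WmmaFamily {};

// Hypothetical backend op classes; bodies omitted for brevity.
template <typename DataT, std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct ExampleMfmaOp {};
template <typename DataT, std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct ExampleWmmaOp {};

// Primary template is left undefined: unknown ops fail at compile time.
template <typename Op>
struct MmaOpTraits;

// Traits recover the template params of each backend op.
template <typename DataT, std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct MmaOpTraits<ExampleMfmaOp<DataT, M, N, K>> {
    using DataType = DataT;
    using Family   = MfmaFamily;
    static constexpr std::uint32_t kBlockM = M;
    static constexpr std::uint32_t kBlockN = N;
    static constexpr std::uint32_t kBlockK = K;
};

template <typename DataT, std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct MmaOpTraits<ExampleWmmaOp<DataT, M, N, K>> {
    using DataType = DataT;
    using Family   = WmmaFamily;
    static constexpr std::uint32_t kBlockM = M;
    static constexpr std::uint32_t kBlockN = N;
    static constexpr std::uint32_t kBlockK = K;
};

// Generic code takes only the op type and queries the traits.
template <typename Op>
constexpr std::uint32_t MacsPerBlock() {
    return MmaOpTraits<Op>::kBlockM * MmaOpTraits<Op>::kBlockN * MmaOpTraits<Op>::kBlockK;
}

static_assert(MacsPerBlock<ExampleMfmaOp<float, 32, 32, 8>>() == 8192, "32 * 32 * 8");
static_assert(std::is_same<MmaOpTraits<ExampleWmmaOp<float, 16, 16, 16>>::Family,
                           WmmaFamily>::value, "family recovered from op type");
```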

Transforms

Some architectures differ substantially from others; gfx11 is one example. It requires duplicated data in A/B inputs, and all accumulators are unpadded in b32 storage. Extra steps are sometimes required in the Mma workflow to handle these differences. Introducing transform selectors for such architectures can automate these workflow changes and allow a more generic top-level workflow that needn't worry about low-level quirks like these.
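A transform selector could look roughly like the following sketch. The arch tags, transform names and the kDuplicatesInput flag are all hypothetical stand-ins, and the transforms are reduced to tags rather than real data movement:

```cpp
#include <type_traits>

// Hypothetical architecture tags.
struct Gfx9Arch {};
struct Gfx11Arch {};

// Hypothetical transforms, reduced to tags with one descriptive flag.
struct PassThroughTransform  { static constexpr bool kDuplicatesInput = false; };
struct DuplicateInputTransform { static constexpr bool kDuplicatesInput = true; };

// By default no input transform is needed.
template <typename Arch>
struct InputTransformSelector { using type = PassThroughTransform; };

// gfx11 needs duplicated A/B inputs, so it selects a different transform.
template <>
struct InputTransformSelector<Gfx11Arch> { using type = DuplicateInputTransform; };

static_assert(!InputTransformSelector<Gfx9Arch>::type::kDuplicatesInput, "gfx9: pass-through");
static_assert(InputTransformSelector<Gfx11Arch>::type::kDuplicatesInput, "gfx11: duplicate inputs");
```

The top-level workflow then asks the selector for a transform and applies it unconditionally, so the gfx11 special case lives in one specialization.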

Generic Mma

The top-level Mma class is intended to provide a block-wise decomposition of an incoming fragment using 3 key inputs:

a selected __builtin backend
a set of pre-mma input transforms
a set of post-mma output transforms

By default, automated Selector classes will provide these inputs. However, the user can replace any of them with inputs of their choosing for a specific use case. This ability to "hot swap" inputs makes the code easy to test and doesn't lock it into a specific selection framework. The design is flexible and straightforward.
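The three inputs and the "hot swap" ability can be sketched with defaulted template parameters. This is a toy model under stated assumptions: ScalarFmaOp stands in for a __builtin backend, each K "block" is a single scalar element, and none of the names come from this PR:

```cpp
#include <cstddef>

// Hypothetical backend: a scalar multiply-accumulate standing in for a builtin.
struct ScalarFmaOp {
    static constexpr float exec(float a, float b, float c) { return a * b + c; }
};

// Hypothetical transforms applied before and after the mma.
struct Identity { static constexpr float exec(float v) { return v; } };
struct Negate   { static constexpr float exec(float v) { return -v; } };

// The defaults play the role of the automated selectors; any can be replaced.
template <typename Op = ScalarFmaOp, typename Pre = Identity, typename Post = Identity>
struct GenericMma {
    // Walk the K dimension one element at a time, accumulating into c.
    static constexpr float accumulate(const float* a, const float* b, float c, std::size_t k) {
        return k == 0 ? c
                      : accumulate(a + 1, b + 1,
                                   Op::exec(Pre::exec(*a), Pre::exec(*b), c), k - 1);
    }
    // Apply the post-mma transform to the accumulated result.
    static constexpr float exec(const float* a, const float* b, float c, std::size_t k) {
        return Post::exec(accumulate(a, b, c, k));
    }
};

constexpr float kA[] = {1.0f, 2.0f, 3.0f};
constexpr float kB[] = {4.0f, 5.0f, 6.0f};

// Default inputs: 1*4 + 2*5 + 3*6 = 32.
static_assert(GenericMma<>::exec(kA, kB, 0.0f, 3) == 32.0f, "defaults");
// Swapping only the post-mma transform negates the result.
static_assert(GenericMma<ScalarFmaOp, Identity, Negate>::exec(kA, kB, 0.0f, 3) == -32.0f,
              "hot-swapped post transform");
```

Because each input is an ordinary template parameter, a unit test can inject a trivial backend or transform without touching the selection logic.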

Additional Changes

Added extra namespace scoping to avoid polluting the ck_tile namespace with implementation details.
Added compiler awareness of the "current target" so that we expose only the backends relevant to that target.
Added mfma / wmma implementations with padded K dimensions to support dims smaller than 16. This may sometimes be needed during partial tile decomposition.
Separated mfma / wmma into arch families to give context to the backend implementations.
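One possible shape for the "current target" awareness is sketched below. It assumes the per-processor macros that clang predefines when compiling for an AMDGPU target (such as __gfx90a__ and __gfx1100__); the target structs and flags are hypothetical, and only two device targets are checked for brevity:

```cpp
// Hypothetical target descriptions; the flags are illustrative only.
struct HostTarget    { static constexpr bool kHasMfma = false; static constexpr bool kHasWmma = false; };
struct Gfx90aTarget  { static constexpr bool kHasMfma = true;  static constexpr bool kHasWmma = false; };
struct Gfx1100Target { static constexpr bool kHasMfma = false; static constexpr bool kHasWmma = true;  };

// The only place the preprocessor is consulted; everything downstream
// depends on the CurrentTarget type instead of on #ifdefs.
#if defined(__gfx90a__)
using CurrentTarget = Gfx90aTarget;
#elif defined(__gfx1100__)
using CurrentTarget = Gfx1100Target;
#else
using CurrentTarget = HostTarget; // host pass, or a target not listed here
#endif

// In this sketch no target exposes both backend families at once.
static_assert(!(CurrentTarget::kHasMfma && CurrentTarget::kHasWmma),
              "at most one backend family per target");
```

Backend wrappers can then be enabled or disabled by specializing on CurrentTarget, which keeps the preprocessor checks confined to a single header.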

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which helps the maintainers understand the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

cgmillette added the noCI, ci:docs-only and WIP labels Aug 18, 2025

First look at mfma / wmma unification

909403a

cgmillette force-pushed the unify_mfma_wmma branch from 6f7e264 to 909403a August 19, 2025 09:25

cgmillette added 6 commits October 22, 2025 23:43

Refactor

512ccc1

Re-org file structure

dea59a2

Restructure transform selection and WaveWiseMma class

8af880c

Update license files. Add missing gfx1151 support. Change wave size for HOST to 1. Update datatypes naming consistency

82405e0

Fixes default MmaSelector implementation

c290e1c

Adds unit tests for amdgcn_mma and arch

ba2c5a2

cgmillette marked this pull request as ready for review October 29, 2025 23:42

cgmillette requested review from ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent and vidyasagar-amd as code owners October 29, 2025 23:42

cgmillette removed the noCI and ci:docs-only labels Oct 29, 2025
