@cgmillette commented Aug 18, 2025

Proposed changes

This is a first look at one way to unify the mfma and wmma backends in ck_tile. The PR is a draft, so there is a lot left to do, and it likely won't compile yet because it is incomplete.

The idea is to isolate code into single areas of responsibility and create a seamless interface for mma, regardless of the architecture you are compiling for. We'd like to remove as many #ifdefs as we can to clean up the code.

This should be a forum to analyze the design, and if we decide to go forward with it, then we can go ahead and iron out the compilation and complete the full set of wrappers for the mfma and wmma backends. Please feel free to discuss your thoughts. We can massage any part of this code to fit with different ideas and workflow.

This type of design isn't intended solely for mfma / wmma functionality; it can also be applied to other __builtins, such as the cross-lane operations in DPP, swizzle, permute, etc. Extending to new architectures becomes straightforward: we need only implement the backend code and adjust the selectors that forward to the generic interface.

What I wanted to do was put the initial design out there, to show the hierarchy of code as follows:

  • include/core/arch/wmma contains the backend wrappers for wmma builtins, selectors, traits and transforms.
  • include/core/arch/mfma contains the backend wrappers for mfma builtins, selectors, traits and transforms.
  • include/core/arch/mma*.hpp contain the front-end wrappers for generic Mma decomposition.

Selectors

These are meant to automate __builtin backend selection. For example, on some architectures BlockK may be fixed at 32 and on others at 16. The selector in the current submission recursively divides the BlockK value until it finds an instruction that supports it; if none does, it returns a pass-through. Selectors can be implemented to choose a variety of parameters, including BlockMNK sizes reduced from a larger given fragment size.
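A minimal C++17 sketch of the recursive BlockK selection described above. All names here (`MfmaOp`, `is_supported`, `BlockKSelector`, `PassThroughOp`) are hypothetical stand-ins rather than the identifiers used in this PR, the single "supported" shape is invented for illustration, and halving is only one assumed way to divide BlockK.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical backend wrapper. The primary template means "no instruction
// exists for this combination"; specializations mark the supported shapes.
template <typename DataT, std::uint32_t BlockM, std::uint32_t BlockN, std::uint32_t BlockK>
struct MfmaOp
{
    static constexpr bool is_supported = false;
};

// Assumption for illustration only: the target offers a 16x16x4 float op.
template <>
struct MfmaOp<float, 16, 16, 4>
{
    static constexpr bool is_supported = true;
};

// Returned when no instruction matches any candidate BlockK.
struct PassThroughOp
{
};

// Primary template: the requested BlockK is unsupported, so halve it and retry.
template <typename DataT,
          std::uint32_t BlockM,
          std::uint32_t BlockN,
          std::uint32_t BlockK,
          typename Enable = void>
struct BlockKSelector
{
    using type = typename BlockKSelector<DataT, BlockM, BlockN, BlockK / 2>::type;
};

// A backend exists for this BlockK: select it and stop recursing.
template <typename DataT, std::uint32_t BlockM, std::uint32_t BlockN, std::uint32_t BlockK>
struct BlockKSelector<DataT, BlockM, BlockN, BlockK,
                      std::enable_if_t<MfmaOp<DataT, BlockM, BlockN, BlockK>::is_supported>>
{
    using type = MfmaOp<DataT, BlockM, BlockN, BlockK>;
};

// BlockK reached zero without a match: fall back to the pass-through.
template <typename DataT, std::uint32_t BlockM, std::uint32_t BlockN>
struct BlockKSelector<DataT, BlockM, BlockN, 0, void>
{
    using type = PassThroughOp;
};
```

With these definitions, `BlockKSelector<float, 16, 16, 32>::type` walks 32, 16, 8 and settles on the 16x16x4 op, while a request with no matching specialization resolves to `PassThroughOp`.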

Traits

Traits classes take the Mfma/WmmaOp backend classes and expose their metadata. We can encode template params, instruction traits and properties that are common to all Mma operations, and make them available anywhere at compile time. This also simplifies the unified interface: only the Mfma/WmmaOp backend class types are passed as template parameters, and further metadata is extracted through traits classes as needed. This keeps the code robust and makes the unified interfaces simple to maintain.
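As a rough C++17 illustration of the traits idea, the sketch below passes only an op type to generic code and recovers everything else through a traits class. `MfmaOp`, `WmmaOp`, `MmaOpTraits`, `flops_per_block` and `is_wmma_v` are hypothetical names chosen for this example, not the interfaces in this PR.

```cpp
#include <cstdint>
#include <type_traits>

struct MfmaFamily {};
struct WmmaFamily {};

// Hypothetical backend op classes; real ones would wrap a __builtin.
template <typename InT, typename AccT, std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct MfmaOp {};

template <typename InT, typename AccT, std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct WmmaOp {};

// Metadata shared by every Mma op, regardless of family.
template <typename FamilyT, typename InT, typename AccT,
          std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct MmaOpTraitsBase
{
    using Family    = FamilyT;
    using InputType = InT;
    using AccumType = AccT;
    static constexpr std::uint32_t BlockM = M;
    static constexpr std::uint32_t BlockN = N;
    static constexpr std::uint32_t BlockK = K;
};

// Primary template is left undefined so non-op types fail to compile.
template <typename Op>
struct MmaOpTraits;

template <typename InT, typename AccT, std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct MmaOpTraits<MfmaOp<InT, AccT, M, N, K>>
    : MmaOpTraitsBase<MfmaFamily, InT, AccT, M, N, K>
{
};

template <typename InT, typename AccT, std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct MmaOpTraits<WmmaOp<InT, AccT, M, N, K>>
    : MmaOpTraitsBase<WmmaFamily, InT, AccT, M, N, K>
{
};

// Generic code takes only the op type and queries the traits as needed.
template <typename Op>
constexpr std::uint64_t flops_per_block()
{
    using Traits = MmaOpTraits<Op>;
    return 2ull * Traits::BlockM * Traits::BlockN * Traits::BlockK;
}

template <typename Op>
inline constexpr bool is_wmma_v =
    std::is_same_v<typename MmaOpTraits<Op>::Family, WmmaFamily>;
```

Because `flops_per_block` never names a family or a block size directly, adding a new backend family in this sketch only requires one more `MmaOpTraits` specialization.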

Transforms

Some architectures differ substantially from others, gfx11 for example: it requires duplicated data in the A/B inputs, and all accumulators are unpadded in b32 storage. Extra steps are sometimes required in the Mma workflow to handle these differences. Introducing transform selectors for such architectures can automate these workflow changes and allow a more generic top-level workflow that needn't worry about low-level quirks like these.
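One way a transform selector could look, as a host-side C++17 toy. The arch tags, `InputTransformSelector`, and `prepare_input` are hypothetical names, and `DuplicateLowerHalf` is only a stand-in for "duplicated data": the actual gfx11 register layout is not spelled out in this description and is not modeled here.

```cpp
#include <array>
#include <cstddef>

// Hypothetical architecture tags.
struct Gfx9Tag {};
struct Gfx11Tag {};

// Default input transform: leave the data untouched.
struct PassThroughTransform
{
    template <typename T, std::size_t N>
    static constexpr std::array<T, N> exec(const std::array<T, N>& v)
    {
        return v;
    }
};

// Toy stand-in for a duplication step: copy the lower half of the buffer
// into the upper half. This is NOT the real gfx11 layout.
struct DuplicateLowerHalf
{
    template <typename T, std::size_t N>
    static constexpr std::array<T, N> exec(const std::array<T, N>& v)
    {
        std::array<T, N> out = v;
        for(std::size_t i = 0; i < N / 2; ++i)
        {
            out[N / 2 + i] = v[i];
        }
        return out;
    }
};

// The selector maps an architecture to the transform it needs.
template <typename ArchTag>
struct InputTransformSelector
{
    using type = PassThroughTransform;
};

template <>
struct InputTransformSelector<Gfx11Tag>
{
    using type = DuplicateLowerHalf;
};

// The top-level workflow stays generic: it never mentions gfx11 itself.
template <typename ArchTag, typename T, std::size_t N>
constexpr std::array<T, N> prepare_input(const std::array<T, N>& v)
{
    using Transform = typename InputTransformSelector<ArchTag>::type;
    return Transform::exec(v);
}
```

In this sketch the architecture-specific quirk lives entirely in the selector specialization, so the caller of `prepare_input` is the same on every target.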

Generic Mma

The top-level Mma class is intended to provide a block-wise decomposition of an incoming fragment using 3 key inputs:

  • a selected __builtin backend
  • a set of pre-mma input transforms
  • a set of post-mma output transforms

By default, automated Selector classes will provide these inputs. However, the user can easily replace any of them with inputs of their choosing for their specific use case. This ability to "hot swap" inputs gives us great testability and doesn't lock the code into a specific selection framework. The design is flexible and straightforward.
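The block-wise decomposition with swappable inputs could be sketched on the host as follows (C++17). `RefOp2x2x2` is a scalar reference stand-in for a __builtin backend, and `Mma`, `Identity` and `DoubleAll` are hypothetical names for this example; in the PR the defaults would come from the selector classes rather than being hard-coded.

```cpp
#include <array>
#include <cstddef>

// Hypothetical backend: scalar reference for one 2x2x2 block,
// computing c += a * b on row-major data with given leading dimensions.
struct RefOp2x2x2
{
    static constexpr std::size_t BlockM = 2, BlockN = 2, BlockK = 2;

    template <typename T>
    static constexpr void exec(const T* a, std::size_t lda,
                               const T* b, std::size_t ldb,
                               T* c, std::size_t ldc)
    {
        for(std::size_t i = 0; i < BlockM; ++i)
            for(std::size_t j = 0; j < BlockN; ++j)
                for(std::size_t k = 0; k < BlockK; ++k)
                    c[i * ldc + j] += a[i * lda + k] * b[k * ldb + j];
    }
};

// Default pre/post transform: return the fragment unchanged.
struct Identity
{
    template <typename T>
    static constexpr T exec(T v) { return v; }
};

// Example replacement post-transform, used to show "hot swapping".
struct DoubleAll
{
    template <typename T, std::size_t S>
    static constexpr std::array<T, S> exec(std::array<T, S> v)
    {
        for(std::size_t i = 0; i < S; ++i) v[i] *= 2;
        return v;
    }
};

// Generic front end: splits an MxNxK fragment into Op-sized blocks.
template <std::size_t M, std::size_t N, std::size_t K,
          typename Op = RefOp2x2x2, typename PreA = Identity,
          typename PreB = Identity, typename PostC = Identity>
struct Mma
{
    static_assert(M % Op::BlockM == 0 && N % Op::BlockN == 0 && K % Op::BlockK == 0,
                  "fragment must be a whole number of blocks");

    template <typename T>
    static constexpr std::array<T, M * N> exec(const std::array<T, M * K>& a,
                                               const std::array<T, K * N>& b,
                                               std::array<T, M * N> c)
    {
        const auto ta = PreA::exec(a); // pre-mma input transforms
        const auto tb = PreB::exec(b);
        for(std::size_t bm = 0; bm < M; bm += Op::BlockM)
            for(std::size_t bn = 0; bn < N; bn += Op::BlockN)
                for(std::size_t bk = 0; bk < K; bk += Op::BlockK)
                    Op::exec(ta.data() + bm * K + bk, K,
                             tb.data() + bk * N + bn, N,
                             c.data() + bm * N + bn, N);
        return PostC::exec(c); // post-mma output transform
    }
};
```

Swapping a single template argument, for example `Mma<2, 2, 2, RefOp2x2x2, Identity, Identity, DoubleAll>`, changes one stage without touching the decomposition loop, which is the property that makes each stage testable in isolation.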

Additional Changes

  • Added extra namespace scoping to avoid polluting the ck_tile namespace with implementation details.
  • Added compiler awareness of the "current target" so that only backends relevant to that target are exposed.
  • Added mfma / wmma implementations with padded K dimensions to support dims smaller than 16. This may sometimes be needed during partial tile decomposition.
  • Separated mfma / wmma into arch families to give context to the backend implementations.
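A hedged C++17 sketch of what "current target" awareness might look like: the preprocessor checks are confined to one place that produces a type tag, and everything downstream is ordinary template code. This assumes the compiler defines per-target macros such as `__gfx90a__` or `__gfx1100__` during device compilation; the macro list below is illustrative and incomplete, and the tag and variable names are hypothetical.

```cpp
#include <type_traits>

// Hypothetical target family tags.
struct HostTarget {};
struct Gfx9Target {};
struct Gfx11Target {};

// The only place that looks at target macros (assumed to be compiler-defined).
#if defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx942__)
using CurrentTarget = Gfx9Target;
#elif defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
using CurrentTarget = Gfx11Target;
#else
using CurrentTarget = HostTarget;
#endif

// Which backend family a target exposes, expressed without further #ifdefs.
template <typename Target>
inline constexpr bool has_mfma_v = std::is_same_v<Target, Gfx9Target>;

template <typename Target>
inline constexpr bool has_wmma_v = std::is_same_v<Target, Gfx11Target>;

// Downstream code can branch on the tag at compile time.
template <typename Target = CurrentTarget>
constexpr const char* backend_name()
{
    if constexpr(has_mfma_v<Target>)
        return "mfma";
    else if constexpr(has_wmma_v<Target>)
        return "wmma";
    else
        return "none";
}
```

Passing the target as a template parameter, with `CurrentTarget` as the default, also lets host-side tests instantiate the logic for any target family.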

[Image: Unified_wmma_mfma (1)]

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@cgmillette added the noCI, ci:docs-only and WIP labels Aug 18, 2025
@cgmillette marked this pull request as ready for review October 29, 2025 23:42
@cgmillette removed the noCI and ci:docs-only labels Oct 29, 2025