First look at mfma / wmma unification #2704
Open
+2,896
−11
Proposed changes
This is a first look at a potential way to unify the mfma and wmma backends in ck_tile. The PR is in draft, so there is a lot left to do, and it likely won't compile because it isn't complete.
The idea here is to isolate code into single areas of responsibility and create a seamless interface for mma, regardless of the architecture you are compiling for. We'd like to get rid of as many #ifdefs as we can to clean up the code.
This should be a forum to analyze the design, and if we decide to go forward with it, then we can go ahead and iron out the compilation and complete the full set of wrappers for the mfma and wmma backends. Please feel free to discuss your thoughts. We can massage any part of this code to fit with different ideas and workflow.
This type of design isn't intended solely for mfma / wmma functionality; it can also be applied easily to other __builtins, such as the cross-lane operations in DPP, swizzle, permute, etc. Extending to new architectures becomes trivial, as we need only implement the backend code and adjust the selectors that forward to the generic interface.
What I wanted to do was put the initial design out there, to show the hierarchy of code as follows:
Selectors
These are meant to provide some automation for __builtin backend selection. For example, BlockK may be fixed at 32 on certain architectures and at 16 on others. The selector in the current submission recursively divides the BlockK value until it finds an instruction that supports it (or finds none, and returns a pass-through). Selectors can be implemented to select a variety of different parameters, including BlockMNK sizes, working down from a larger given fragment size.
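To make the recursive selection concrete, here is a minimal sketch of the idea. It is not the code in this PR: the names `ArchA`, `ArchB`, `MmaOp`, `MmaSelector`, and `PassThroughOp` are hypothetical, and the sketch assumes that "divide" means halving BlockK at each step.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical architecture tags (illustrative only, not ck_tile names).
struct ArchA {}; // assumed to support BlockK = 32 only
struct ArchB {}; // assumed to support BlockK = 16 only

// Pass-through op: what the selector returns when no instruction fits.
struct PassThroughOp {
    static constexpr bool kIsSupported = false;
    static constexpr std::uint32_t kBlockK = 0;
};

// Backend op table: unsupported unless specialized below.
template <typename Arch, std::uint32_t BlockK>
struct MmaOp : PassThroughOp {};

template <>
struct MmaOp<ArchA, 32> {
    static constexpr bool kIsSupported = true;
    static constexpr std::uint32_t kBlockK = 32;
};

template <>
struct MmaOp<ArchB, 16> {
    static constexpr bool kIsSupported = true;
    static constexpr std::uint32_t kBlockK = 16;
};

// Primary template: the requested BlockK is unsupported, so halve and retry.
template <typename Arch, std::uint32_t BlockK, typename Enable = void>
struct MmaSelector {
    using type = typename MmaSelector<Arch, BlockK / 2>::type;
};

// Chosen when a backend op exists for exactly this BlockK.
template <typename Arch, std::uint32_t BlockK>
struct MmaSelector<Arch, BlockK,
                   std::enable_if_t<MmaOp<Arch, BlockK>::kIsSupported>> {
    using type = MmaOp<Arch, BlockK>;
};

// Recursion floor: nothing matched, fall back to the pass-through.
template <typename Arch>
struct MmaSelector<Arch, 0, void> {
    using type = PassThroughOp;
};

// A BlockK of 64 resolves to each architecture's native size.
static_assert(MmaSelector<ArchA, 64>::type::kBlockK == 32, "ArchA picks 32");
static_assert(MmaSelector<ArchB, 64>::type::kBlockK == 16, "ArchB picks 16");
```

Because the selection is resolved entirely through template specialization, call sites in a sketch like this would need no architecture #ifdefs; only the op table changes per target.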
Traits
Traits classes take the Mfma/WmmaOp backend classes and provide their meta-data. We are able to encode template params, instruction traits and properties that are common to all Mma operations, and make them available anywhere at compile time. We are also able to simplify the unified interface, passing only the Mfma/WmmaOp backend class types as template parameters, and extracting further meta-data with traits classes as needed. This way the code is robust and maintenance of unified interfaces is very simple.
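As an illustration of the traits layer, the sketch below recovers meta-data from a backend op type so that a generic function needs only that one type as a template parameter. `ExampleMfmaOp`, `MmaOpTraits`, and `macs_per_block` are hypothetical names chosen for this example, not identifiers from the PR.

```cpp
#include <cstdint>

// Hypothetical backend op: carries its parameters only in its type.
template <typename ADataT, typename CDataT,
          std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct ExampleMfmaOp {};

// Primary traits template is left undefined: using an unknown op is a
// compile-time error rather than a silent default.
template <typename Op>
struct MmaOpTraits;

// Partial specialization extracts the template params as meta-data.
template <typename ADataT, typename CDataT,
          std::uint32_t M, std::uint32_t N, std::uint32_t K>
struct MmaOpTraits<ExampleMfmaOp<ADataT, CDataT, M, N, K>> {
    using ADataType = ADataT;
    using CDataType = CDataT;
    static constexpr std::uint32_t kBlockM = M;
    static constexpr std::uint32_t kBlockN = N;
    static constexpr std::uint32_t kBlockK = K;
    static constexpr bool kIsMfma = true;
    static constexpr bool kIsWmma = false;
};

// A generic interface takes only the op type and queries traits as needed.
template <typename Op>
constexpr std::uint32_t macs_per_block() {
    using Traits = MmaOpTraits<Op>;
    return Traits::kBlockM * Traits::kBlockN * Traits::kBlockK;
}

using Op32x32x8 = ExampleMfmaOp<float, float, 32, 32, 8>;
static_assert(macs_per_block<Op32x32x8>() == 32u * 32u * 8u,
              "multiply-accumulates per block are derived from traits");
```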
Transforms
Some architectures are very different from others. gfx11, for example, requires duplicated data in the A/B inputs, and all accumulators are unpadded in b32 storage. Extra steps are sometimes required in the Mma workflow to handle these differences. Introducing transform selectors for certain architectures can automate such workflow changes and allow a more generic top-level workflow in which we needn't worry about low-level quirks like these.
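A transform selector could look roughly like the following sketch, which requires C++17. All names here (`Gfx9Like`, `Gfx11Like`, `InputTransformSelector`, `prepare_input`) are hypothetical, and appending a second copy of the input is only a placeholder for "duplicated data": the actual register layout gfx11 expects is not described here and is not modeled.

```cpp
#include <array>
#include <cstddef>

// Hypothetical architecture tags (illustrative only).
struct Gfx9Like {};
struct Gfx11Like {};

// Default transform: forwards the input unchanged.
struct PassThroughTransform {
    template <typename T, std::size_t N>
    static constexpr std::array<T, N> apply(const std::array<T, N>& v) {
        return v;
    }
};

// Placeholder duplication: emits the input followed by a copy of itself.
struct DuplicateTransform {
    template <typename T, std::size_t N>
    static constexpr std::array<T, 2 * N> apply(const std::array<T, N>& v) {
        std::array<T, 2 * N> out{};
        for (std::size_t i = 0; i < N; ++i) {
            out[i]     = v[i];
            out[i + N] = v[i];
        }
        return out;
    }
};

// Selector: pass-through unless an architecture opts into a transform.
template <typename Arch>
struct InputTransformSelector {
    using type = PassThroughTransform;
};

template <>
struct InputTransformSelector<Gfx11Like> {
    using type = DuplicateTransform;
};

// Generic workflow step: no architecture-specific branches at the call site.
template <typename Arch, typename T, std::size_t N>
constexpr auto prepare_input(const std::array<T, N>& v) {
    using Transform = typename InputTransformSelector<Arch>::type;
    return Transform::apply(v);
}

static_assert(prepare_input<Gfx9Like>(std::array<int, 2>{{1, 2}}).size() == 2,
              "pass-through keeps the input size");
static_assert(prepare_input<Gfx11Like>(std::array<int, 2>{{1, 2}}).size() == 4,
              "duplication doubles the input size");
```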
Generic Mma
The top-level Mma class is intended to provide a block-wise decomposition of an incoming fragment using 3 key inputs:
By default, automated Selector classes will provide these inputs. However, the user can easily replace any of them with inputs of their own choosing for a specific use case. This ability to "hot swap" inputs gives us great testability and doesn't lock the code into a specific selection framework. The design is flexible and straightforward.
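The "hot swap" idea can be sketched with defaulted template parameters. The sketch uses a single hypothetical input (the op type) rather than the three inputs of the actual design, and `DefaultOpSelector`, `DefaultOp`, `TestOp`, and `Mma` are illustrative names only.

```cpp
#include <cstdint>

// Hypothetical op a selector might pick by default.
struct DefaultOp {
    static constexpr std::uint32_t kBlockK = 16;
};

// Hypothetical op that a unit test might inject instead.
struct TestOp {
    static constexpr std::uint32_t kBlockK = 4;
};

// Stand-in for an automated selector (always returns DefaultOp here).
template <std::uint32_t FragK>
struct DefaultOpSelector {
    using type = DefaultOp;
};

// Top-level class: the selector supplies the default, and the caller can
// override it by passing a second template argument explicitly.
template <std::uint32_t FragK,
          typename Op = typename DefaultOpSelector<FragK>::type>
struct Mma {
    static_assert(FragK % Op::kBlockK == 0,
                  "fragment K must be a multiple of the op's BlockK");
    // Number of block-wise steps needed to cover the fragment along K.
    static constexpr std::uint32_t kNumKBlocks = FragK / Op::kBlockK;
};

static_assert(Mma<64>::kNumKBlocks == 4, "selector-provided op");
static_assert(Mma<64, TestOp>::kNumKBlocks == 16, "user-provided op");
```

Since the override is just another template argument, a test can exercise the decomposition logic with a trivial op without touching the selection code.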
Additional Changes
Checklist
Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
clang-format on all changed files
Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered