
Conversation

@whitneywhtsang (Contributor) commented Nov 13, 2025

This PR changes the Triton base from 4734af3 to 40dd0c4 (Oct 24).
Pass rate: 94.59%->94.91%

ita9naiwa and others added 11 commits October 24, 2025 14:51
- Enable cp.async.bulk.tensor.2d.tile::gather4.shared on sm_120 and
sm_121.
- Skip the TMA scatter4 test on sm_120 since it is unsupported by hardware (see the sketch below).

Note:
All other TMA features except for cluster-related ones are supported on
sm_120.
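
A minimal sketch of what such a capability-based skip could look like in a pytest-style test; the test name and helper below are illustrative, not the PR's actual test code:

```python
import pytest
import torch


def is_sm_120():
    # torch reports the compute capability as a (major, minor) tuple;
    # sm_120 corresponds to (12, 0).
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) == (12, 0)


def test_tma_scatter4():
    if is_sm_120():
        pytest.skip("TMA scatter4 is unsupported by sm_120 hardware")
    ...  # actual scatter4 test body
```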
This PR exposes the internal layout utilities
`chooseScaledMfmaScaleLayout` and `chooseScaledWmmaScaleLayout` to
Gluon, to help generate a linear layout for the scale used in
`mfma_scaled`/`wmma_scaled`. It also allows Gluon kernels to specify a
scalar scale value or leave it as None.
Without resetting opt_flags, the following does not work and fails with
`AssertionError: opt_flags already set; please reset to None first`:

```python
import torch
from triton_kernels.matmul_ogs import matmul_ogs, PrecisionConfig
from triton_kernels.matmul_ogs_details.opt_flags import (
    make_opt_flags,
    set_opt_flags,
)
from triton_kernels.routing import RoutingData

m = 64
n = 128
k = 32
BATCH_SIZE = 1000
dtype = torch.float16

x = torch.randn((BATCH_SIZE, m, k), device="cuda", dtype=dtype)
w = torch.randn((BATCH_SIZE, k, n), device="cuda", dtype=dtype)
bias = None

opt_flags = make_opt_flags(
    dtype,
    dtype,
    dtype,
    PrecisionConfig(),
    m,
    n,
    k,
    RoutingData(None, None, BATCH_SIZE, 1),
    True,
    False,
    False,
)

set_opt_flags(opt_flags)
tri_y = matmul_ogs(x, w, bias)

opt_flags.num_warps = 2
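# Installing the modified flags without first calling reset_opt_flags()
# trips the assertion quoted above.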
set_opt_flags(opt_flags)
tri_y = matmul_ogs(x, w, bias)
```

After adding `reset_opt_flags()` before the second call to
`set_opt_flags`, everything works fine.
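
For completeness, a minimal sketch of the working sequence; it assumes `reset_opt_flags` is importable from the same `opt_flags` module as `make_opt_flags` and `set_opt_flags`:

```python
from triton_kernels.matmul_ogs_details.opt_flags import (
    reset_opt_flags,  # assumed to live alongside make_opt_flags/set_opt_flags
    set_opt_flags,
)

set_opt_flags(opt_flags)
tri_y = matmul_ogs(x, w, bias)

# Clear the cached flags before installing a modified copy.
reset_opt_flags()

opt_flags.num_warps = 2
set_opt_flags(opt_flags)
tri_y = matmul_ogs(x, w, bias)
```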
Functions and their individual arguments are passed as an array. All the
arguments are just appended together in MLIR, but the
`WarpSpecializeOp::canonicalize` method will clean up duplicate
arguments.
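
As a conceptual illustration of that cleanup (the real logic is C++ in `WarpSpecializeOp::canonicalize`, not this Python), appending every function's arguments can produce repeated captures, which canonicalization folds down to a single capture plus a remapping of uses:

```python
def dedup_captures(captures):
    """Return the unique captures in order plus, for each original
    position, the index of the capture it now maps to."""
    index = {}
    remap = []
    for value in captures:
        if value not in index:
            index[value] = len(index)
        remap.append(index[value])
    return list(index), remap


kept, remap = dedup_captures(["%a", "%b", "%a", "%c"])
assert kept == ["%a", "%b", "%c"]
assert remap == [0, 1, 0, 2]
```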
This is in preparation for adding more examples and to be consistent
with other directory names.
…8531)

`warp_specialize` ops currently have an unknown location set in the TTGIR
due to a quirk in the code emission in `_semantic.py`: for
`warp_specialize` we need to save and then restore the insert point. The
location is inferred from the insert point, so if the insert point happens
to land somewhere that doesn't have a location assigned (the end of a
block), we set an unknown location. This change is a minimal fix that adds
a helper that, in such a case, takes the location from the block's parent.
Alternatively we could save the location along with the insert point and
then restore it accordingly. However, this approach is simpler and should
cover most cases I could think of.
This change is important for the consan changes I am working on, since
creating instrumentation function calls with an unknown location inferred
from the warp_specialize op breaks the LLVM backend.
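
A rough Python sketch of the fallback rule described above; the helper and attribute names here are hypothetical, not the actual IR-builder API:

```python
def location_for_insert_point(block, next_op=None):
    # Normal case: the insert point sits before an operation, so reuse
    # that operation's location.
    if next_op is not None:
        return next_op.location
    # End-of-block case: there is no operation to borrow a location from,
    # so fall back to the location of the block's parent op instead of
    # producing an unknown location.
    return block.parent_op.location
```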
…c (#8529)

During SWP, we check twice whether a given `LoadOp` should be lowered to
`AsyncCopyGlobalToLocalOp`: first in `AssignLatency`, and then again in
`LowerLoops`. The two checks duplicate non-trivial conditions such as
`copyVecBytes >= 4` or `op.getResultTypes()[0].getIntOrFloatBitWidth()
>= 32`.

I moved the `isPipeliningBeneficial` function from `AssignLatency` into
the utilities so that it can also be used by `LowerLoops`. It will also be
used by WS to determine whether a `LoadOp` should be lowered to cp.async
and assigned to the load partition.
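
For illustration, a Python sketch that mirrors only the two example conditions quoted above; the argument names and the way the checks are combined here are assumptions, and the real `isPipeliningBeneficial` is a C++ utility with more checks than this:

```python
def is_pipelining_beneficial(copy_vec_bytes: int, result_bit_width: int) -> bool:
    # Corresponds to the `copyVecBytes >= 4` check.
    if copy_vec_bytes < 4:
        return False
    # Corresponds to the `getIntOrFloatBitWidth() >= 32` check.
    if result_bit_width < 32:
        return False
    return True
```

Keeping one shared definition means `AssignLatency`, `LowerLoops`, and later WS all evaluate the same conditions.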
Expose `buffer_load` and `buffer_store`, inherited from CDNA3,
to gfx1250.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best
    practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)
…528)

Each aggregate class tracks its callable members, and when the aggregate
is referenced by name, the cache keys of all its members are computed.
This does require `def __init__` to be marked as `@constexpr_function`.
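
A plain-Python sketch of the caching idea (not the actual Triton implementation): the aggregate's cache key folds in a key for every callable member, so editing any method, including `__init__`, changes the key.

```python
import hashlib
import inspect


def member_cache_key(fn) -> str:
    # Hash the member's source text so editing any method changes the key.
    return hashlib.sha256(inspect.getsource(fn).encode()).hexdigest()


def aggregate_cache_key(cls) -> str:
    # Combine the keys of every callable member defined on the class.
    keys = sorted(
        member_cache_key(member)
        for name, member in vars(cls).items()
        if callable(member)
    )
    return hashlib.sha256("".join(keys).encode()).hexdigest()
```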
@whitneywhtsang self-assigned this Nov 13, 2025
@whitneywhtsang marked this pull request as ready for review November 13, 2025 17:31
@whitneywhtsang merged commit 90692f1 into main Nov 13, 2025
59 of 65 checks passed
@whitneywhtsang deleted the whitneywhtsang/merge branch November 13, 2025 17:31