Skip to content

Conversation

banach-space
Copy link
Contributor

Motivation

Improve readability and stability of autogenerated CHECK lines by using
operation-aware FileCheck variable names instead of generic VAL_N.

What changes

  • When possible, variable names are derived from the MLIR op name, e.g.
    vector.transfer_readTRANSFER_READ_0.
  • Unknown ops (e.g., from out-of-tree dialects) fall back to the prior
    VAL_N scheme.

Before

  // CHECK: %[[VAL_4:.*]] = vector.transfer_read ...
  // CHECK: %[[VAL_5:.*]] = "val_use"(%[[VAL_4]]) : ...

After

  // CHECK: %[[TRANSFER_READ_0:.*]] = vector.transfer_read ...
  // CHECK: %[[VAL_1:.*]] = "val_use"(%[[TRANSFER_READ_0]]) : ...

Rationale

Using op-derived names (e.g., TRANSFER_READ_0) makes tests easier to
read and audit, while remaining more stable across unrelated edits (e.g.
there will always be fewer TRANSFER_READ_#N variables than VAL_#N).
The fallback to VAL_N preserves compatibility for unknown ops.

@llvmbot
Copy link
Member

llvmbot commented Sep 26, 2025

@llvm/pr-subscribers-mlir

Author: Andrzej Warzyński (banach-space)

Changes

Motivation

Improve readability and stability of autogenerated CHECK lines by using
operation-aware FileCheck variable names instead of generic VAL_N.

What changes

  • When possible, variable names are derived from the MLIR op name, e.g.
    vector.transfer_readTRANSFER_READ_0.
  • Unknown ops (e.g., from out-of-tree dialects) fall back to the prior
    VAL_N scheme.

Before

  // CHECK: %[[VAL_4:.*]] = vector.transfer_read ...
  // CHECK: %[[VAL_5:.*]] = "val_use"(%[[VAL_4]]) : ...

After

  // CHECK: %[[TRANSFER_READ_0:.*]] = vector.transfer_read ...
  // CHECK: %[[VAL_1:.*]] = "val_use"(%[[TRANSFER_READ_0]]) : ...

Rationale

Using op-derived names (e.g., TRANSFER_READ_0) makes tests easier to
read and audit, while remaining more stable across unrelated edits (e.g.
there will always be fewer TRANSFER_READ_#N variables than VAL_#N).
The fallback to VAL_N preserves compatibility for unknown ops.


Full diff: https://github.com/llvm/llvm-project/pull/160820.diff

1 Files Affected:

  • (modified) mlir/utils/generate-test-checks.py (+30-4)
diff --git a/mlir/utils/generate-test-checks.py b/mlir/utils/generate-test-checks.py
index f80a1811f418c..02ed4a24e122c 100755
--- a/mlir/utils/generate-test-checks.py
+++ b/mlir/utils/generate-test-checks.py
@@ -31,6 +31,7 @@
 import os  # Used to advertise this file's name ("autogenerated_note").
 import re
 import sys
+from collections import Counter
 
 ADVERT_BEGIN = "// NOTE: Assertions have been autogenerated by "
 ADVERT_END = """
@@ -45,6 +46,14 @@
 SSA_RE_STR = "[0-9]+|[a-zA-Z$._-][a-zA-Z0-9$._-]*"
 SSA_RE = re.compile(SSA_RE_STR)
 
+# Regex matching `dialect.op_name`, where `dialect` is an upstream MLIR
+# dialect (e.g. `vector.transfer_read`).
+DIALECTS = "acc|affine|amdgpu|amx|arith|arm_neon|arm_sve|arm_sme|async|bufferization|cf|complex|dlti|emitc|\
+    func|gpu|index|irdl|linalg|llvm|math|memref|ml_program|mpi|nvgpu|nvvm|omp|pdl_interp|pdl|ptr|quant|\
+    rocdl|scf|shape|shard|smt|sparse_tensor|tensor|ub|vcix|vector|wasmssa|x86vector|xegpu|xevm|spirv|tosa|\
+    transform"
+SSA_OP_NAME_RE = re.compile(rf'\b(?:{DIALECTS})[.]([a-z_]+)\b')
+
 # Regex matching the left-hand side of an assignment
 SSA_RESULTS_STR = r'\s*(%' + SSA_RE_STR + r')(\s*,\s*(%' + SSA_RE_STR + r'))*\s*='
 SSA_RESULTS_RE = re.compile(SSA_RESULTS_STR)
@@ -63,7 +72,12 @@
 class VariableNamer:
     def __init__(self, variable_names):
         self.scopes = []
+        # Counter for generic FileCHeck names, e.g. VAL_#N
         self.name_counter = 0
+        # Counters for FileCheck names derived from Op names, e.g.
+        # TRANSFER_READ_#N (based on `vector.transfer_read`). Note, there's a
+        # dedicated counter for every Op type present in the input.
+        self.op_name_counter = Counter()
 
         # Number of variable names to still generate in parent scope
         self.generate_in_parent_scope_left = 0
@@ -77,7 +91,7 @@ def generate_in_parent_scope(self, n):
         self.generate_in_parent_scope_left = n
 
     # Generate a substitution name for the given ssa value name.
-    def generate_name(self, source_variable_name, use_ssa_name):
+    def generate_name(self, source_variable_name, use_ssa_name, op_name = ""):
 
         # Compute variable name
         variable_name = self.variable_names.pop(0) if len(self.variable_names) > 0 else ''
@@ -86,8 +100,16 @@ def generate_name(self, source_variable_name, use_ssa_name):
             # a FileCHeck substation string. As FileCheck requires these
             # strings to start with a character, skip MLIR variables starting
             # with a digit (e.g. `%0`).
+            #
+            # The next fallback option is to use the op name, if the
+            # corresponding match succeeds.
+            #
+            # If neither worked, use a generic name: `VAL_#N`.
             if use_ssa_name and source_variable_name[0].isalpha():
                 variable_name = source_variable_name.upper()
+            elif op_name != "":
+                variable_name = op_name.upper() + "_" + str(self.op_name_counter[op_name])
+                self.op_name_counter[op_name] += 1
             else:
                 variable_name = "VAL_" + str(self.name_counter)
                 self.name_counter += 1
@@ -123,6 +145,7 @@ def num_scopes(self):
     def clear_names(self):
         self.name_counter = 0
         self.used_variable_names = set()
+        self.op_name_counter.clear()
 
 class AttributeNamer:
 
@@ -170,8 +193,10 @@ def process_line(line_chunks, variable_namer, use_ssa_name=False, strict_name_re
 
     # Process the rest that contained an SSA value name.
     for chunk in line_chunks:
-        m = SSA_RE.match(chunk)
-        ssa_name = m.group(0) if m is not None else ''
+        ssa = SSA_RE.match(chunk)
+        op_name_with_dialect = SSA_OP_NAME_RE.search(chunk)
+        ssa_name = ssa.group(0) if ssa is not None else ''
+        op_name = op_name_with_dialect.group(1) if op_name_with_dialect is not None else ''
 
         # Check if an existing variable exists for this name.
         variable = None
@@ -185,7 +210,8 @@ def process_line(line_chunks, variable_namer, use_ssa_name=False, strict_name_re
             output_line += "%[[" + variable + "]]"
         else:
             # Otherwise, generate a new variable.
-            variable = variable_namer.generate_name(ssa_name, use_ssa_name)
+            variable = variable_namer.generate_name(ssa_name, use_ssa_name,
+                                                    op_name)
             if strict_name_re:
                 # Use stricter regexp for the variable name, if requested.
                 # Greedy matching may cause issues with the generic '.*'

Copy link

github-actions bot commented Sep 26, 2025

✅ With the latest revision this PR passed the Python code formatter.

…ate-test-checks.py

Motivation
----------
Improve readability and stability of autogenerated CHECK lines by using
operation-aware FileCheck variable names instead of generic VAL_N.

What changes
------------
- When possible, variable names are derived from the MLIR op name, e.g.
  `vector.transfer_read` → `TRANSFER_READ_0`.
- Unknown ops (e.g., from out-of-tree dialects) fall back to the prior
  `VAL_N` scheme.

Before
------
```mlir
  // CHECK: %[[VAL_4:.*]] = vector.transfer_read ...
  // CHECK: %[[VAL_5:.*]] = "val_use"(%[[VAL_4]]) : ...
```

After
-----
```mlir
  // CHECK: %[[TRANSFER_READ_0:.*]] = vector.transfer_read ...
  // CHECK: %[[VAL_1:.*]] = "val_use"(%[[TRANSFER_READ_0]]) : ...
```

Rationale
---------
Using op-derived names (e.g., `TRANSFER_READ_0`) makes tests easier to
read and audit, while remaining more stable across unrelated edits (e.g.
there will always be fewer `TRANSFER_READ_#N` variables than `VAL_#N`).
The fallback to `VAL_N` preserves compatibility for unknown ops.
@banach-space banach-space force-pushed the andrzej/update_generate_checks branch from bd34708 to 70cb5b6 Compare September 26, 2025 09:07
@joker-eph
Copy link
Collaborator

I concerned about this because the script is not meant to provide something readable in a "ready to use" kind of way.

We shouldn't have tests with [[VAL0]] but we equally should not have [[ADDI]] : instead the author should use descriptive names in the context of the test.

Making the script better here may be counter-productive in that it would be encouraging people to not pay more attention to the naming. And reviewer are also less likely to notice when the output of the script is used "as-is".

@michaelmaitland
Copy link
Contributor

michaelmaitland commented Sep 26, 2025

I concerned about this because the script is not meant to provide something readable in a "ready to use" kind of way.

We shouldn't have tests with [[VAL0]] but we equally should not have [[ADDI]] : instead the author should use descriptive names in the context of the test.

Why is that the case? If tests shouldn't have VAL0 nor ADDI, then what should they have?

This is how the llvm update_test_checks script works, and its quite helpful. I am in favor of this change.

I concerned about this because the script is not meant to provide something readable in a "ready to use" kind of way.

Also, the update_test_checks are intended to provide something that is ready to use. Why do we feel differently about the MLIR version? I consider the checks this script generates as "ready to use" -- same goes for folks on the LLVM side.

@joker-eph
Copy link
Collaborator

joker-eph commented Sep 26, 2025

Why is that the case? If tests shouldn't have VAL0 nor ADDI, then what should they have?

Semantically meaningful names.

This is how the llvm update_test_checks script works, and its quite helpful.

It may be helpful for quickly writing tests: these are however fragile and unmaintainable. "Tests should be minimal, and only check what is absolutely necessary."

Also, the update_test_checks are intended to provide something that is ready to use. Why do we feel differently about the MLIR version? I consider the checks this script generates as "ready to use" -- same goes for folks on the LLVM side.

I don't know how LLVM got there, or what is the rationale behind this.
I personally strongly object to having tests are not minimally checking the output (that is remove everything that isn't relevant to the check. This seems fundamentally incompatible with scripting.

@banach-space
Copy link
Contributor Author

banach-space commented Sep 26, 2025

Thanks for taking a look and for the feedback!

I concerned about this because the script is not meant to provide something readable in a "ready to use" kind of way.

Note, the high-level rationale behind this script does not change:

The script is designed to make adding checks to a test case fast, it is *not*
designed to be authoritative about what constitutes a good test!

I personally strongly object to having tests are not minimally checking the output

Agreed - that’s exactly what we document under FileCheck Best Practices. This PR doesn’t change that principle; it only tweaks how placeholders are named.

Making the script better here may be counter-productive in that it would be encouraging people to not pay more attention to the naming.

To me, moving from VAL_0 to something like ADDI_0 is a small step in the right direction, and probably sufficient in many cases. Ultimately, it’s contributors and reviewers who make sure our Testing Guide is followed.

Totally agreed that no script or automation should replace human input. Personally though, I’d much rather see [[ADDI_0]] than [[VAL_0]] when the human input is skipped.

Btw, I’ll also highlight the Testing Guide at LLVM Dev to reinforce it.

Just a heads up: I’ll be offline most of next week, so my responses will be slower than usual.

@joker-eph
Copy link
Collaborator

Can we update

ADVERT_END = """
// The script is designed to make adding checks to
// a test case fast, it is *not* designed to be authoritative
// about what constitutes a good test! The CHECK should be
// minimized and named to reflect the test intent.
"""
to be even more explicit about reducing the test case and improving the naming?
(Also including a link to the testing guideline alongside)

@vzakhari
Copy link
Contributor

vzakhari commented Sep 26, 2025

I like this change! It will be a bit easier to rename the check variables with meaningful names, if they are more verbose initially.

I also support the updates suggested by Mehdi.

@amirBish
Copy link
Contributor

amirBish commented Sep 26, 2025

The change looks good!

I wonder if there is an easy way to match also the unknown dialects in a conservative way, so downstream users can get additional benefits from it, and we will not have a hardcoded list of dialects in the modified script.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this list of dialects? The MLIR syntax is quite strict: there is necessarily a = after the SSA value(s) and then the op name is always formed by dialect.op_name . The RE should be able to find this without hardcoding any list.

func|gpu|index|irdl|linalg|llvm|math|memref|ml_program|mpi|nvgpu|nvvm|omp|pdl_interp|pdl|ptr|quant|\
rocdl|scf|shape|shard|smt|sparse_tensor|tensor|ub|vcix|vector|wasmssa|x86vector|xegpu|xevm|spirv|tosa|\
transform"
SSA_OP_NAME_RE = re.compile(rf"\b(?:{DIALECTS})[.]([a-z_]+)\b")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in the case where the dialect is elided? +1 to removing list of dialects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in the case where the dialect is elided?

In the input IR? Is it ever?

+1 to removing list of dialects.

Done

Copy link
Member

@jpienaar jpienaar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folks should still refine test post generation indeed, but this makes the changes they need to make potentially smaller/often sufficient names and so makes it easier to trim the rest.

@banach-space
Copy link
Contributor Author

Hi folks, I've removed the hard-coded list of dialects and update the note generated by the script, thanks for the suggestions!

I wonder if there is an easy way to match also the unknown dialects in a conservative way, so downstream users can get additional benefits from it, and we will not have a hardcoded list of dialects in the modified script.

The latest revision will work for downstream dialects as well :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants