kaizen: fix duplicate states during closure traversal #494

Merged
timbray merged 37 commits into timbray:main from sayrer:dedup_fix
Feb 28, 2026

Conversation

@sayrer
Contributor

@sayrer sayrer commented Feb 16, 2026

Here's what dedup_fix does relative to main:

Problem: During FA merges (especially with shell-style * patterns),
the same smallTable can end up referenced by multiple faState
nodes. When epsilon closures are computed, these duplicate table
pointers cause the same logical state to appear multiple times,
leading to redundant work during NFA traversal.

Solution — table-pointer dedup in epsilon closure
(epsi_closure.go):

  • Adds a tableSeen map[*smallTable]bool to closureBuffers, tracked
    alongside the existing closureSet (which is keyed by *faState).
  • In traverseEpsilons, before adding a non-epsilon-only state to the
    closure, it checks whether that state's *smallTable has already
    been seen. If so, the state is skipped as a duplicate.
  • This eliminates duplicate NFA states at the source — the closure
    itself is smaller, so every downstream consumer benefits.

Consequence — removal of runtime dedup in traverseNFA (nfa.go):

  • The len(nextStates) > 500 sort+compact block is removed. This was
    a safety valve for "toxically-complex regexps" where epsilon
    loops caused huge nextStates buildups. With duplicates eliminated
    at closure-computation time, this is no longer needed.
  • The cmp and slices imports are dropped accordingly.

In short: instead of deduplicating at traversal time
(expensive, per-event), this branch deduplicates at build time
(once, when patterns are added) by recognizing that distinct
faState pointers sharing the same smallTable are logically
identical.
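The mechanism is easier to see in code. Below is a compressed, hypothetical sketch of the idea only; the real closureBuffers and traverseEpsilons in epsi_closure.go carry more state than this, and the field names here are illustrative.

```go
package main

import "fmt"

// Simplified stand-ins for quamina's types; only the dedup idea matters.
type smallTable struct{ epsilons []*faState }
type faState struct{ table *smallTable }

type closureBuffers struct {
	closureSet map[*faState]bool    // existing membership check, keyed by state pointer
	tableSeen  map[*smallTable]bool // new: dedup by underlying table pointer
}

// traverseEpsilons adds s and everything reachable via epsilon
// transitions to the closure, skipping any state whose *smallTable
// has already been seen: distinct faState pointers sharing a table
// are treated as logically identical.
func traverseEpsilons(bufs *closureBuffers, s *faState, closure *[]*faState) {
	if bufs.closureSet[s] {
		return // already visited this exact state
	}
	bufs.closureSet[s] = true
	if bufs.tableSeen[s.table] {
		return // a state with the same table is already in the closure
	}
	bufs.tableSeen[s.table] = true
	*closure = append(*closure, s)
	for _, next := range s.table.epsilons {
		traverseEpsilons(bufs, next, closure)
	}
}

func main() {
	shared := &smallTable{}
	a := &faState{table: shared}
	b := &faState{table: shared} // distinct pointer, same table
	root := &faState{table: &smallTable{epsilons: []*faState{a, b}}}

	bufs := &closureBuffers{
		closureSet: map[*faState]bool{},
		tableSeen:  map[*smallTable]bool{},
	}
	var closure []*faState
	traverseEpsilons(bufs, root, &closure)
	fmt.Println(len(closure)) // root plus one of {a, b}: prints 2
}
```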

    Timing (sec/op)
  ┌──────────────────────────────────────────┬─────────┬───────────┬──────────────────────────────────────┐
  │                Benchmark                 │  main   │ dedup_fix │                change                │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ CityLots                                 │ 5.878µs │ 5.774µs   │ -1.78% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_LastField                  │ 9.648µs │ 9.428µs   │ -2.27% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_Evaluate_LastField         │ 9.777µs │ 9.535µs   │ -2.47% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_Evaluate_ContextFields     │ 431.4ns │ 425.6ns   │ -1.33% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_Evaluate_MiddleNestedField │ 9.154µs │ 9.029µs   │ -1.37% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ ShellstyleMultiMatch                     │ 32.94µs │ 32.49µs   │ ~1.4% faster (not stat. significant) │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ Others                                   │         │           │ no significant change                │
  └──────────────────────────────────────────┴─────────┴───────────┴──────────────────────────────────────┘
  
  Memory: CityLots dropped from 15 to 14 B/op (-6.67%). Everything
  else unchanged. Zero allocations across the board (as before).

  Summary: Statistically significant 1.3-2.5% speedups across most
  benchmarks, with the geomean improvement at -0.66%. The gains are
  modest but consistent — the table-pointer dedup is eliminating
  redundant work in epsilon closures. The biggest wins are in the
  flattener benchmarks that exercise deeper automata. No regressions
  anywhere.

sayrer and others added 6 commits February 13, 2026 12:25
Nested quantifiers like (([abc]?)*)+ create epsilon loops that cause
duplicate states to compound exponentially. The previous fix used
sort+compact at a threshold of 500 states. This replaces it with an
O(n) in-place dedup using a per-traversal generation counter, lowering
the threshold to 64 to catch growth earlier. Zero overhead for the
common case since the dedup only activates when nextStates exceeds 64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
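The generation-counter dedup described in the commit message above can be sketched roughly as follows. Types and field names are hypothetical, not quamina's actual code; only the O(n) in-place compaction is the point.

```go
package main

import "fmt"

// Hypothetical faState carrying a per-traversal generation marker.
type faState struct{ seenGen uint64 }

var dedupGeneration uint64

// dedupNextStates removes duplicate pointers from nextStates in place
// in O(n), using a generation counter instead of sort+compact. In the
// commit above it only fires when len(nextStates) exceeds 64, so the
// common case pays nothing.
func dedupNextStates(nextStates []*faState) []*faState {
	dedupGeneration++
	out := nextStates[:0] // compact into the same backing array
	for _, s := range nextStates {
		if s.seenGen == dedupGeneration {
			continue // already kept in this pass
		}
		s.seenGen = dedupGeneration
		out = append(out, s)
	}
	return out
}

func main() {
	a, b := &faState{}, &faState{}
	states := []*faState{a, b, a, b, a}
	states = dedupNextStates(states)
	fmt.Println(len(states)) // prints 2
}
```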
…in epsilon closure

Two faState pointers can share the same *smallTable (e.g. the + quantifier
wraps an inner state's table in a new faState). During epsilon closure, both
get added as distinct states producing duplicate destinations that compound
exponentially. Dedup by tracking seen tables in closureBuffers, skipping
states whose table was already visited. This removes the need for the
runtime generation-counter dedup in traverseNFA.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nter dedup

Replace tableSeen map in closureBuffers with a global generation counter
(tableSeenGeneration) and a per-smallTable tableSeenGen field, eliminating
the map allocation and clear overhead.

Cache the active level's set pointer in transmap.activeSet on push(),
removing the tm.levels[tm.depth].set indirection from the hot path in
transmap.add(). This was causing a ~7% regression in ShellstyleMultiMatch
due to the slice index + struct field dereference on every call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve transmap struct conflict: take main's flat [][]*fieldMatcher
design with separate fieldSet on nfaBuffers, consistent with the
auto-resolved traverseNFA and n2dNode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timbray
Owner

timbray commented Feb 17, 2026

Has the benchmark output been updated following all those additional pushes? Because the numbers are kind of disappointing as written.

sayrer and others added 2 commits February 17, 2026 15:44
Track the representative faState per smallTable instead of just a bool.
When a second state shares the same table, compare fieldTransitions:
if they match, skip (fast path); if they differ, include the state
in the closure to preserve correctness.

Add TestTablePointerDedupPreservesFieldTransitions to cover the case
where two faState nodes share a smallTable with different field
matchers, verifying epsilon closure, NFA traversal, and DFA conversion
all preserve both matchers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
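The defense described in the commit message above (a representative faState per table, with a correctness fallback) might look like this minimal sketch. The types are stand-ins, not quamina's real definitions.

```go
package main

import "fmt"

// Simplified stand-ins; fieldMatcher comparison is by pointer here,
// as a stand-in for comparing fieldTransitions.
type fieldMatcher struct{ name string }
type smallTable struct{}
type faState struct {
	table            *smallTable
	fieldTransitions []*fieldMatcher
}

func sameTransitions(a, b []*fieldMatcher) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// addToClosure applies the table-representative defense: the first
// state seen for a table becomes its representative; a later state
// sharing that table is skipped only if its fieldTransitions match.
// If they differ, the state is included to preserve correctness.
func addToClosure(tableRep map[*smallTable]*faState, s *faState, closure *[]*faState) {
	if rep, ok := tableRep[s.table]; ok {
		if sameTransitions(rep.fieldTransitions, s.fieldTransitions) {
			return // true duplicate: fast path, skip
		}
		*closure = append(*closure, s) // invariant violated: keep the state
		return
	}
	tableRep[s.table] = s
	*closure = append(*closure, s)
}

func main() {
	tbl := &smallTable{}
	m := &fieldMatcher{name: "x"}
	s1 := &faState{table: tbl, fieldTransitions: []*fieldMatcher{m}}
	s2 := &faState{table: tbl, fieldTransitions: []*fieldMatcher{m}}           // duplicate
	s3 := &faState{table: tbl, fieldTransitions: []*fieldMatcher{{name: "y"}}} // differs

	rep := map[*smallTable]*faState{}
	var closure []*faState
	addToClosure(rep, s1, &closure)
	addToClosure(rep, s2, &closure)
	addToClosure(rep, s3, &closure)
	fmt.Println(len(closure)) // s1 kept, s2 skipped, s3 kept: prints 2
}
```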
Explain in closureBuffers why tableRep exists, what invariant it
relies on (same *smallTable implies same state in current merge
paths), and how the defense works when the invariant is violated
(include the state, skip recursion). Add function-level doc to
traverseEpsilons describing the dedup and fallback behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 18, 2026

Has the benchmark output been updated following all those additional pushes? Because the numbers are kind of disappointing as written.

Yeah, they aren't good. But they do show we're in the right department, and there is a new test at least. I was looking in the context of DFA conversion https://github.com/sayrer/quamina/tree/lazy_dfa. (I will keep hunting tomorrow, but negative results are valuable too)

sayrer and others added 5 commits February 17, 2026 17:46
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merges 12 multi-wildcard shellstyle + 6 pathological regexp patterns on
the same field, exercising the table-pointer dedup in traverseEpsilons.
Shows 69% speedup vs main (27.7µs → 8.5µs per op).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies match results for the merged shell-style + regexp pattern mix
that exercises table-pointer dedup. Ground truth confirmed identical on
both main and dedup_fix branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the removed >500 sort+compact with a map-based dedup that fires
when nextStates exceeds 64 entries. This prevents combinatorial blowup
when many closure states step to the same next faState, without adding
overhead in the common case.

Also adds heavy-pattern correctness and timeout tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestBreak500Limit exercises 2925 overlapping wildcard patterns with
varied input strategies. TestMeasureNextStates instruments per-byte
NFA traversal to measure state expansion and dedup ratios. Both skip
under -short due to pattern build time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 18, 2026

  ┌──────────────────────┬─────────┬───────────┬────────────────────────┐
  │      Benchmark       │  main   │ dedup_fix │         change         │
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤
  │ CityLots             │ 5.937µs │ 5.774µs   │ -2.75% (p=0.026)       │
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤
  │ 8259Example          │ 5.631µs │ 5.660µs   │ ~ (p=0.093, no change) │
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤
  │ ShellstyleMultiMatch │ 32.17µs │ 32.30µs   │ ~ (no change)          │
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤
  │ PathologicalEpsilon  │ 27.70µs │ 8.54µs    │ -69.18% (p=0.002)      │
  └──────────────────────┴─────────┴───────────┴────────────────────────┘

TestBreak500Limit and TestMeasureNextStates build 2925 patterns,
which is too slow for CI under the race detector. Run them with
go test -tags stress.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 18, 2026

Got to this summary for BenchmarkShellstyleMultiMatch, so I think we want to take some of these benchmarks with the ones in #492, and then revisit. I think this and any nfa2dfa approach will stack up. I don't think we need all the tests in this PR right now, just showing my work to see whether this matters, and I think it does.

"This benchmark only has ~25 patterns. The automaton is small enough that the epsilon closures are trivially sized — no shared-table dedup opportunities.

The table-pointer dedup is targeting a specific scenario: after many merges, spinner states create splice states pointing to the same underlying tables. With 25 patterns there aren't enough merges to create significant duplication.

To find a real win, we'd need a benchmark with hundreds of merged shell-style patterns on the same field. That's what lazy_dfa's ShellstyleManyMatchers does — and it showed gains at higher pattern counts. This branch doesn't have
that benchmark though.

The honest conclusion: the table-pointer dedup on dedup_fix is a correctness/safety improvement (with the defense) but doesn't produce measurable throughput gains on any existing benchmark. The build-time closure optimization is
real but the closures in these tests are already small. Want to add a stress benchmark, or is this branch good as-is?"

@timbray
Copy link
Owner

timbray commented Feb 18, 2026 via email

sayrer and others added 3 commits February 19, 2026 15:31
Tests closure sizes and matching speed across workloads with nested
quantifier regexps and overlapping character classes, where the
table-pointer dedup has the most impact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 20, 2026

Now we're getting somewhere with this test I added called TestEpsilonClosureSizes:

  ┌──────────────────────────┬────────────────────────┬─────────────────────────┬────────────┬─────────────┬────────────────────┬─────────────────────┬─────────┐
  │         Workload         │ Closure Entries (main) │ Closure Entries (dedup) │ Max (main) │ Max (dedup) │ Matches/sec (main) │ Matches/sec (dedup) │ Speedup │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ 6-regexp + 12-shell      │ 4,392                  │ 4,371                   │ 19         │ 13          │ 240K               │ 380K                │ 1.6x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ 20 nested regexps        │ 334                    │ 261                     │ 60         │ 40          │ 1.1M               │ 3.4M                │ 3.1x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ deeply nested            │ 342                    │ 220                     │ 46         │ 26          │ 681K               │ 3.2M                │ 4.7x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ overlapping char classes │ 240                    │ 156                     │ 48         │ 24          │ 1.6M               │ 4.5M                │ 2.9x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ shell + deep overlap     │ 3,488                  │ 3,410                   │ 37         │ 21          │ 204K               │ 297K                │ 1.5x    │
  └──────────────────────────┴────────────────────────┴─────────────────────────┴────────────┴─────────────┴────────────────────┴─────────────────────┴─────────┘

  The deeply-nested workload ((((a?)*b?)*c?)* patterns) shows the biggest win:
  max closure drops from 46→26, closure entries from 342→220 (36% fewer), and
  matching is 4.7x faster. The overlapping-char-classes workload ((([abc]?)*)+,
  (([bcd]?)*)+, etc.) is nearly 3x faster with max closure cut in half (48→24).

  The pattern that triggers dedup_fix most strongly is nested quantifiers over
  overlapping character classes — (([abc]?)*)+ merged with (([bcd]?)*)+ creates
  many epsilon paths converging on shared tables.

Remove exploratory/proof-of-concept test files (dedup_500_test.go,
dedup_bench_test.go, dedup_measure_test.go, dedup_timeout_test.go).

Replace TestEpsilonClosureSizes with TestTablePointerDedup which
asserts max closure bounds and expected match counts. Add
BenchmarkTablePointerDedup using testing.B.Loop() for the same
workloads.

Retained: TestPathologicalCorrectness (dedup_correctness_test.go)
and TestTablePointerDedupPreservesFieldTransitions (epsi_closure_test.go).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 20, 2026

Here are the comparison results (median of 3 runs, 3s benchtime each):

go test -bench=BenchmarkTablePointerDedup -benchtime=3s -count=3 -run='^$'

  ┌──────────────────────────┬──────────────┬───────────────────┬─────────┐
  │         Workload         │ main (ns/op) │ dedup_fix (ns/op) │ Speedup │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ 6-regexps-12-shell       │        8,874 │             7,900 │   1.12x │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ 20-nested-regexps        │        2,006 │               902 │   2.22x │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ deeply-nested            │        3,349 │               941 │   3.56x │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ overlapping-char-classes │        1,520 │               674 │   2.25x │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ shell+deep-overlap       │       12,243 │            10,260 │   1.19x │
  └──────────────────────────┴──────────────┴───────────────────┴─────────┘


  All workloads are 0 allocs on both branches. The dedup_fix branch shows
  significant improvement on nested quantifier regexp workloads:

  - deeply-nested: 3.56x faster (3,349 → 941 ns/op)
  - 20-nested-regexps: 2.22x faster (2,006 → 902 ns/op)
  - overlapping-char-classes: 2.25x faster (1,520 → 674 ns/op)
  - Mixed workloads with shell patterns show more modest 12-19% gains since
    shell patterns have zero table sharing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestTablePointerDedup already covers the same patterns and verifies
both match correctness and structural properties.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timbray
Owner

timbray commented Feb 22, 2026

This PR is starting to feel like buying a PC in the 1990s - wait a little while and next month's model will be way better. My Quamina time has been rather limited, so I haven't been keeping a close eye on this. Let me know when you think you've got it where you want it.

Replace the local map in the dedup pass with generation-based tracking
using the existing lastVisitedGen field. The NFA walk compares against
a snapshotted generation while the dedup pass increments and compares
against the current value, so they share the field without interference.
This eliminates the per-closure map allocation while keeping only one
generation field (plus closureRep pointer) on smallTable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 24, 2026

OK, will give it a close review and probably take it. But could you check one thing? Do you really need two generation fields in smallTable? I can guarantee you don't need more than one for build-time operations because those are guaranteed single-threaded.

Facts. We can jam this into less memory with no perf loss.

  ┌────────────────────────────────────────────┬─────────┬─────────┬───────────────┐
  │                 Benchmark                  │  main   │ branch  │  improvement  │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ 8259Example                                │ 5.683µs │ 5.676µs │ ~ (no change) │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ ShellStyleBuildTime                        │ 48.02ms │ 48.21ms │ ~ (no change) │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/6-regexps-12-shell       │ 8.804µs │ 7.747µs │ -12%          │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/20-nested-regexps        │ 1968ns  │ 893ns   │ -55%          │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/deeply-nested            │ 3322ns  │ 934ns   │ -72%          │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/overlapping-char-classes │ 1500ns  │ 666ns   │ -56%          │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/shell+deep-overlap       │ 12.43µs │ 10.04µs │ -19%          │
  └────────────────────────────────────────────┴─────────┴─────────┴───────────────┘

  Zero regressions on existing benchmarks, 12-72% faster on the dedup-heavy workloads.
   Geomean -37%.

@sayrer
Contributor Author

sayrer commented Feb 24, 2026

Clear regression:

  ┌──────────────┬──────────────────────────┬──────────────────────────┐
  │              │           main           │          branch          │
  ├──────────────┼──────────────────────────┼──────────────────────────┤
  │ Patterns/sec │ 3502-3559 (median ~3524) │ 2944-2967 (median ~2955) │
  └──────────────┴──────────────────────────┴──────────────────────────┘

  That's a consistent ~16% drop in build speed. The NFA walk revisits caused by the
  shared generation field are adding real overhead during AddPattern. The local map
  version would avoid this. Want me to revert to that?

@sayrer
Contributor Author

sayrer commented Feb 24, 2026

This seems to cost on patterns/sec. I will look again later.

sayrer and others added 2 commits February 24, 2026 08:49
Reverting d2563c6 and 684a093 which merged the dedup generation tracking
into lastVisitedGen. Using a separate closureRepGen field avoids coupling
the NFA walk and dedup pass, with no measurable performance difference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the map[*faState]bool used for closure set membership in
traverseEpsilons with a generation counter on faState. This eliminates
map allocations and hash operations in the hot path, improving
Patterns/sec by ~55% (3170 -> 4890) with no impact on match throughput.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
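The closure-set optimization in the commit message above replaces a per-call map with a field compare. A minimal sketch of the pattern, with hypothetical names (the real field is described as closureSetGen-style on faState):

```go
package main

import "fmt"

// Hypothetical faState; membership in the current closure is a field
// compare instead of a map[*faState]bool lookup allocated per call.
type faState struct{ closureSetGen uint64 }

var closureGeneration uint64

// collectClosure keeps each state at most once per call. Bumping the
// global generation at entry invalidates all prior markers at zero
// cost: no map alloc, grow, or rehash in the merge hot path.
func collectClosure(states []*faState) []*faState {
	closureGeneration++
	var closure []*faState
	for _, s := range states {
		if s.closureSetGen == closureGeneration {
			continue // already in this closure
		}
		s.closureSetGen = closureGeneration
		closure = append(closure, s)
	}
	return closure
}

func main() {
	a, b := &faState{}, &faState{}
	fmt.Println(len(collectClosure([]*faState{a, b, a}))) // prints 2
	// A fresh call starts a new generation, so states are visitable again.
	fmt.Println(len(collectClosure([]*faState{a, b}))) // prints 2
}
```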
@sayrer
Contributor Author

sayrer commented Feb 24, 2026

The patch here gets to within ~2-3% of main in patterns/sec. That's just the stuff we've been talking about.

The last commit here massively improves patterns/sec with no regression on other things.

Merging the generations causes a ~18% regression in patterns/sec.

@timbray
Owner

timbray commented Feb 24, 2026

Merging the generations causes a ~18% regression in patterns/sec.

By "merging the generations" you mean using the same smallTable field for deduping both in epsi_closure and FA construction? That is shocking, makes absolutely no sense to me. Any idea why?

@sayrer
Contributor Author

sayrer commented Feb 24, 2026

Here's a good explanation from Claude. The answer is we do need both closureRepGen and lastVisitedGen, which you asked about above. For the big win, that was me looking at a flamegraph, not Claude.

"Sure. Here's the walkthrough:

Setup: epsilonClosure increments closureGeneration once and snapshots it into bufs.generation. The NFA walk (closureForNfa) then uses table.lastVisitedGen == bufs.generation to skip already-visited tables.

The problem: As the NFA walk progresses, it calls closureForStateWithBufs for each state. That function increments closureGeneration twice (once for the closure set pass, once for the dedup pass). In the
merged version, the dedup pass writes:

s.table.lastVisitedGen = closureGeneration // now a NEWER value than bufs.generation

This overwrites the value the NFA walk originally set. So when closureForNfa later encounters that same table (e.g. via a different path in the graph), it checks:

if table.lastVisitedGen == bufs.generation // false! dedup overwrote it with a newer value

The check fails, so the NFA walk re-enters the table. It re-iterates all table.steps and table.epsilons, calling closureForStateWithBufs on each (which early-returns since epsilonClosure is already set) and
recursing into child tables. That traversal overhead is wasted work.

With separate closureRepGen: The dedup pass writes to s.table.closureRepGen instead, never touching lastVisitedGen. The NFA walk's visited markers stay intact, so no revisits happen.

With 7409 smallTables and closures up to 900, the number of corrupted visited markers — and resulting re-traversals — adds up to that ~18% regression."
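The interference described in that walkthrough can be modeled in a few lines. This is a toy reconstruction, not quamina code; the field names mirror the ones discussed above, and only the clobbering mechanism is the point.

```go
package main

import "fmt"

// Minimal model of the shared-vs-separate generation field question.
type table struct {
	lastVisitedGen uint64 // visited marker for the outer NFA walk
	closureRepGen  uint64 // separate marker for the inner dedup pass
}

var generation uint64

// countVisits walks one table twice (as if reached via two graph paths),
// with an inner dedup pass in between that bumps the global generation.
// If shareField is true, the inner pass writes to lastVisitedGen (the
// merged design), clobbering the outer walk's visited marker.
func countVisits(shareField bool) int {
	t := &table{}
	generation++
	outerGen := generation // outer walk snapshots its generation
	visits := 0

	visitIfUnseen := func() {
		if t.lastVisitedGen == outerGen {
			return // marker intact: skip the revisit
		}
		t.lastVisitedGen = outerGen
		visits++
	}

	visitIfUnseen() // first encounter

	generation++ // inner dedup pass runs with a newer generation
	if shareField {
		t.lastVisitedGen = generation // overwrites the outer marker
	} else {
		t.closureRepGen = generation // separate field: marker survives
	}

	visitIfUnseen() // second encounter via another path
	return visits
}

func main() {
	fmt.Println(countVisits(true))  // merged field: table re-visited, prints 2
	fmt.Println(countVisits(false)) // separate fields: visited once, prints 1
}
```

Multiplied across thousands of smallTables, those spurious revisits are the wasted traversal work behind the regression.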

Same approach as the closure set optimization: use closureSetGen on
faState instead of allocating a map[*faState]bool per call. Eliminates
map alloc/grow/rehash overhead in the merge hot path, improving
Patterns/sec by ~18% (4890 -> 5750).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 24, 2026

Figured I might as well get the other one of these. Now we have faster patterns/sec with better performance on the tougher regex cases, detected at build time.

 ~5750 Patterns/sec, up from ~4890. Another ~18% improvement on top of the previous change.

  ┌──────────────┬───────┬─────────────┬───────────────┐
  │    Metric    │ main  │ prev commit │ + this change │
  ├──────────────┼───────┼─────────────┼───────────────┤
  │ Patterns/sec │ ~3170 │ ~4890       │ ~5750         │
  ├──────────────┼───────┼─────────────┼───────────────┤
  │ vs main      │ —     │ +54%        │ +81%          │
  └──────────────┴───────┴─────────────┴───────────────┘

  Events/sec unchanged (~40K).

@timbray
Owner

timbray commented Feb 25, 2026

So, what we have is a significant improvement in FA build time for big complex patterns like in TestShellStyleBuildTime and a just-barely-significant slowdown in match time, both in TestShellStyleBuildTime and Benchmark8259Example. It would be nice if we could have the first but not the second. Now I'm going to look closer.

qnew/494> git checkout main; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; git checkout dedup_fix ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec'    
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
Adds/sec 177.9
189175/sec
Adds/sec 181.8
190162/sec
Adds/sec 178.9
190067/sec
Switched to branch 'dedup_fix'
    v2_bench_test.go:60: Adds/sec 176.7
    v2_bench_test.go:73: 187153/sec
    v2_bench_test.go:60: Adds/sec 179.0
    v2_bench_test.go:73: 187905/sec
    v2_bench_test.go:60: Adds/sec 178.2
    v2_bench_test.go:73: 186033/sec
qnew/494> git checkout main ; go test -run=TestShellStyleBuildTime | grep /sec ; go test -run=TestShellStyleBuildTime | grep /sec; go test -run=TestShellStyleBuildTime | grep /sec; git checkout dedup_fix | grep /sec; go test -run=TestShellStyleBuildTime | grep /sec; go test -run=TestShellStyleBuildTime | grep /sec; go test -run=TestShellStyleBuildTime| grep /sec
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
Patterns/sec: 3435.2
Huge-machine events/sec: 48398.7
Patterns/sec: 3452.7
Huge-machine events/sec: 47803.8
Patterns/sec: 3450.6
Huge-machine events/sec: 47277.9
Switched to branch 'dedup_fix'
Patterns/sec: 5881.2
Huge-machine events/sec: 45620.2
Patterns/sec: 5955.1
Huge-machine events/sec: 46486.5
Patterns/sec: 5915.4
Huge-machine events/sec: 46392.3

@sayrer
Contributor Author

sayrer commented Feb 25, 2026

I think the match-time ones will get crushed if we start doing nfa2dfa. So, the idea is that build time pays off, but not in the current match algorithm. Let's not forget this one covers the tough regex ones too. Not a current use-case, since it was just added, but really difficult to cover. I think it's a local maximum, but I could be wrong.

Owner

@timbray timbray left a comment


Nothing too material here, mostly code pedantry. Except that I realized I don't understand the generation-based deduping. Consider some of this a note to myself. Will be back for more later.

@sayrer
Contributor Author

sayrer commented Feb 26, 2026

oh jeez. Tough crowd. I'll look at them, but please do remember that this one covers algorithmic explosions in regexen. not only getting the best performance on simple ones, but also not letting the wild ones spin out. "Threading the needle" or "Scylla and Charybdis".

@sayrer
Contributor Author

sayrer commented Feb 26, 2026

We have BenchmarkShellStyleBuildTime so we can go after that events/sec regression with benchstat.

Committed dedup_fix is 2.04% slower than main on events/sec (49.52ms vs 48.53ms
per op), and it's statistically significant (p=0.000, n=10).

So the committed dedup_fix branch with all generation fields on the structs
actually hurts event-matching performance slightly — the extra 24 bytes on
smallTable (closureRepGen + closureRep) and 8 bytes on faState
(closureSetGen) are making things measurably worse, not better.  

We can move that around, and get a win on both. But the Patterns/sec improvement won't be as big. Maybe we should check my lazy_dfa fork and see if that trade-off matters there.

@sayrer
Contributor Author

sayrer commented Feb 26, 2026

Hmm, this is perplexing. The NFA version is so fast now that the DFA is not worth doing. I would recommend taking this patch with its 2% regression on matching (that is now twice as fast) and then we can inspect more.

@sayrer
Contributor Author

sayrer commented Feb 26, 2026

OK, yeah. The DFA version can wing this one, so it's not worth debating 2% on the already really fast NFA version.

  ┌───────────────────────────────┬─────────┬──────────┬─────────────┐
  │           Benchmark           │  main   │ lazy_dfa │   Change    │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ ShellstyleMultiMatch          │ 31.80µs │ 10.29µs  │ -67.65%     │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ CityLots                      │ 5.862µs │ 5.761µs  │ -1.72%      │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ NumberMatching                │ 881.2ns │ 853.1ns  │ -3.18%      │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ JsonFlattener (eval, context) │ 426.6ns │ 407.2ns  │ -4.55%     │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ JsonFlattener (eval, last)    │ 9.845µs │ 9.682µs  │ -1.65%      │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ JsonFlattener (eval, middle)  │ 9.158µs │ 9.059µs  │ -1.08%      │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ 8259Example                   │ 5.695µs │ 4.813µs  │ ~-15% (n=1) │
  └───────────────────────────────┴─────────┴──────────┴─────────────┘

  The lazy DFA branch shows a ~68% speedup on ShellstyleMultiMatch (the primary
  target), with small improvements across other benchmarks and 0 allocs/op on
  all hot paths. No regressions.

My work is here: https://github.com/sayrer/quamina/tree/lazy_dfa

It's all messy, just intended to see if the CS and implementations line up. They seem to. (Claude said "yeah, the NFA is already really good, this will be challenging")

@timbray commented Feb 26, 2026

You explained once but I can't find it, so pardon the repeat: How are you testing DFA performance?

@sayrer commented Feb 26, 2026

Quamina's core matching engine is an NFA (nondeterministic finite automaton). For every byte of input, it tracks all active NFA states simultaneously — for shellstyle patterns like `*abc*`, the number of active states grows combinatorially as partial matches overlap.

The lazy DFA converts this on the fly: the set of NFA states you're in becomes a single DFA state. Once you've seen the transition (stateSet, byte) → nextStateSet, you cache it. The next time you see the same situation, it's a single pointer lookup instead of stepping every NFA state.

Key design choices:

  1. Lazy, not eager — We don't precompute all possible DFA states (which can be exponential). We only build states we actually visit during matching.
  2. Per-goroutine caches — No locks, no contention. Each goroutine builds its own DFA as it matches.
  3. SIMD transition lookup — Cached transitions use bytes.IndexByte, which is backed by SIMD-accelerated assembly on amd64/arm64. This makes the hot path very fast.
  4. Memory budget (our latest work) — Instead of a fixed state count (3400), we use an 8MB byte budget. A simple pattern like foo produces DFA states with ~2 NFA states
    each (~160 bytes), so we cache tens of thousands. A complex pattern like abcdefgh* produces states with hundreds of NFA states (~3KB+), so we cache fewer. The budget
    adapts automatically.
  5. Graceful degradation — When the cache fills, we fall back to NFA traversal for cache misses but snap back to the cached DFA whenever we hit a known state. Existing cached
    transitions still work at full speed.
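The caching scheme in points 1–5 can be sketched as a toy (all names here are hypothetical, not Quamina's actual types):

```go
package main

import (
	"fmt"
	"sort"
)

// nfaState: transitions[b] lists the NFA states reachable on byte b.
type nfaState struct {
	transitions map[byte][]int
	accepting   bool
}

// dfaState represents one *set* of NFA states, with its transitions
// filled in lazily, the first time each one is taken.
type dfaState struct {
	nfaSet    []int
	next      map[byte]*dfaState
	accepting bool
}

type lazyDFA struct {
	nfa   []nfaState
	cache map[string]*dfaState // canonical NFA-set -> interned DFA state
}

func setKey(set []int) string { return fmt.Sprint(set) }

// stateFor interns the DFA state for a sorted set of NFA states.
func (d *lazyDFA) stateFor(set []int) *dfaState {
	k := setKey(set)
	if s, ok := d.cache[k]; ok {
		return s
	}
	s := &dfaState{nfaSet: set, next: map[byte]*dfaState{}}
	for _, id := range set {
		if d.nfa[id].accepting {
			s.accepting = true
		}
	}
	d.cache[k] = s
	return s
}

// step returns the DFA state after reading b, computing and caching the
// transition on first use; later visits are a single map lookup instead
// of stepping every NFA state in the set.
func (d *lazyDFA) step(s *dfaState, b byte) *dfaState {
	if n, ok := s.next[b]; ok {
		return n // warm path: no NFA set tracking at all
	}
	seen := map[int]bool{}
	var out []int
	for _, id := range s.nfaSet {
		for _, t := range d.nfa[id].transitions[b] {
			if !seen[t] {
				seen[t] = true
				out = append(out, t)
			}
		}
	}
	sort.Ints(out) // canonical order, so equal sets share one DFA state
	n := d.stateFor(out)
	s.next[b] = n
	return n
}

func main() {
	// Tiny NFA matching the input "ab": 0 -a-> 1 -b-> 2 (accepting).
	nfa := []nfaState{
		{transitions: map[byte][]int{'a': {1}}},
		{transitions: map[byte][]int{'b': {2}}},
		{accepting: true},
	}
	d := &lazyDFA{nfa: nfa, cache: map[string]*dfaState{}}
	s := d.stateFor([]int{0})
	for _, b := range []byte("ab") {
		s = d.step(s, b)
	}
	fmt.Println(s.accepting)
}
```

The real design adds the per-goroutine ownership, memory budget, and NFA fallback described above; this only shows the core interning-plus-lazy-transition idea.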

Why it pays off

The ShellstyleMultiMatch benchmark tells the story: 31.8µs → 10.3µs (68% faster). This is a warm-cache workload where the same value patterns repeat — exactly the real-world
scenario of an event-matching service processing a stream of events.

The 0 allocs/op on all hot paths means the lazy DFA adds no GC pressure once the cache is warm.

The cost is memory (up to 8MB per field per goroutine) and the one-time overhead of building each DFA state on first visit. But once built, you get DFA speed: O(1) work per
byte, no set tracking, no epsilon closure expansion.

Where it pays off most

  • Repeated matching against the same patterns (event routing, log filtering) — the cache warms up quickly and subsequent matches are near-DFA speed
  • Shellstyle patterns with multiple wildcards — these create large NFA state sets where the per-byte cost difference between NFA and DFA is largest
  • High-throughput workloads — the 0-alloc hot path means throughput scales without GC pauses

Where it helps less: one-shot matches against unique inputs (cache never warms), or simple exact-match patterns (NFA is already fast, few states to track).

The architecture is sound. The main lever now is tuning — the 8MB budget, and potentially pattern-specific heuristics for when lazy DFA is worth enabling vs. just running
the NFA directly.

──────────────────────

@sayrer commented Feb 26, 2026

go test -bench=BenchmarkShellstyleMultiMatch -benchmem -count=6 -run='^$'

@sayrer commented Feb 26, 2026

But none of this work is in this patch. It's only to point out that we should not sweat 2%, since we can smash it.

@timbray commented Feb 26, 2026

OK, I got it, I had to do some hand-tracing. The closureRepGen just tells you whether you've seen this or not in this particular closure. I think some of the commentary is misleading. Anyhow, sorry for persistently lagging behind you, I have a bunch of boring year-end clerical work occupying bandwidth. I did drop a couple of suggestions in the N-1 review and sometime probably before this time tomorrow I'll finish my pass through. Unless I find something else that bothers me, I guess I'm bought in on the core idea of this PR. But wow, so many commits.

BTW you're still going to have a tough time selling lazy DFAs because, as I've said before, building DFAs once per thread is horribly wasteful of memory. In my experience with Quamina's predecessor, basically all the production deployments were multi-threaded.

Need to get this landed for me to make any progress on DFA budgeting because every freaking piece of nontrivial FA-generating code needs to be refactored to pass a MemoryHeadroom counter back and forth.
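The generation-based deduping discussed above can be sketched roughly like this (the field name loosely follows the PR's closureGen; the surrounding types are hypothetical):

```go
package main

import "fmt"

// faState carries a generation stamp instead of sitting in a "seen" set.
type faState struct {
	closureGen uint64 // generation at which this state last joined a closure
}

type closureBuffers struct {
	gen uint64 // the current closure's generation
}

// begin starts a new closure; bumping the counter "clears" every
// state's membership without touching any state or any map.
func (b *closureBuffers) begin() { b.gen++ }

// addIfNew reports whether s was newly added to the current closure.
func (b *closureBuffers) addIfNew(s *faState) bool {
	if s.closureGen == b.gen {
		return false // already in this closure: duplicate, skip it
	}
	s.closureGen = b.gen
	return true
}

func main() {
	var bufs closureBuffers
	s := &faState{}

	bufs.begin()
	fmt.Println(bufs.addIfNew(s), bufs.addIfNew(s)) // first add, then dedup

	bufs.begin() // next closure: no clearing pass needed
	fmt.Println(bufs.addIfNew(s))
}
```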

@sayrer commented Feb 26, 2026

Anyhow, sorry

I don't understand Canadian :)

building DFAs once per thread is horribly wasteful of memory. In my experience with Quamina's predecessor, basically all the production deployments were multi-threaded.

Oh, that's in there. I do a lot of sloppy ones to see if an idea is worth exploring. I have the memory footprint one already, but it needs the good NFA and eager DFA first. The problem naturally arises when you try to do a memory budget.

You're right to insist on clean patches (do squash-and-merge), but I check them out with messy ones before polishing a turd.

It's also OK to insist that the main branch is pristine. But the natural result will be PRs and branches that iterate way faster, unless people don't share their work.

(Ed: the squash-and-merge option is available in the green button at the bottom; it makes this PR one commit in Quamina's git history)

@timbray left a comment

OK, I'm done, if you could glance at the non-resolved review issues.

@sayrer commented Feb 28, 2026

OK, these all seem reasonable. I'll get them in the morning.

For the lazy DFA design, you can get it to be a global cache, and this is better in some ways. Once one goroutine discovers an optimization, everyone gets it. I hacked it out in that branch I would not submit as a PR. I think there would need to be an intermediate stage where the lazy DFA is wastefully per-goroutine (as you point out). But this is just an investigation: can you get it faster by doing this? I don't know for sure because the NFA is getting better, but I suspect so. Once that is proven, then there's a tricky global cache management problem. I think it's not possible to aim for the end state right away, since it will be unreviewable. But that's where my head is. It has to be in distinct steps: first verify that the DFA actually wins while wasting memory, then see if it still wins with synchronization wrappers.

sayrer and others added 2 commits February 28, 2026 08:43
- Move closureBuffers type declaration above first use, add field comments
- Add comment documenting order-dependent fieldTransitions comparison
- Rename closureForStateWithBufs → closureForState (test helper → closureForStateNoBufs)
- Rename flattenEpsilonTargets/flattenCollect → simplifySplices/simplifyCollect
- Rename closureRepGen → closureGen (symmetrical with lastVisitedGen)
- Update dedup test to use overlapping fieldTransitions in different order

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timbray timbray merged commit d915467 into timbray:main Feb 28, 2026
7 checks passed