kaizen: fix duplicate states during closure traversal #494

Merged
timbray merged 37 commits into timbray:main from sayrer:dedup_fix
Feb 28, 2026

Conversation

@sayrer
Contributor

@sayrer sayrer commented Feb 16, 2026

Here's what dedup_fix does relative to main:

Problem: During FA merges (especially with shell-style * patterns),
the same smallTable can end up referenced by multiple faState
nodes. When epsilon closures are computed, these duplicate table
pointers cause the same logical state to appear multiple times,
leading to redundant work during NFA traversal.

Solution — table-pointer dedup in epsilon closure
(epsi_closure.go):

  • Adds a tableSeen map[*smallTable]bool to closureBuffers, tracked
    alongside the existing closureSet (which is keyed by *faState).
  • In traverseEpsilons, before adding a non-epsilon-only state to the
    closure, it checks whether that state's *smallTable has already
    been seen. If so, the state is skipped as a duplicate.
  • This eliminates duplicate NFA states at the source — the closure
    itself is smaller, so every downstream consumer benefits.

Consequence — removal of runtime dedup in traverseNFA (nfa.go):

  • The len(nextStates) > 500 sort+compact block is removed. This was
    a safety valve for "toxically-complex regexps" where epsilon
    loops caused huge nextStates buildups. With duplicates eliminated
    at closure-computation time, this is no longer needed.
  • The cmp and slices imports are dropped accordingly.

In short: instead of deduplicating at traversal time
(expensive, per-event), this branch deduplicates at build time
(once, when patterns are added) by recognizing that distinct
faState pointers sharing the same smallTable are logically
identical.
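The mechanism is easier to see in code. Below is a compressed, hypothetical sketch of the idea only; the real closureBuffers and traverseEpsilons in epsi_closure.go carry more state than this, and the field names here are illustrative.

```go
package main

import "fmt"

// Simplified stand-ins for quamina's types; only the dedup idea matters.
type smallTable struct{ epsilons []*faState }
type faState struct{ table *smallTable }

type closureBuffers struct {
	closureSet map[*faState]bool    // existing membership check, keyed by state pointer
	tableSeen  map[*smallTable]bool // new: dedup by underlying table pointer
}

// traverseEpsilons adds s and everything reachable via epsilon
// transitions to the closure, skipping any state whose *smallTable
// has already been seen: distinct faState pointers sharing a table
// are treated as logically identical.
func traverseEpsilons(bufs *closureBuffers, s *faState, closure *[]*faState) {
	if bufs.closureSet[s] {
		return // already visited this exact state
	}
	bufs.closureSet[s] = true
	if bufs.tableSeen[s.table] {
		return // a state with the same table is already in the closure
	}
	bufs.tableSeen[s.table] = true
	*closure = append(*closure, s)
	for _, next := range s.table.epsilons {
		traverseEpsilons(bufs, next, closure)
	}
}

func main() {
	shared := &smallTable{}
	a := &faState{table: shared}
	b := &faState{table: shared} // distinct pointer, same table
	root := &faState{table: &smallTable{epsilons: []*faState{a, b}}}

	bufs := &closureBuffers{
		closureSet: map[*faState]bool{},
		tableSeen:  map[*smallTable]bool{},
	}
	var closure []*faState
	traverseEpsilons(bufs, root, &closure)
	fmt.Println(len(closure)) // root plus one of {a, b}: prints 2
}
```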

    Timing (sec/op)
  ┌──────────────────────────────────────────┬─────────┬───────────┬──────────────────────────────────────┐
  │                Benchmark                 │  main   │ dedup_fix │                change                │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ CityLots                                 │ 5.878µs │ 5.774µs   │ -1.78% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_LastField                  │ 9.648µs │ 9.428µs   │ -2.27% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_Evaluate_LastField         │ 9.777µs │ 9.535µs   │ -2.47% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_Evaluate_ContextFields     │ 431.4ns │ 425.6ns   │ -1.33% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_Evaluate_MiddleNestedField │ 9.154µs │ 9.029µs   │ -1.37% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ ShellstyleMultiMatch                     │ 32.94µs │ 32.49µs   │ ~1.4% faster (not stat. significant) │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ Others                                   │         │           │ no significant change                │
  └──────────────────────────────────────────┴─────────┴───────────┴──────────────────────────────────────┘
  
  Memory: CityLots dropped from 15 to 14 B/op (-6.67%). Everything
  else unchanged. Zero allocations across the board (as before).

  Summary: Statistically significant 1.3-2.5% speedups across most
  benchmarks, with the geomean improvement at -0.66%. The gains are
  modest but consistent — the table-pointer dedup is eliminating
  redundant work in epsilon closures. The biggest wins are in the
  flattener benchmarks that exercise deeper automata. No regressions
  anywhere.

sayrer and others added 6 commits February 13, 2026 12:25
Nested quantifiers like (([abc]?)*)+ create epsilon loops that cause
duplicate states to compound exponentially. The previous fix used
sort+compact at a threshold of 500 states. This replaces it with an
O(n) in-place dedup using a per-traversal generation counter, lowering
the threshold to 64 to catch growth earlier. Zero overhead for the
common case since the dedup only activates when nextStates exceeds 64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
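The generation-counter dedup described in the commit message above can be sketched roughly as follows. Types and field names are hypothetical, not quamina's actual code; only the O(n) in-place compaction is the point.

```go
package main

import "fmt"

// Hypothetical faState carrying a per-traversal generation marker.
type faState struct{ seenGen uint64 }

var dedupGeneration uint64

// dedupNextStates removes duplicate pointers from nextStates in place
// in O(n), using a generation counter instead of sort+compact. In the
// commit above it only fires when len(nextStates) exceeds 64, so the
// common case pays nothing.
func dedupNextStates(nextStates []*faState) []*faState {
	dedupGeneration++
	out := nextStates[:0] // compact into the same backing array
	for _, s := range nextStates {
		if s.seenGen == dedupGeneration {
			continue // already kept in this pass
		}
		s.seenGen = dedupGeneration
		out = append(out, s)
	}
	return out
}

func main() {
	a, b := &faState{}, &faState{}
	states := []*faState{a, b, a, b, a}
	states = dedupNextStates(states)
	fmt.Println(len(states)) // prints 2
}
```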
…in epsilon closure

Two faState pointers can share the same *smallTable (e.g. the + quantifier
wraps an inner state's table in a new faState). During epsilon closure, both
get added as distinct states producing duplicate destinations that compound
exponentially. Dedup by tracking seen tables in closureBuffers, skipping
states whose table was already visited. This removes the need for the
runtime generation-counter dedup in traverseNFA.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nter dedup

Replace tableSeen map in closureBuffers with a global generation counter
(tableSeenGeneration) and a per-smallTable tableSeenGen field, eliminating
the map allocation and clear overhead.

Cache the active level's set pointer in transmap.activeSet on push(),
removing the tm.levels[tm.depth].set indirection from the hot path in
transmap.add(). This was causing a ~7% regression in ShellstyleMultiMatch
due to the slice index + struct field dereference on every call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve transmap struct conflict: take main's flat [][]*fieldMatcher
design with separate fieldSet on nfaBuffers, consistent with the
auto-resolved traverseNFA and n2dNode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timbray
Owner

timbray commented Feb 17, 2026

Has the benchmark output been updated following all those additional pushes? Because the numbers are kind of disappointing as written.

sayrer and others added 2 commits February 17, 2026 15:44
Track the representative faState per smallTable instead of just a bool.
When a second state shares the same table, compare fieldTransitions:
if they match, skip (fast path); if they differ, include the state
in the closure to preserve correctness.

Add TestTablePointerDedupPreservesFieldTransitions to cover the case
where two faState nodes share a smallTable with different field
matchers, verifying epsilon closure, NFA traversal, and DFA conversion
all preserve both matchers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
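The defense described in the commit message above (a representative faState per table, with a correctness fallback) might look like this minimal sketch. The types are stand-ins, not quamina's real definitions.

```go
package main

import "fmt"

// Simplified stand-ins; fieldMatcher comparison is by pointer here,
// as a stand-in for comparing fieldTransitions.
type fieldMatcher struct{ name string }
type smallTable struct{}
type faState struct {
	table            *smallTable
	fieldTransitions []*fieldMatcher
}

func sameTransitions(a, b []*fieldMatcher) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// addToClosure applies the table-representative defense: the first
// state seen for a table becomes its representative; a later state
// sharing that table is skipped only if its fieldTransitions match.
// If they differ, the state is included to preserve correctness.
func addToClosure(tableRep map[*smallTable]*faState, s *faState, closure *[]*faState) {
	if rep, ok := tableRep[s.table]; ok {
		if sameTransitions(rep.fieldTransitions, s.fieldTransitions) {
			return // true duplicate: fast path, skip
		}
		*closure = append(*closure, s) // invariant violated: keep the state
		return
	}
	tableRep[s.table] = s
	*closure = append(*closure, s)
}

func main() {
	tbl := &smallTable{}
	m := &fieldMatcher{name: "x"}
	s1 := &faState{table: tbl, fieldTransitions: []*fieldMatcher{m}}
	s2 := &faState{table: tbl, fieldTransitions: []*fieldMatcher{m}}           // duplicate
	s3 := &faState{table: tbl, fieldTransitions: []*fieldMatcher{{name: "y"}}} // differs

	rep := map[*smallTable]*faState{}
	var closure []*faState
	addToClosure(rep, s1, &closure)
	addToClosure(rep, s2, &closure)
	addToClosure(rep, s3, &closure)
	fmt.Println(len(closure)) // s1 kept, s2 skipped, s3 kept: prints 2
}
```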
Explain in closureBuffers why tableRep exists, what invariant it
relies on (same *smallTable implies same state in current merge
paths), and how the defense works when the invariant is violated
(include the state, skip recursion). Add function-level doc to
traverseEpsilons describing the dedup and fallback behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 18, 2026

Has the benchmark output been updated following all those additional pushes? Because the numbers are kind of disappointing as written.

Yeah, they aren't good. But they do show we're in the right department, and there is a new test at least. I was looking in the context of DFA conversion https://github.com/sayrer/quamina/tree/lazy_dfa. (I will keep hunting tomorrow, but negative results are valuable too)

sayrer and others added 5 commits February 17, 2026 17:46
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merges 12 multi-wildcard shellstyle + 6 pathological regexp patterns on
the same field, exercising the table-pointer dedup in traverseEpsilons.
Shows 69% speedup vs main (27.7µs → 8.5µs per op).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies match results for the merged shell-style + regexp pattern mix
that exercises table-pointer dedup. Ground truth confirmed identical on
both main and dedup_fix branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the removed >500 sort+compact with a map-based dedup that fires
when nextStates exceeds 64 entries. This prevents combinatorial blowup
when many closure states step to the same next faState, without adding
overhead in the common case.

Also adds heavy-pattern correctness and timeout tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestBreak500Limit exercises 2925 overlapping wildcard patterns with
varied input strategies. TestMeasureNextStates instruments per-byte
NFA traversal to measure state expansion and dedup ratios. Both skip
under -short due to pattern build time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 18, 2026

  ┌──────────────────────┬─────────┬───────────┬────────────────────────┐
  │      Benchmark       │  main   │ dedup_fix │         change         │
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤
  │ CityLots             │ 5.937µs │ 5.774µs   │ -2.75% (p=0.026)       │
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤
  │ 8259Example          │ 5.631µs │ 5.660µs   │ ~ (p=0.093, no change) │
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤
  │ ShellstyleMultiMatch │ 32.17µs │ 32.30µs   │ ~ (no change)          │
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤
  │ PathologicalEpsilon  │ 27.70µs │ 8.54µs    │ -69.18% (p=0.002)      │
  └──────────────────────┴─────────┴───────────┴────────────────────────┘

TestBreak500Limit and TestMeasureNextStates build 2925 patterns,
which is too slow for CI under the race detector. Run them with
go test -tags stress.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 18, 2026

Got to this summary for BenchmarkShellstyleMultiMatch, so I think we want to take some of these benchmarks with the ones in #492, and then revisit. I think this and any nfa2dfa approach will stack up. I don't think we need all the tests in this PR right now, just showing my work to see whether this matters, and I think it does.

"This benchmark only has ~25 patterns. The automaton is small enough that the epsilon closures are trivially sized — no shared-table dedup opportunities.

The table-pointer dedup is targeting a specific scenario: after many merges, spinner states create splice states pointing to the same underlying tables. With 25 patterns there aren't enough merges to create significant duplication.

To find a real win, we'd need a benchmark with hundreds of merged shell-style patterns on the same field. That's what lazy_dfa's ShellstyleManyMatchers does — and it showed gains at higher pattern counts. This branch doesn't have
that benchmark though.

The honest conclusion: the table-pointer dedup on dedup_fix is a correctness/safety improvement (with the defense) but doesn't produce measurable throughput gains on any existing benchmark. The build-time closure optimization is
real but the closures in these tests are already small. Want to add a stress benchmark, or is this branch good as-is?"

@timbray
Copy link
Owner

timbray commented Feb 18, 2026 via email

sayrer and others added 3 commits February 19, 2026 15:31
Tests closure sizes and matching speed across workloads with nested
quantifier regexps and overlapping character classes, where the
table-pointer dedup has the most impact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 20, 2026

Now we're getting somewhere with this test I added called TestEpsilonClosureSizes:

  ┌──────────────────────────┬────────────────────────┬─────────────────────────┬────────────┬─────────────┬────────────────────┬─────────────────────┬─────────┐
  │         Workload         │ Closure Entries (main) │ Closure Entries (dedup) │ Max (main) │ Max (dedup) │ Matches/sec (main) │ Matches/sec (dedup) │ Speedup │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ 6-regexp + 12-shell      │ 4,392                  │ 4,371                   │ 19         │ 13          │ 240K               │ 380K                │ 1.6x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ 20 nested regexps        │ 334                    │ 261                     │ 60         │ 40          │ 1.1M               │ 3.4M                │ 3.1x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ deeply nested            │ 342                    │ 220                     │ 46         │ 26          │ 681K               │ 3.2M                │ 4.7x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ overlapping char classes │ 240                    │ 156                     │ 48         │ 24          │ 1.6M               │ 4.5M                │ 2.9x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ shell + deep overlap     │ 3,488                  │ 3,410                   │ 37         │ 21          │ 204K               │ 297K                │ 1.5x    │
  └──────────────────────────┴────────────────────────┴─────────────────────────┴────────────┴─────────────┴────────────────────┴─────────────────────┴─────────┘

  The deeply-nested workload ((((a?)*b?)*c?)* patterns) shows the biggest win:
  max closure drops from 46→26, closure entries from 342→220 (36% fewer), and
  matching is 4.7x faster. The overlapping-char-classes workload ((([abc]?)*)+,
  (([bcd]?)*)+, etc.) is nearly 3x faster with max closure cut in half (48→24).

  The pattern that triggers dedup_fix most strongly is nested quantifiers over
  overlapping character classes — (([abc]?)*)+ merged with (([bcd]?)*)+ creates
  many epsilon paths converging on shared tables.

Remove exploratory/proof-of-concept test files (dedup_500_test.go,
dedup_bench_test.go, dedup_measure_test.go, dedup_timeout_test.go).

Replace TestEpsilonClosureSizes with TestTablePointerDedup which
asserts max closure bounds and expected match counts. Add
BenchmarkTablePointerDedup using testing.B.Loop() for the same
workloads.

Retained: TestPathologicalCorrectness (dedup_correctness_test.go)
and TestTablePointerDedupPreservesFieldTransitions (epsi_closure_test.go).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 20, 2026

Here are the comparison results (median of 3 runs, 3s benchtime each):

go test -bench=BenchmarkTablePointerDedup -benchtime=3s -count=3 -run='^$'

  ┌──────────────────────────┬──────────────┬───────────────────┬─────────┐
  │         Workload         │ main (ns/op) │ dedup_fix (ns/op) │ Speedup │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ 6-regexps-12-shell       │        8,874 │             7,900 │   1.12x │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ 20-nested-regexps        │        2,006 │               902 │   2.22x │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ deeply-nested            │        3,349 │               941 │   3.56x │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ overlapping-char-classes │        1,520 │               674 │   2.25x │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ shell+deep-overlap       │       12,243 │            10,260 │   1.19x │
  └──────────────────────────┴──────────────┴───────────────────┴─────────┘


  All workloads are 0 allocs on both branches. The dedup_fix branch shows
  significant improvement on nested quantifier regexp workloads:

  - deeply-nested: 3.56x faster (3,349 → 941 ns/op)
  - 20-nested-regexps: 2.22x faster (2,006 → 902 ns/op)
  - overlapping-char-classes: 2.25x faster (1,520 → 674 ns/op)
  - Mixed workloads with shell patterns show more modest 12-19% gains since
    shell patterns have zero table sharing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestTablePointerDedup already covers the same patterns and verifies
both match correctness and structural properties.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timbray
Owner

timbray commented Feb 22, 2026

This PR is starting to feel like buying a PC in the 1990s - wait a little while and next month's model will be way better. My Quamina time has been rather limited, so I haven't been keeping a close eye on this. Let me know when you think you've got it where you want it.

Replace the local map in the dedup pass with generation-based tracking
using the existing lastVisitedGen field. The NFA walk compares against
a snapshotted generation while the dedup pass increments and compares
against the current value, so they share the field without interference.
This eliminates the per-closure map allocation while keeping only one
generation field (plus closureRep pointer) on smallTable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 24, 2026

OK, will give it a close review and probably take it. But could you check one thing? Do you really need two generation fields in smallTable? I can guarantee you don't need more than one for build-time operations because those are guaranteed single-threaded.

Facts. We can jam this into less memory with no perf loss.

  ┌────────────────────────────────────────────┬─────────┬─────────┬───────────────┐
  │                 Benchmark                  │  main   │ branch  │  improvement  │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ 8259Example                                │ 5.683µs │ 5.676µs │ ~ (no change) │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ ShellStyleBuildTime                        │ 48.02ms │ 48.21ms │ ~ (no change) │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/6-regexps-12-shell       │ 8.804µs │ 7.747µs │ -12%          │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/20-nested-regexps        │ 1968ns  │ 893ns   │ -55%          │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/deeply-nested            │ 3322ns  │ 934ns   │ -72%          │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/overlapping-char-classes │ 1500ns  │ 666ns   │ -56%          │
  ├────────────────────────────────────────────┼─────────┼─────────┼───────────────┤
  │ TablePointerDedup/shell+deep-overlap       │ 12.43µs │ 10.04µs │ -19%          │
  └────────────────────────────────────────────┴─────────┴─────────┴───────────────┘

  Zero regressions on existing benchmarks, 12-72% faster on the dedup-heavy workloads.
   Geomean -37%.

@sayrer
Contributor Author

sayrer commented Feb 24, 2026

Clear regression:

  ┌──────────────┬──────────────────────────┬──────────────────────────┐
  │              │           main           │          branch          │
  ├──────────────┼──────────────────────────┼──────────────────────────┤
  │ Patterns/sec │ 3502-3559 (median ~3524) │ 2944-2967 (median ~2955) │
  └──────────────┴──────────────────────────┴──────────────────────────┘

  That's a consistent ~16% drop in build speed. The NFA walk revisits caused by the
  shared generation field are adding real overhead during AddPattern. The local map
  version would avoid this. Want me to revert to that?

@sayrer
Contributor Author

sayrer commented Feb 24, 2026

This seems to cost on patterns/sec. I will look again later.

sayrer and others added 2 commits February 24, 2026 08:49
Reverting d2563c6 and 684a093 which merged the dedup generation tracking
into lastVisitedGen. Using a separate closureRepGen field avoids coupling
the NFA walk and dedup pass, with no measurable performance difference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the map[*faState]bool used for closure set membership in
traverseEpsilons with a generation counter on faState. This eliminates
map allocations and hash operations in the hot path, improving
Patterns/sec by ~55% (3170 -> 4890) with no impact on match throughput.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
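The closure-set optimization in the commit message above replaces a per-call map with a field compare. A minimal sketch of the pattern, with hypothetical names (the real field is described as closureSetGen-style on faState):

```go
package main

import "fmt"

// Hypothetical faState; membership in the current closure is a field
// compare instead of a map[*faState]bool lookup allocated per call.
type faState struct{ closureSetGen uint64 }

var closureGeneration uint64

// collectClosure keeps each state at most once per call. Bumping the
// global generation at entry invalidates all prior markers at zero
// cost: no map alloc, grow, or rehash in the merge hot path.
func collectClosure(states []*faState) []*faState {
	closureGeneration++
	var closure []*faState
	for _, s := range states {
		if s.closureSetGen == closureGeneration {
			continue // already in this closure
		}
		s.closureSetGen = closureGeneration
		closure = append(closure, s)
	}
	return closure
}

func main() {
	a, b := &faState{}, &faState{}
	fmt.Println(len(collectClosure([]*faState{a, b, a}))) // prints 2
	// A fresh call starts a new generation, so states are visitable again.
	fmt.Println(len(collectClosure([]*faState{a, b}))) // prints 2
}
```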
@sayrer
Contributor Author

sayrer commented Feb 24, 2026

The patch here gets to within ~2-3% of main in patterns/sec. That's just the stuff we've been talking about.

The last commit here massively improves patterns/sec with no regression on other things.

Merging the generations causes a ~18% regression in patterns/sec.

@timbray
Owner

timbray commented Feb 24, 2026

Merging the generations causes a ~18% regression in patterns/sec.

By "merging the generations" you mean using the same smallTable field for deduping both in epsi_closure and FA construction? That is shocking, makes absolutely no sense to me. Any idea why?

@sayrer
Contributor Author

sayrer commented Feb 24, 2026

Here's a good explanation from Claude. The answer is we do need both closureRepGen and lastVisitedGen, which you asked about above. For the big win, that was me looking at a flamegraph, not Claude.

"Sure. Here's the walkthrough:

Setup: epsilonClosure increments closureGeneration once and snapshots it into bufs.generation. The NFA walk (closureForNfa) then uses table.lastVisitedGen == bufs.generation to skip already-visited tables.

The problem: As the NFA walk progresses, it calls closureForStateWithBufs for each state. That function increments closureGeneration twice (once for the closure set pass, once for the dedup pass). In the
merged version, the dedup pass writes:

s.table.lastVisitedGen = closureGeneration // now a NEWER value than bufs.generation

This overwrites the value the NFA walk originally set. So when closureForNfa later encounters that same table (e.g. via a different path in the graph), it checks:

if table.lastVisitedGen == bufs.generation // false! dedup overwrote it with a newer value

The check fails, so the NFA walk re-enters the table. It re-iterates all table.steps and table.epsilons, calling closureForStateWithBufs on each (which early-returns since epsilonClosure is already set) and
recursing into child tables. That traversal overhead is wasted work.

With separate closureRepGen: The dedup pass writes to s.table.closureRepGen instead, never touching lastVisitedGen. The NFA walk's visited markers stay intact, so no revisits happen.

With 7409 smallTables and closures up to 900, the number of corrupted visited markers — and resulting re-traversals — adds up to that ~18% regression."
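The interference described in that walkthrough can be modeled in a few lines. This is a toy reconstruction, not quamina code; the field names mirror the ones discussed above, and only the clobbering mechanism is the point.

```go
package main

import "fmt"

// Minimal model of the shared-vs-separate generation field question.
type table struct {
	lastVisitedGen uint64 // visited marker for the outer NFA walk
	closureRepGen  uint64 // separate marker for the inner dedup pass
}

var generation uint64

// countVisits walks one table twice (as if reached via two graph paths),
// with an inner dedup pass in between that bumps the global generation.
// If shareField is true, the inner pass writes to lastVisitedGen (the
// merged design), clobbering the outer walk's visited marker.
func countVisits(shareField bool) int {
	t := &table{}
	generation++
	outerGen := generation // outer walk snapshots its generation
	visits := 0

	visitIfUnseen := func() {
		if t.lastVisitedGen == outerGen {
			return // marker intact: skip the revisit
		}
		t.lastVisitedGen = outerGen
		visits++
	}

	visitIfUnseen() // first encounter

	generation++ // inner dedup pass runs with a newer generation
	if shareField {
		t.lastVisitedGen = generation // overwrites the outer marker
	} else {
		t.closureRepGen = generation // separate field: marker survives
	}

	visitIfUnseen() // second encounter via another path
	return visits
}

func main() {
	fmt.Println(countVisits(true))  // merged field: table re-visited, prints 2
	fmt.Println(countVisits(false)) // separate fields: visited once, prints 1
}
```

Multiplied across thousands of smallTables, those spurious revisits are the wasted traversal work behind the regression.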

Same approach as the closure set optimization: use closureSetGen on
faState instead of allocating a map[*faState]bool per call. Eliminates
map alloc/grow/rehash overhead in the merge hot path, improving
Patterns/sec by ~18% (4890 -> 5750).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 24, 2026

Figured I might as well get the other one of these. Now we have faster patterns/sec with better performance on the tougher regex cases, detected at build time.

 ~5750 Patterns/sec, up from ~4890. Another ~18% improvement on top of the previous change.

  ┌──────────────┬───────┬─────────────┬───────────────┐
  │    Metric    │ main  │ prev commit │ + this change │
  ├──────────────┼───────┼─────────────┼───────────────┤
  │ Patterns/sec │ ~3170 │ ~4890       │ ~5750         │
  ├──────────────┼───────┼─────────────┼───────────────┤
  │ vs main      │ —     │ +54%        │ +81%          │
  └──────────────┴───────┴─────────────┴───────────────┘

  Events/sec unchanged (~40K).

@timbray
Owner

timbray commented Feb 25, 2026

So, what we have is a significant improvement in FA build time for big complex patterns like in TestShellStyleBuildTime and a just-barely-significant slowdown in match time, both in TestShellStyleBuildTime and Benchmark8259Example. It would be nice if we could have the first but not the second. Now I'm going to look closer.

qnew/494> git checkout main; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; git checkout dedup_fix ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec'    
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
Adds/sec 177.9
189175/sec
Adds/sec 181.8
190162/sec
Adds/sec 178.9
190067/sec
Switched to branch 'dedup_fix'
    v2_bench_test.go:60: Adds/sec 176.7
    v2_bench_test.go:73: 187153/sec
    v2_bench_test.go:60: Adds/sec 179.0
    v2_bench_test.go:73: 187905/sec
    v2_bench_test.go:60: Adds/sec 178.2
    v2_bench_test.go:73: 186033/sec
qnew/494> git checkout main ; go test -run=TestShellStyleBuildTime | grep /sec ; go test -run=TestShellStyleBuildTime | grep /sec; go test -run=TestShellStyleBuildTime | grep /sec; git checkout dedup_fix | grep /sec; go test -run=TestShellStyleBuildTime | grep /sec; go test -run=TestShellStyleBuildTime | grep /sec; go test -run=TestShellStyleBuildTime| grep /sec
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
Patterns/sec: 3435.2
Huge-machine events/sec: 48398.7
Patterns/sec: 3452.7
Huge-machine events/sec: 47803.8
Patterns/sec: 3450.6
Huge-machine events/sec: 47277.9
Switched to branch 'dedup_fix'
Patterns/sec: 5881.2
Huge-machine events/sec: 45620.2
Patterns/sec: 5955.1
Huge-machine events/sec: 46486.5
Patterns/sec: 5915.4
Huge-machine events/sec: 46392.3

@sayrer
Contributor Author

sayrer commented Feb 25, 2026

I think the match-time ones will get crushed if we start doing nfa2dfa. So, the idea is that build time pays off, but not in the current match algorithm. Let's not forget this one covers the tough regex ones too. Not a current use-case, since it was just added, but really difficult to cover. I think it's a local maximum, but I could be wrong.

Owner

@timbray timbray left a comment


Nothing too material here, mostly code pedantry. Except that I realized I don't understand the generation-based deduping. Consider some of this a note to myself. Will be back for more later.

@sayrer
Contributor Author

sayrer commented Feb 26, 2026

oh jeez. Tough crowd. I'll look at them, but please do remember that this one covers algorithmic explosions in regexen. not only getting the best performance on simple ones, but also not letting the wild ones spin out. "Threading the needle" or "Scylla and Charybdis".

@sayrer
Contributor Author

sayrer commented Feb 26, 2026

We have BenchmarkShellStyleBuildTime so we can go after that events/sec regression with benchstat.

Committed dedup_fix is 2.04% slower than main on events/sec (49.52ms vs 48.53ms
per op), and it's statistically significant (p=0.000, n=10).

So the committed dedup_fix branch with all generation fields on the structs
actually hurts event-matching performance slightly — the extra 24 bytes on
smallTable (closureRepGen + closureRep) and 8 bytes on faState
(closureSetGen) are making things measurably worse, not better.  

We can move that around, and get a win on both. But the Patterns/sec improvement won't be as big. Maybe we should check my lazy_dfa fork and see if that trade-off matters there.

@sayrer
Contributor Author

sayrer commented Feb 26, 2026

Hmm, this is perplexing. The NFA version is so fast now that the DFA is not worth doing. I would recommend taking this patch with its 2% regression on matching (that is now twice as fast) and then we can inspect more.

@sayrer
Contributor Author

sayrer commented Feb 26, 2026

OK, yeah. The DFA version can wing this one, so it's not worth debating 2% on the already really fast NFA version.

  ┌───────────────────────────────┬─────────┬──────────┬─────────────┐
  │           Benchmark           │  main   │ lazy_dfa │   Change    │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ ShellstyleMultiMatch          │ 31.80µs │ 10.29µs  │ -67.65%     │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ CityLots                      │ 5.862µs │ 5.761µs  │ -1.72%      │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ NumberMatching                │ 881.2ns │ 853.1ns  │ -3.18%      │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ JsonFlattener (eval, context) │ 426.6ns │ 407.2ns  │ -4.55%     │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ JsonFlattener (eval, last)    │ 9.845µs │ 9.682µs  │ -1.65%      │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ JsonFlattener (eval, middle)  │ 9.158µs │ 9.059µs  │ -1.08%      │
  ├───────────────────────────────┼─────────┼──────────┼─────────────┤
  │ 8259Example                   │ 5.695µs │ 4.813µs  │ ~-15% (n=1) │
  └───────────────────────────────┴─────────┴──────────┴─────────────┘

  The lazy DFA branch shows a ~68% speedup on ShellstyleMultiMatch (the primary
  target), with small improvements across other benchmarks and 0 allocs/op on
  all hot paths. No regressions.

My work is here: https://github.com/sayrer/quamina/tree/lazy_dfa

It's all messy, just intended to see if the CS and implementations line up. They seem to. (Claude said "yeah, the NFA is already really good, this will be challenging")

@timbray commented Feb 26, 2026

You explained once but I can't find it, so pardon the repeat: How are you testing DFA performance?

@sayrer commented Feb 26, 2026

Quamina's core matching engine is an NFA (nondeterministic finite automaton). For every byte of input, it tracks all active NFA states simultaneously — for shellstyle patterns like `*abc*`, the number of active states grows combinatorially as partial matches overlap.

The lazy DFA converts this on the fly: the set of NFA states you're in becomes a single DFA state. Once you've seen the transition (stateSet, byte) → nextStateSet, you cache it. The next time you see the same situation, it's a single pointer lookup instead of stepping every NFA state.

Key design choices:

  1. Lazy, not eager — We don't precompute all possible DFA states (which can be exponential). We only build states we actually visit during matching.
  2. Per-goroutine caches — No locks, no contention. Each goroutine builds its own DFA as it matches.
  3. SIMD transition lookup — Cached transitions use bytes.IndexByte, which is backed by SIMD-accelerated assembly on amd64/arm64. This makes the hot path very fast.
  4. Memory budget (our latest work) — Instead of a fixed state count (3400), we use an 8MB byte budget. A simple pattern like foo produces DFA states with ~2 NFA states
    each (~160 bytes), so we cache tens of thousands. A complex pattern like abcdefgh* produces states with hundreds of NFA states (~3KB+), so we cache fewer. The budget
    adapts automatically.
  5. Graceful degradation — When the cache fills, we fall back to NFA traversal for cache misses but snap back to the cached DFA whenever we hit a known state. Existing cached
    transitions still work at full speed.
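The caching scheme in points 1–5 can be sketched as a toy (all names here are hypothetical, not Quamina's actual types):

```go
package main

import (
	"fmt"
	"sort"
)

// nfaState: transitions[b] lists the NFA states reachable on byte b.
type nfaState struct {
	transitions map[byte][]int
	accepting   bool
}

// dfaState represents one *set* of NFA states, with its transitions
// filled in lazily, the first time each one is taken.
type dfaState struct {
	nfaSet    []int
	next      map[byte]*dfaState
	accepting bool
}

type lazyDFA struct {
	nfa   []nfaState
	cache map[string]*dfaState // canonical NFA-set -> interned DFA state
}

func setKey(set []int) string { return fmt.Sprint(set) }

// stateFor interns the DFA state for a sorted set of NFA states.
func (d *lazyDFA) stateFor(set []int) *dfaState {
	k := setKey(set)
	if s, ok := d.cache[k]; ok {
		return s
	}
	s := &dfaState{nfaSet: set, next: map[byte]*dfaState{}}
	for _, id := range set {
		if d.nfa[id].accepting {
			s.accepting = true
		}
	}
	d.cache[k] = s
	return s
}

// step returns the DFA state after reading b, computing and caching the
// transition on first use; later visits are a single map lookup instead
// of stepping every NFA state in the set.
func (d *lazyDFA) step(s *dfaState, b byte) *dfaState {
	if n, ok := s.next[b]; ok {
		return n // warm path: no NFA set tracking at all
	}
	seen := map[int]bool{}
	var out []int
	for _, id := range s.nfaSet {
		for _, t := range d.nfa[id].transitions[b] {
			if !seen[t] {
				seen[t] = true
				out = append(out, t)
			}
		}
	}
	sort.Ints(out) // canonical order, so equal sets share one DFA state
	n := d.stateFor(out)
	s.next[b] = n
	return n
}

func main() {
	// Tiny NFA matching the input "ab": 0 -a-> 1 -b-> 2 (accepting).
	nfa := []nfaState{
		{transitions: map[byte][]int{'a': {1}}},
		{transitions: map[byte][]int{'b': {2}}},
		{accepting: true},
	}
	d := &lazyDFA{nfa: nfa, cache: map[string]*dfaState{}}
	s := d.stateFor([]int{0})
	for _, b := range []byte("ab") {
		s = d.step(s, b)
	}
	fmt.Println(s.accepting)
}
```

The real design adds the per-goroutine ownership, memory budget, and NFA fallback described above; this only shows the core interning-plus-lazy-transition idea.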

Why it pays off

The ShellstyleMultiMatch benchmark tells the story: 31.8µs → 10.3µs (68% faster). This is a warm-cache workload where the same value patterns repeat — exactly the real-world
scenario of an event-matching service processing a stream of events.

The 0 allocs/op on all hot paths means the lazy DFA adds no GC pressure once the cache is warm.

The cost is memory (up to 8MB per field per goroutine) and the one-time overhead of building each DFA state on first visit. But once built, you get DFA speed: O(1) work per
byte, no set tracking, no epsilon closure expansion.

Where it pays off most

  • Repeated matching against the same patterns (event routing, log filtering) — the cache warms up quickly and subsequent matches are near-DFA speed
  • Shellstyle patterns with multiple wildcards — these create large NFA state sets where the per-byte cost difference between NFA and DFA is largest
  • High-throughput workloads — the 0-alloc hot path means throughput scales without GC pauses

Where it helps less: one-shot matches against unique inputs (cache never warms), or simple exact-match patterns (NFA is already fast, few states to track).

The architecture is sound. The main lever now is tuning — the 8MB budget, and potentially pattern-specific heuristics for when lazy DFA is worth enabling vs. just running
the NFA directly.

──────────────────────

@sayrer commented Feb 26, 2026

go test -bench=BenchmarkShellstyleMultiMatch -benchmem -count=6 -run='^$'

@sayrer commented Feb 26, 2026

But none of this work is in this patch. It's only to point out that we should not sweat 2%, since we can smash it.

@timbray commented Feb 26, 2026

OK, I got it, I had to do some hand-tracing. The closureRepGen just tells you whether you've seen this or not in this particular closure. I think some of the commentary is misleading. Anyhow, sorry for persistently lagging behind you, I have a bunch of boring year-end clerical work occupying bandwidth. I did drop a couple of suggestions in the N-1 review and sometime probably before this time tomorrow I'll finish my pass through. Unless I find something else that bothers me, I guess I'm bought in on the core idea of this PR. But wow, so many commits.

BTW you're still going to have a tough time selling lazy DFAs because, as I've said before, building DFAs once per thread is horribly wasteful of memory. In my experience with Quamina's predecessor, basically all the production deployments were multi-threaded.

Need to get this landed for me to make any progress on DFA budgeting because every freaking piece of nontrivial FA-generating code needs to be refactored to pass a MemoryHeadroom counter back and forth.
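The generation-based deduping discussed above can be sketched roughly like this (the field name loosely follows the PR's closureGen; the surrounding types are hypothetical):

```go
package main

import "fmt"

// faState carries a generation stamp instead of sitting in a "seen" set.
type faState struct {
	closureGen uint64 // generation at which this state last joined a closure
}

type closureBuffers struct {
	gen uint64 // the current closure's generation
}

// begin starts a new closure; bumping the counter "clears" every
// state's membership without touching any state or any map.
func (b *closureBuffers) begin() { b.gen++ }

// addIfNew reports whether s was newly added to the current closure.
func (b *closureBuffers) addIfNew(s *faState) bool {
	if s.closureGen == b.gen {
		return false // already in this closure: duplicate, skip it
	}
	s.closureGen = b.gen
	return true
}

func main() {
	var bufs closureBuffers
	s := &faState{}

	bufs.begin()
	fmt.Println(bufs.addIfNew(s), bufs.addIfNew(s)) // first add, then dedup

	bufs.begin() // next closure: no clearing pass needed
	fmt.Println(bufs.addIfNew(s))
}
```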

@sayrer commented Feb 26, 2026

Anyhow, sorry

I don't understand Canadian :)

building DFAs once per thread is horribly wasteful of memory. In my experience with Quamina's predecessor, basically all the production deployments were multi-threaded.

Oh, that's in there. I do a lot of sloppy ones to see if an idea is worth exploring. I have the memory footprint one already, but it needs the good NFA and eager DFA first. The problem naturally arises when you try to do a memory budget.

You're right to insist on clean patches (do squash-and-merge), but I check them out with messy ones before polishing a turd.

It's also OK to insist that the main branch is pristine. But the natural result will be PRs and branches that iterate way faster, unless people don't share their work.

(Ed: the squash-and-merge option is available in the green button at the bottom; it makes this PR one commit in Quamina's git history)

@timbray left a comment

OK, I'm done, if you could glance at the non-resolved review issues.

@sayrer commented Feb 28, 2026

OK, these all seem reasonable. I'll get them in the morning.

For the lazy DFA design, you can get it to be a global cache, and this is better in some ways. Once one goroutine discovers an optimization, everyone gets it. I hacked it out in that branch I would not submit as a PR. I think there would need to be an intermediate stage where the lazy DFA is wastefully per-goroutine (as you point out). But this is just an investigation: can you get it faster by doing this? I don't know for sure because the NFA is getting better, but I suspect so. Once that is proven, then there's a tricky global cache management problem. I think it's not possible to aim for the end state right away, since it will be unreviewable. But that's where my head is. It has to be in distinct steps: first verify that the DFA actually wins while wasting memory, then see if it still wins with synchronization wrappers.

sayrer and others added 2 commits February 28, 2026 08:43
- Move closureBuffers type declaration above first use, add field comments
- Add comment documenting order-dependent fieldTransitions comparison
- Rename closureForStateWithBufs → closureForState (test helper → closureForStateNoBufs)
- Rename flattenEpsilonTargets/flattenCollect → simplifySplices/simplifyCollect
- Rename closureRepGen → closureGen (symmetrical with lastVisitedGen)
- Update dedup test to use overlapping fieldTransitions in different order

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timbray timbray merged commit d915467 into timbray:main Feb 28, 2026
7 checks passed