kaizen: fix duplicate states during closure traversal #494
timbray merged 37 commits into timbray:main from
Conversation
Nested quantifiers like (([abc]?)*)+ create epsilon loops that cause duplicate states to compound exponentially. The previous fix used sort+compact at a threshold of 500 states. This replaces it with an O(n) in-place dedup using a per-traversal generation counter, lowering the threshold to 64 to catch growth earlier. Zero overhead for the common case since the dedup only activates when nextStates exceeds 64. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
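The generation-counter approach described in this commit can be sketched roughly as below. This is a minimal standalone Go sketch, not quamina's actual code; the `faState`, `gen`, and `dedupInPlace` names are illustrative only.

```go
package main

import "fmt"

// faState stands in for an NFA state; gen is the per-traversal
// generation marker (field and type names are illustrative).
type faState struct {
	id  int
	gen uint64
}

var traversalGen uint64 // bumped once per dedup pass

// dedupInPlace removes duplicate state pointers in O(n) without
// allocating a map: bump the generation once, then keep only states
// not yet stamped with the current generation.
func dedupInPlace(states []*faState) []*faState {
	traversalGen++
	out := states[:0] // reuse the backing array in place
	for _, s := range states {
		if s.gen != traversalGen {
			s.gen = traversalGen
			out = append(out, s)
		}
	}
	return out
}

func main() {
	a, b := &faState{id: 1}, &faState{id: 2}
	states := []*faState{a, b, a, a, b}
	fmt.Println(len(dedupInPlace(states))) // 2
}
```

The win over sort+compact is that no comparison ordering is needed and each element is touched exactly once; gating it behind a size threshold (64 here) keeps the common small-closure case free of any overhead.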
…in epsilon closure Two faState pointers can share the same *smallTable (e.g. the + quantifier wraps an inner state's table in a new faState). During epsilon closure, both get added as distinct states producing duplicate destinations that compound exponentially. Dedup by tracking seen tables in closureBuffers, skipping states whose table was already visited. This removes the need for the runtime generation-counter dedup in traverseNFA. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nter dedup Replace tableSeen map in closureBuffers with a global generation counter (tableSeenGeneration) and a per-smallTable tableSeenGen field, eliminating the map allocation and clear overhead. Cache the active level's set pointer in transmap.activeSet on push(), removing the tm.levels[tm.depth].set indirection from the hot path in transmap.add(). This was causing a ~7% regression in ShellstyleMultiMatch due to the slice index + struct field dereference on every call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve transmap struct conflict: take main's flat [][]*fieldMatcher design with separate fieldSet on nfaBuffers, consistent with the auto-resolved traverseNFA and n2dNode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Has the benchmark output been updated following all those additional pushes? Because the numbers are kind of disappointing as written. |
Track the representative faState per smallTable instead of just a bool. When a second state shares the same table, compare fieldTransitions: if they match, skip (fast path); if they differ, include the state in the closure to preserve correctness. Add TestTablePointerDedupPreservesFieldTransitions to cover the case where two faState nodes share a smallTable with different field matchers, verifying epsilon closure, NFA traversal, and DFA conversion all preserve both matchers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explain in closureBuffers why tableRep exists, what invariant it relies on (same *smallTable implies same state in current merge paths), and how the defense works when the invariant is violated (include the state, skip recursion). Add function-level doc to traverseEpsilons describing the dedup and fallback behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Yeah, they aren't good. But they do show we're in the right department, and there is a new test at least. I was looking in the context of DFA conversion https://github.com/sayrer/quamina/tree/lazy_dfa. (I will keep hunting tomorrow, but negative results are valuable too) |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merges 12 multi-wildcard shellstyle + 6 pathological regexp patterns on the same field, exercising the table-pointer dedup in traverseEpsilons. Shows 69% speedup vs main (27.7µs → 8.5µs per op). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies match results for the merged shell-style + regexp pattern mix that exercises table-pointer dedup. Ground truth confirmed identical on both main and dedup_fix branches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the removed >500 sort+compact with a map-based dedup that fires when nextStates exceeds 64 entries. This prevents combinatorial blowup when many closure states step to the same next faState, without adding overhead in the common case. Also adds heavy-pattern correctness and timeout tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestBreak500Limit exercises 2925 overlapping wildcard patterns with varied input strategies. TestMeasureNextStates instruments per-byte NFA traversal to measure state expansion and dedup ratios. Both skip under -short due to pattern build time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
TestBreak500Limit and TestMeasureNextStates build 2925 patterns, which is too slow for CI under the race detector. Run them with go test -tags stress. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Got to this summary for BenchmarkShellstyleMultiMatch, so I think we want to take some of these benchmarks with the ones in #492, and then revisit. I think this and any nfa2dfa approach will stack up: "This benchmark only has ~25 patterns. The automaton is small enough that the epsilon closures are trivially sized — no shared-table dedup opportunities. The table-pointer dedup is targeting a specific scenario: after many merges, spinner states create splice states pointing to the same underlying tables. With 25 patterns there aren't enough merges to create significant duplication. To find a real win, we'd need a benchmark with hundreds of merged shell-style patterns on the same field. That's what lazy_dfa's ShellstyleManyMatchers does — and it showed gains at higher pattern counts. This branch doesn't have that benchmark though. The honest conclusion: the table-pointer dedup on dedup_fix is a correctness/safety improvement (with the defense) but doesn't produce measurable throughput gains on any existing benchmark. The build-time closure optimization is real but the closures in these tests are already small. Want to add a stress benchmark, or is this branch good as-is?" |
|
The benchmark that most stresses out the epsilons
is TestShellStyleBuildTime(); glance at it and you can see how it’s easy to
adjust the size/complexity up and down. The one that actually forced me
to add that dedupe-at-500 count is TestToxicStack. -T
…On Feb 18, 2026 at 11:00:32 AM, RS ***@***.***> wrote:
*sayrer* left a comment (timbray/quamina#494)
Got to this summary for BenchmarkShellstyleMultiMatch, so I think we want
to take some of these benchmarks with the ones in #492
<#492>, and then revisit. I think
this and any nfa2dfa approach will stack up:
"This benchmark only has ~25 patterns. The automaton is small enough that
the epsilon closures are trivially sized — no shared-table dedup
opportunities.
The table-pointer dedup is targeting a specific scenario: after many
merges, spinner states create splice states pointing to the same underlying
tables. With 25 patterns there aren't enough merges to create significant
duplication.
To find a real win, we'd need a benchmark with hundreds of merged
shell-style patterns on the same field. That's what lazy_dfa's
ShellstyleManyMatchers does — and it showed gains at higher pattern counts.
This branch doesn't have
that benchmark though.
The honest conclusion: the table-pointer dedup on dedup_fix is a
correctness/safety improvement (with the defense) but doesn't produce
measurable throughput gains on any existing benchmark. The build-time
closure optimization is
real but the closures in these tests are already small. Want to add a
stress benchmark, or is this branch good as-is?"
|
Tests closure sizes and matching speed across workloads with nested quantifier regexps and overlapping character classes, where the table-pointer dedup has the most impact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Now we're getting somewhere with this test I added called |
Remove exploratory/proof-of-concept test files (dedup_500_test.go, dedup_bench_test.go, dedup_measure_test.go, dedup_timeout_test.go). Replace TestEpsilonClosureSizes with TestTablePointerDedup which asserts max closure bounds and expected match counts. Add BenchmarkTablePointerDedup using testing.B.Loop() for the same workloads. Retained: TestPathologicalCorrectness (dedup_correctness_test.go) and TestTablePointerDedupPreservesFieldTransitions (epsi_closure_test.go). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Here are the comparison results (median of 3 runs, 3s benchtime each):
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestTablePointerDedup already covers the same patterns and verifies both match correctness and structural properties. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
This PR is starting to feel like buying a PC in the 1990s - wait a little while and next month's model will be way better. My Quamina time has been rather limited, so I haven't been keeping a close eye on this. Let me know when you think you've got it where you want it. |
Replace the local map in the dedup pass with generation-based tracking using the existing lastVisitedGen field. The NFA walk compares against a snapshotted generation while the dedup pass increments and compares against the current value, so they share the field without interference. This eliminates the per-closure map allocation while keeping only one generation field (plus closureRep pointer) on smallTable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Facts. We can jam this into less memory with no perf loss. |
|
This seems to cost on patterns/sec. I will look again later. |
Replace the map[*faState]bool used for closure set membership in traverseEpsilons with a generation counter on faState. This eliminates map allocations and hash operations in the hot path, improving Patterns/sec by ~55% (3170 -> 4890) with no impact on match throughput. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
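The map-to-generation-counter swap this commit describes can be illustrated with a small recursive closure walk. This is a hedged sketch under simplified types, not quamina's implementation: `closureSetGen`, `closureGeneration`, and `epsilonClosure` here are illustrative names.

```go
package main

import "fmt"

// faState with epsilon edges; closureSetGen stands in for what would
// otherwise be a per-call map[*faState]bool membership set.
type faState struct {
	id            int
	epsilons      []*faState
	closureSetGen uint64
}

var closureGeneration uint64

// epsilonClosure collects all states reachable via epsilon edges.
// Membership is tested against the current generation, so repeated
// calls allocate nothing beyond the result slice and epsilon loops
// terminate naturally.
func epsilonClosure(start *faState) []*faState {
	closureGeneration++
	var out []*faState
	var walk func(*faState)
	walk = func(s *faState) {
		if s.closureSetGen == closureGeneration {
			return // already in this closure
		}
		s.closureSetGen = closureGeneration
		out = append(out, s)
		for _, e := range s.epsilons {
			walk(e)
		}
	}
	walk(start)
	return out
}

func main() {
	a := &faState{id: 1}
	b := &faState{id: 2, epsilons: []*faState{a}}
	a.epsilons = []*faState{b} // epsilon loop a <-> b
	fmt.Println(len(epsilonClosure(a))) // 2
}
```

Avoiding the per-call map matters precisely because this runs in the hot path: map allocation, growth, and rehashing all disappear, which is consistent with the Patterns/sec gain reported above.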
By "merging the generations" you mean using the same smallTable field for deduping both in epsi_closure and FA construction? That is shocking, makes absolutely no sense to me. Any idea why? |
|
Here's a good explanation from Claude. The answer is we do need both closureRepGen and lastVisitedGen, which you asked about above. For the big win, that was me looking at a flamegraph, not Claude.

"Sure. Here's the walkthrough:

Setup: epsilonClosure increments closureGeneration once and snapshots it into bufs.generation. The NFA walk (closureForNfa) then uses table.lastVisitedGen == bufs.generation to skip already-visited tables.

The problem: As the NFA walk progresses, it calls closureForStateWithBufs for each state. That function increments closureGeneration twice (once for the closure set pass, once for the dedup pass). In the dedup pass it sets s.table.lastVisitedGen = closureGeneration, now a NEWER value than bufs.generation. This overwrites the value the NFA walk originally set. So when closureForNfa later encounters that same table (e.g. via a different path in the graph), it checks table.lastVisitedGen == bufs.generation, which is now false because the dedup overwrote it with a newer value. The check fails, so the NFA walk re-enters the table. It re-iterates all table.steps and table.epsilons, calling closureForStateWithBufs on each (which early-returns since epsilonClosure is already set), wasting that work on every revisit.

With separate closureRepGen: The dedup pass writes to s.table.closureRepGen instead, never touching lastVisitedGen. The NFA walk's visited markers stay intact, so no revisits happen.

With 7409 smallTables and closures up to 900, the number of corrupted visited markers — and resulting re-traversals — adds up to that ~18% regression." |
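The marker-corruption bug described above can be reduced to a few lines. This is a toy reproduction under assumed field names (`lastVisitedGen`, `closureGen`), showing why the inner pass needs its own generation field.

```go
package main

import "fmt"

type smallTable struct {
	lastVisitedGen uint64 // visited marker owned by the outer NFA walk
	closureGen     uint64 // separate marker for the inner dedup pass
}

var gen uint64

// outerMarkerSurvives simulates the two scenarios: the outer walk
// stamps a table with its snapshotted generation, then an inner pass
// bumps the counter and writes a marker. If both passes share
// lastVisitedGen, the outer marker is corrupted and the table would
// be revisited; with a separate field it survives.
func outerMarkerSurvives(shareField bool) bool {
	gen++
	walkGen := gen // outer walk snapshots its generation
	t := &smallTable{}
	t.lastVisitedGen = walkGen // outer walk marks the table as visited

	gen++ // inner dedup pass bumps the global counter
	if shareField {
		t.lastVisitedGen = gen // overwrites the outer walk's marker
	} else {
		t.closureGen = gen // separate field leaves it intact
	}
	return t.lastVisitedGen == walkGen
}

func main() {
	fmt.Println(outerMarkerSurvives(true))  // false: shared field corrupts the marker
	fmt.Println(outerMarkerSurvives(false)) // true: separate field keeps it intact
}
```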
Same approach as the closure set optimization: use closureSetGen on faState instead of allocating a map[*faState]bool per call. Eliminates map alloc/grow/rehash overhead in the merge hot path, improving Patterns/sec by ~18% (4890 -> 5750). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Figured I might as well get the other one of these. Now we have faster patterns/sec with better performance on the tougher regex cases, detected at build time. |
|
So, what we have is a significant improvement in FA build time for big complex patterns, like in the following runs:

qnew/494> git checkout main; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; git checkout dedup_fix ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec' ; go test -bench=^Benchmark8259Example$ -run=^$ | egrep '/sec'
qnew/494> git checkout main ; go test -run=TestShellStyleBuildTime | grep /sec ; go test -run=TestShellStyleBuildTime | grep /sec; go test -run=TestShellStyleBuildTime | grep /sec; git checkout dedup_fix | grep /sec; go test -run=TestShellStyleBuildTime | grep /sec; go test -run=TestShellStyleBuildTime | grep /sec; go test -run=TestShellStyleBuildTime| grep /sec
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
Patterns/sec: 3435.2
Huge-machine events/sec: 48398.7
Patterns/sec: 3452.7
Huge-machine events/sec: 47803.8
Patterns/sec: 3450.6
Huge-machine events/sec: 47277.9
Switched to branch 'dedup_fix'
Patterns/sec: 5881.2
Huge-machine events/sec: 45620.2
Patterns/sec: 5955.1
Huge-machine events/sec: 46486.5
Patterns/sec: 5915.4
Huge-machine events/sec: 46392.3 |
|
I think the match-time ones will get crushed if we start doing nfa2dfa. So, the idea is that build time pays off, but not in the current match algorithm. Let's not forget this one covers the tough regex ones too. Not a current use-case, since it was just added, but really difficult to cover. I think it's a local maximum, but I could be wrong. |
timbray
left a comment
There was a problem hiding this comment.
Nothing too material here, mostly code pedantry. Except that I realized I don't understand the generation-based deduping. Consider some of this a note to myself. Will be back for more later.
|
oh jeez. Tough crowd. I'll look at them, but please do remember that this one covers algorithmic explosions in regexen: not only getting the best performance on simple ones, but also not letting the wild ones spin out. "Threading the needle" or "Scylla and Charybdis". |
|
We have … We can move that around, and get a win on both. But the Patterns/sec improvement won't be as big. Maybe we should check my lazy_dfa fork and see if that trade-off matters there. |
|
Hmm, this is perplexing. The NFA version is so fast now that the DFA is not worth doing. I would recommend taking this patch with its 2% regression on matching (that is now twice as fast) and then we can inspect more. |
|
OK, yeah. The DFA version can wing this one, so it's not worth debating 2% on the already really fast NFA version. My work is here: https://github.com/sayrer/quamina/tree/lazy_dfa It's all messy, just intended to see if the CS and implementations line up. They seem to. (Claude said "yeah, the NFA is already really good, this will be challenging") |
|
You explained once but I can't find it, so pardon the repeat: How are you testing DFA performance? |
|
Quamina's core matching engine is an NFA (nondeterministic finite automaton). For every byte of input, it tracks all active NFA states simultaneously — for shellstyle … The lazy DFA converts this on-the-fly: the set of NFA states you're in becomes a single DFA state. Once you've seen the transition (stateSet, byte) → nextStateSet, you cache it. Key design choices: …
Why it pays off: The ShellstyleMultiMatch benchmark tells the story: 31.8µs → 10.3µs (68% faster). This is a warm-cache workload where the same value patterns repeat — exactly the real-world case. The 0 allocs/op on all hot paths means the lazy DFA adds no GC pressure once the cache is warm. The cost is memory (up to 8MB per field per goroutine) and the one-time overhead of building each DFA state on first visit. But once built, you get DFA speed: O(1) work per byte. Where it pays off most: …
Where it helps less: one-shot matches against unique inputs (cache never warms), or simple exact-match patterns (NFA is already fast, few states to track). The architecture is sound. The main lever now is tuning — the 8MB budget, and potentially pattern-specific heuristics for when lazy DFA is worth enabling vs. just running the NFA. |
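The lazy-DFA caching idea being described can be sketched as below. This is a toy model under assumed types: real state sets are not strings, and `lazyDFA`, `nfaStep`, and `step` are hypothetical names, but the memoization shape is the point.

```go
package main

import "fmt"

type stateSet string // stand-in key for a set of NFA states

type dfaKey struct {
	from stateSet
	b    byte
}

// lazyDFA caches (stateSet, byte) -> nextStateSet transitions.
// On a cache miss it falls back to the NFA step and memoizes the
// result, so repeated inputs run at DFA speed.
type lazyDFA struct {
	cache   map[dfaKey]stateSet
	nfaStep func(stateSet, byte) stateSet
	misses  int
}

func (d *lazyDFA) step(from stateSet, b byte) stateSet {
	k := dfaKey{from, b}
	if next, ok := d.cache[k]; ok {
		return next // warm cache: one map lookup, no NFA work
	}
	d.misses++
	next := d.nfaStep(from, b) // cold: do the set-of-states step once
	d.cache[k] = next
	return next
}

func main() {
	d := &lazyDFA{
		cache:   map[dfaKey]stateSet{},
		nfaStep: func(s stateSet, b byte) stateSet { return s + stateSet(b) },
	}
	for i := 0; i < 3; i++ {
		d.step("start", 'a') // only the first call hits the NFA
	}
	fmt.Println(d.misses) // 1
}
```

This also makes the memory trade-off above concrete: the cache is the 8MB budget, and the one-time miss per (stateSet, byte) pair is the build-on-first-visit overhead.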
|
But none of this work is in this patch. It's only to point out that we should not sweat 2%, since we can smash it. |
|
OK, I got it, I had to do some hand-tracing. The closureRepGen just tells you whether you've seen this or not in this particular closure. I think some of the commentary is misleading. Anyhow, sorry for persistently lagging behind you, I have a bunch of boring year-end clerical work occupying bandwidth. I did drop a couple suggestions in the N-1 review and sometime probably before this time tomorrow I'll finish my pass through. Unless I find something else that bothers me, I guess I'm bought in on the core idea of this PR. But wow, so many commits. BTW you're still going to have a tough time selling lazy DFAs because, as I've said before, building DFAs once per thread is horribly wasteful of memory. In my experience with Quamina's predecessor, basically all the production deployments were multi-threaded. Need to get this landed for me to make any progress on DFA budgeting because every freaking piece of nontrivial FA-generating code needs to be refactored to pass a MemoryHeadroom counter back and forth. |
I don't understand Canadian :)
Oh, that's in there. I do a lot of sloppy ones to see if an idea is worth exploring. I have the memory footprint one already, but it needs the good NFA and eager DFA first. The problem naturally arises when you try to do a memory budget. You're right to insist on clean patches (do squash-and-merge), but I check them out with messy ones before polishing a turd. It's also OK to insist that the main branch is pristine. But the natural result will be PRs and branches that iterate way faster, unless people don't share their work. (Ed: the squash and merge option should be an option in the green button at the bottom, that makes this PR one commit in Quamina's git history) |
timbray
left a comment
There was a problem hiding this comment.
OK, I'm done, if you could glance at the non-resolved review issues.
|
OK, these all seem reasonable. I'll get them in the morning. For the lazy DFA design, you can get it to be a global cache, and this is better in some ways. Once one goroutine discovers an optimization, everyone gets it. I hacked it out in that branch I would not submit as a PR. I think there would need to be an intermediate stage where the lazy DFA is wastefully per-goroutine (as you point out). But this is just an investigation: can you get it faster by doing this? I don't know for sure because the NFA is getting better, but I suspect so. Once that is proven, then there's a tricky global cache management problem. I think it's not possible to aim for the end state right away, since it will be unreviewable. But that's where my head is. It has to be in distinct steps: first verify that the DFA actually wins while wasting memory, then see if it still wins with synchronization wrappers. |
- Move closureBuffers type declaration above first use, add field comments
- Add comment documenting order-dependent fieldTransitions comparison
- Rename closureForStateWithBufs → closureForState (test helper → closureForStateNoBufs)
- Rename flattenEpsilonTargets/flattenCollect → simplifySplices/simplifyCollect
- Rename closureRepGen → closureGen (symmetrical with lastVisitedGen)
- Update dedup test to use overlapping fieldTransitions in different order

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Here's what dedup_fix does relative to main:
Problem: During FA merges (especially with shell-style * patterns),
the same smallTable can end up referenced by multiple faState
nodes. When epsilon closures are computed, these duplicate table
pointers cause the same logical state to appear multiple times,
leading to redundant work during NFA traversal.
Solution — table-pointer dedup in epsilon closure
(epsi_closure.go): seen *smallTables are now tracked
alongside the existing closureSet (which is keyed by *faState).
As each state is added to the epsilon
closure, it checks whether that state's *smallTable has already
been seen. If so, the state is skipped as a duplicate. The closure
itself is smaller, so every downstream consumer benefits.
Consequence — removal of runtime dedup in traverseNFA (nfa.go):
that runtime dedup was
a safety valve for "toxically-complex regexps" where epsilon
loops caused huge nextStates buildups. With duplicates eliminated
at closure-computation time, this is no longer needed.
In short: instead of deduplicating at traversal time
(expensive, per-event), this branch deduplicates at build time
(once, when patterns are added) by recognizing that distinct
faState pointers sharing the same smallTable are logically
identical.