Description
Problem
i found an interesting limitation of tree-sitter while parsing long strings
"long" as in "longer than tree-sitter's chunk size"
i noticed this while parsing torrent files with tree-sitter-bencode
large torrent files can contain a large pieces field
(which is a bytestring of multiple sha1 hashes)
in my case, the pieces field has 41700 bytes
bencode strings are length-prefixed, so the source looks like 41700:xxx...
currently scanner.c tries to loop over the 41700 bytes
but 1543 bytes before the token end, lexer->eof(lexer) returns true
meaning: scanner.c has hit the end of this parse chunk (not end of file)
so the scan function has to return false
meaning: it failed to parse the STRING_VALUE token
the problem is that the following call to the scan function tries to parse the STRING_LENGTH_PREFIX token
when it should continue parsing the STRING_VALUE token
tree_sitter_bencode_external_scanner_scan
bool tree_sitter_bencode_external_scanner_scan(
void *payload,
TSLexer *lexer,
const bool *valid_symbols
) {
if (valid_symbols[STRING_LENGTH_PREFIX]) {
if (lexer->lookahead < 48 || lexer->lookahead > 57) {
// This also covers EOF, because lookahead will be 0
return false;
}
int len = lexer->lookahead - 48;
lexer->advance(lexer, false);
while (lexer->lookahead >= 48 && lexer->lookahead <= 57) {
int digit = lexer->lookahead - 48;
len *= 10;
len += digit;
lexer->advance(lexer, false);
}
*(int*)payload = len;
lexer->result_symbol = STRING_LENGTH_PREFIX;
return true;
} else if (valid_symbols[STRING_VALUE]) {
int len = *(int*)payload;
if (len >= 0) {
while (len > 0) {
if (lexer->eof(lexer)) {
return false;
}
lexer->advance(lexer, false);
--len;
}
lexer->result_symbol = STRING_VALUE;
return true;
} else {
// invalid length; we should never reach this case
return false;
}
} else {
return false;
}
}
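for reference, the int payload above is carried between scan calls by the usual external scanner lifecycle functions; roughly like this (a minimal sketch, not a verbatim copy of my scanner.c):

#include <stdlib.h>
#include <string.h>

void *tree_sitter_bencode_external_scanner_create(void) {
  // the payload stores the pending string length between scan calls
  return calloc(1, sizeof(int));
}

unsigned tree_sitter_bencode_external_scanner_serialize(void *payload, char *buffer) {
  memcpy(buffer, payload, sizeof(int));
  return sizeof(int);
}

void tree_sitter_bencode_external_scanner_deserialize(void *payload, const char *buffer, unsigned length) {
  if (length == sizeof(int)) {
    memcpy(payload, buffer, sizeof(int));
  } else {
    *(int *)payload = -1;  // no serialized state yet
  }
}

void tree_sitter_bencode_external_scanner_destroy(void *payload) {
  free(payload);
}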
minimal reproduction
https://github.com/milahu/tree-sitter-longstringscanner
git clone https://github.com/milahu/tree-sitter-longstringscanner
cd tree-sitter-longstringscanner
./test.py | grep -C5 -F 'STRING_VALUE 78: len = 6000, lexer->eof(lexer) == true'
actual output
$ ./test.py | grep -C5 -F 'STRING_VALUE 78: len = 6000, lexer->eof(lexer) == true'
STRING_VALUE 79: len = 6005, lexer->lookahead = \x3f
STRING_VALUE 79: len = 6004, lexer->lookahead = \xff
STRING_VALUE 79: len = 6003, lexer->lookahead = \x11
STRING_VALUE 79: len = 6002, lexer->lookahead = \x5f
STRING_VALUE 79: len = 6001, lexer->lookahead = \xff
STRING_VALUE 78: len = 6000, lexer->eof(lexer) == true
tree_sitter_longstringscanner_external_scanner_deserialize: len = 6000
STRING_LENGTH_PREFIX 51: lexer->lookahead = \x3a
tree_sitter_longstringscanner_external_scanner_deserialize: len = 100000
STRING_LENGTH_PREFIX 51: lexer->lookahead = \xff
tree_sitter_longstringscanner_external_scanner_deserialize: len = 100000
the test input is this string
randbytes_50 = b'\xd8\xba\xa5A\xc6\xa6Z1m\x94\xc8\xe0"_\x83\xcd\xd7\xb1fY\xf6>Q6LH<,\x9d\xb1\x8f\xb4\xe8\xa2/NX\x19#,\xdbvp\xab\x19?\xfe\x11_\x9b'
src = b"100000:" + (randbytes_50 * 2000)
so the parser fails after 100000 - 6000 = 94000 bytes
surprisingly, this only fails above some level of randomness (entropy)
(in my example 50 random bytes are repeated)
and this works on non-random strings like
src = b"100000:" + (b"\x00" * 100_000)
src = b"100000:" + (random.randbytes(40) * 2500)
also, such non-random strings can be much longer, for example 10 MB
src = b"10000000:" + (b"\x00" * 10_000_000)
so apparently, tree-sitter uses some compression on the input buffer
which makes long strings work below some level of entropy
Expected behavior
possible workaround:
have only one external token in scanner.c
and continue parsing the token in consecutive calls to the scan function
but having only one token is too limiting...
that would mean i get only one token for my bencode string 12345:xxx...
but i want 3 tokens: 12345 and : and xxx...
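one concrete way to do the "continue in consecutive calls" part, without core changes, would be to emit the value in chunks and let the grammar glue them together with repeat1. a hedged, untested sketch (STRING_CHUNK is a hypothetical token name, not part of my current grammar):

// when eof interrupts the value, accept the bytes read so far as one
// chunk token and stash the remaining length in the payload
if (valid_symbols[STRING_CHUNK]) {
  int len = *(int*)payload;
  if (len <= 0) return false;  // nothing left: ends the repeat1
  bool consumed = false;
  while (len > 0 && !lexer->eof(lexer)) {
    lexer->advance(lexer, false);
    --len;
    consumed = true;
  }
  *(int*)payload = len;  // remaining bytes, picked up by the next call
  if (!consumed) return false;
  lexer->result_symbol = STRING_CHUNK;
  return true;
}

the obvious downside is that the string value becomes a run of chunk nodes instead of one node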
i asked chatGPT about this problem, and it found 2 possible solutions
resumable tokens
so this is a limitation of the tree-sitter parsing system. how could i modify the tree-sitter core to allow "something like this" in my scanner.c? i would need a way to tell tree-sitter "i am currently processing token type X, let me store this in the parser state, and on the next call to scan, i want to continue parsing this token"
Exactly — you’ve nailed it.
Tree-sitter’s design assumes external scanners produce whole tokens in a single call. There’s no “yield and resume inside a token.” That’s a conscious trade-off: it keeps the parsing algorithm deterministic and incremental, but it means your scanner.c can’t say “I’ve read 15k bytes of this 40k token, please call me back later to finish it.”
What would it take to support resumable tokens?
You’d need to modify the Tree-sitter core so that:
1. A scanner can declare it’s “in progress.”
That means introducing a new return type (not just true/false) to say:
- ACCEPT (token fully matched)
- REJECT (no token here)
- INCOMPLETE (partial match, continue later) ← new
2. The parser stores that partial state.
Right now, Tree-sitter serializes only the scanner payload between parse attempts.
To resume, you’d also need to store:
- Which external token was being matched (STRING_VALUE)
- The fact that the scanner is mid-match and must be re-invoked at the same input position.
3. The parser calls back into the scanner on resume.
Instead of starting a fresh token match, Tree-sitter would need to re-enter the scan function, flagged as “continuation mode,” so you can keep consuming bytes until the token is done.
Why Tree-sitter doesn’t do this
- Incremental reparsing: Tree-sitter often discards subtrees and re-lexes parts of the buffer. If a token can span 40k bytes and be partially consumed, then reparse boundaries inside that token become extremely hard to reason about.
- Performance: The scanner would need to be serialized mid-token and deserialized later, possibly many times, for large strings. That means serializing huge state objects or maintaining internal pointers into the source buffer.
- Determinism: The GLR-style parsing algorithm depends on knowing where tokens begin and end at each step. Allowing “floating” incomplete tokens complicates the parse table logic.
In short: it’s not just missing, it’s intentionally excluded from the design.
What you could do if you wanted to hack Tree-sitter
If you really wanted resumable tokens, you’d need to:
- Extend TSLexer with something like lexer->yield() to signal “pause here.”
- Extend the external scanner API so scan can return an INCOMPLETE result.
- Teach the parser to treat “incomplete external token” as a valid stack element, serialize it, and ensure that only that token is requested on resume.
- Make sure incremental reparsing knows how to re-enter mid-token (the hardest part, since Tree-sitter is built to assume tokens are atomic).
That’s essentially a redesign: you’re asking Tree-sitter to act like a streaming lexer for certain tokens, which cuts against its “tokens are atomic” philosophy.
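as a rough picture, the touched API surface might look like this (purely hypothetical; none of these types exist in Tree-sitter today):

// hypothetical three-state scan result replacing the current bool
typedef enum {
  TS_SCAN_REJECT,      // no token here (today: return false)
  TS_SCAN_ACCEPT,      // token fully matched (today: return true)
  TS_SCAN_INCOMPLETE,  // partial match, re-invoke the scanner later <- new
} TSExternalScanResult;

// the scan callback would return that result, and the parser would have
// to remember which external symbol was mid-match across invocations
TSExternalScanResult scan(void *payload, TSLexer *lexer, const bool *valid_symbols);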
Realistic alternative
Rather than forking Tree-sitter’s core, most people solve this by:
- Keeping external tokens small and bounded (prefixes, keywords, punctuation).
- Letting the grammar describe large sequences (token.immediate(/.*/) style).
- Validating cross-field constraints (like “string length matches prefix”) in a post-parse validator, not in the parser itself (sketched below).
That’s the Tree-sitter way: parse structure fast, check semantics later.
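for the bencode case, such a post-parse check can be done with the regular Tree-sitter C API; a minimal sketch, assuming the grammar exposes a string node whose first two named children are the length prefix and the value (the node shape is an assumption):

#include <stdlib.h>
#include <tree_sitter/api.h>

// verify that the decimal prefix equals the byte length of the value
static bool check_string_node(TSNode string_node, const char *src) {
  TSNode prefix = ts_node_named_child(string_node, 0);  // e.g. "41700"
  TSNode value  = ts_node_named_child(string_node, 1);  // the raw bytes
  if (ts_node_is_null(prefix) || ts_node_is_null(value)) return false;
  long declared = strtol(src + ts_node_start_byte(prefix), NULL, 10);
  long actual   = (long)(ts_node_end_byte(value) - ts_node_start_byte(value));
  return declared == actual;
}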
⚖️ So the direct answer:
Yes, you could modify Tree-sitter core to allow resumable tokens by introducing a third return state (INCOMPLETE) and extending parser state serialization. But it would be a deep, risky change, because Tree-sitter’s performance and incremental guarantees assume tokens are indivisible.
multi-token emission
what would be necessary in tree-sitter core to allow an external scanner function to yield multiple token types from one scan call, possibly across multiple input buffer chunks?
i assume this would be simpler to implement than the "resumable tokens" idea
You’re thinking in exactly the right direction:
there are really two different extensions you could imagine adding to Tree-sitter core:
1. Current model (what exists now)
- scan() is called with valid_symbols[].
- You can return exactly one token (lexer->result_symbol = …; return true;).
- If you can’t complete a token, you return false.
- There is no way to emit “token A, then immediately token B”.
2. What you’re asking for: multi-token emission
That would mean letting scan() emit a sequence of tokens in one pass, e.g.:
STRING_LENGTH_PREFIX ":" STRING_VALUE
while still exposing them separately in the grammar.
What would be necessary in Tree-sitter core?
1. Extend the TSLexer API
Currently you set:

lexer->result_symbol = SOME_TOKEN;
return true;

That’s a single output. To support multiple tokens, you’d need something like:

lexer->emit(SOME_TOKEN);
lexer->emit(OTHER_TOKEN);
return true;

where emit pushes multiple results onto a queue.

2. Change parser expectations
The parser is designed to receive one token per external scanner call.
It would need to handle a batch of tokens emitted by the scanner and feed them back into the parse stack one by one.

3. Serialization changes
If you stop mid-batch because of eof(), the parser would need to serialize:
- Which token in the batch was last emitted, and
- The scanner’s state.
So on resume it knows “I already gave the prefix, colon is next.”

4. Incremental reparsing
Multi-token batches complicate incremental reparse:
- If a subtree is invalidated, Tree-sitter must decide whether to re-ask the scanner for the whole batch, or resume partway through.
- That means the parser’s notion of “token boundaries” would no longer be strictly one-to-one with scan() calls.
Why this is simpler than resumable tokens
- With resumable tokens, the parser has to treat “half-consumed tokens” as valid parse stack elements → very invasive.
- With multi-token emission, you still have atomic tokens (prefix, colon, value) in the syntax tree. The only change is: the scanner can provide more than one at a time.
- Parser logic changes are narrower: instead of “exactly one token from scanner,” allow “queue of tokens from scanner.”
Minimal core changes (conceptual)
1. Add a per-scanner token queue to TSLexer.
2. Change the scan() signature so it can fill that queue.
3. Parser loop:
- If the queue is non-empty, pop a token from the queue.
- Else, call scan() to refill.
4. Extend serialization so the “queue head index” is saved/restored.
Trade-offs
Pros:
- You can keep prefix, colon, value as distinct syntax nodes.
- You don’t need resumable mid-token state.
- Cleaner mental model: scanner acts like a mini-lexer that can yield a run of tokens.
Cons:
- Still a non-trivial patch to Tree-sitter core.
- Every consumer of TSLexer (bindings, parser runtime, language grammars) needs updating to understand multi-token output.
- Incremental reparsing logic gets more complex around partial batches.
✅ So, to answer your question:
Yes, supporting multi-token emission would be simpler than resumable tokens, but it still requires core modifications:
- extend TSLexer to allow a queue of tokens,
- adjust the parser to consume from that queue,
- serialize the queue state between parse chunks.
Every consumer of TSLexer (bindings, parser runtime, language grammars) needs updating to understand multi-token output.
it should be possible to create a backward-compatible API by supporting both single-token emission via lexer->result_symbol = SOME_TOKEN; and multi-token emission via lexer->emit(SOME_TOKEN); lexer->emit(OTHER_TOKEN);
Yes — exactly 👍
If you wanted to prototype this, the cleanest path is to layer multi-token emission on top of the existing result_symbol API, so that old scanners keep working unchanged.
Here’s what such a backward-compatible design could look like:
1. Extend TSLexer with a small queue

typedef struct {
  uint16_t buffer[TS_MAX_MULTI_TOKENS];
  uint8_t count;
  uint8_t head;
} TokenQueue;

struct TSLexer {
  ...
  uint16_t result_symbol;   // legacy single-token mode
  TokenQueue token_queue;   // new multi-token mode
};
2. New API for scanners
void ts_lexer_emit(TSLexer *lexer, uint16_t symbol) {
  if (lexer->token_queue.count < TS_MAX_MULTI_TOKENS) {
    uint8_t pos = (lexer->token_queue.head + lexer->token_queue.count) % TS_MAX_MULTI_TOKENS;
    lexer->token_queue.buffer[pos] = symbol;
    lexer->token_queue.count++;
  }
}
Old code still does lexer->result_symbol = TOKEN; return true;
New code can do:

ts_lexer_emit(lexer, STRING_LENGTH_PREFIX);
ts_lexer_emit(lexer, COLON);
ts_lexer_emit(lexer, STRING_VALUE);
return true;
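applied to the bencode case, a scan function using this hypothetical API could consume the whole 41700:xxx... run in one pass and still hand back three distinct tokens. a sketch, assuming a COLON token exists in the grammar and that each emit records the current position as that token's end:

// consume "<digits>:<bytes>" in one scan call, emitting three tokens
if (lexer->lookahead < '0' || lexer->lookahead > '9') return false;
int len = 0;
while (lexer->lookahead >= '0' && lexer->lookahead <= '9') {
  len = len * 10 + (lexer->lookahead - '0');
  lexer->advance(lexer, false);
}
ts_lexer_emit(lexer, STRING_LENGTH_PREFIX);
if (lexer->lookahead != ':') return false;
lexer->advance(lexer, false);
ts_lexer_emit(lexer, COLON);
while (len > 0 && !lexer->eof(lexer)) {
  lexer->advance(lexer, false);
  --len;
}
if (len > 0) return false;  // eof mid-value still needs the serialization story below
ts_lexer_emit(lexer, STRING_VALUE);
return true;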
3. Parser changes
- Before calling scan(), the parser first checks if token_queue.count > 0.
- If yes, pop one symbol and hand it to the parse table.
- If no, call scan() like today.
- If scan() fills the queue, tokens are fed out one by one in subsequent parser steps (sketched below).
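a conceptual sketch of that loop (not actual Tree-sitter internals; Scanner is a stand-in for the runtime's external scanner handle):

// pop queued symbols first; fall back to a fresh scan() call
static bool next_external_token(TSLexer *lexer, Scanner *s,
                                const bool *valid_symbols, uint16_t *out) {
  TokenQueue *q = &lexer->token_queue;
  if (q->count == 0) {
    if (!s->scan(s->payload, lexer, valid_symbols)) return false;
    if (q->count == 0) {
      *out = lexer->result_symbol;  // legacy single-token path
      return true;
    }
  }
  *out = q->buffer[q->head];  // multi-token path
  q->head = (q->head + 1) % TS_MAX_MULTI_TOKENS;
  q->count--;
  return true;
}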
4. Serialization / deserialization
In addition to serializing the scanner payload, you serialize:
- token_queue.head
- token_queue.count
- The contents of the queue (typically a few uint16_ts).
That way, if parsing stops mid-batch, Tree-sitter can resume with the remaining tokens.
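concretely, this could piggy-back on the existing serialize callback; a sketch (scanner_serialize stands in for the grammar's own payload serializer):

// append the queue state to the scanner's serialized payload bytes
unsigned serialize_with_queue(TSLexer *lexer, void *payload, char *buffer) {
  unsigned n = scanner_serialize(payload, buffer);  // existing payload bytes
  TokenQueue *q = &lexer->token_queue;
  buffer[n++] = (char)q->head;
  buffer[n++] = (char)q->count;
  memcpy(buffer + n, q->buffer, sizeof q->buffer);
  return n + (unsigned)sizeof q->buffer;
}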
5. Backward compatibility
- If a scanner only sets result_symbol, everything works as before.
- If it calls ts_lexer_emit(), the parser ignores result_symbol and uses the queue.
- That way no existing grammar/scanner breaks.
6. Incremental parsing considerations
- From the parser’s perspective, there’s still no such thing as a “half-token”: each symbol in the queue is atomic.
- If a parse is invalidated in the middle of a batch, the queue state gets reset when Tree-sitter re-invokes the scanner — just like today, but now with a queue to refill.
✅ With this design:
- You get to emit multiple logical tokens (prefix, colon, value).
- You don’t have to merge them into one external token.
- Old scanners keep working.
- Core changes are localized: add a token queue to TSLexer, teach the parser loop to consume it, add serialization for it.