parse multiple long tokens in scanner.c #4857

@milahu

Description

Problem

i found an interesting limitation of tree-sitter while parsing long strings
"long" as in "longer than tree-sitter's chunk size"

i noticed this while parsing torrent files with tree-sitter-bencode
large torrent files can contain a large pieces field
(a bytestring of concatenated 20-byte SHA-1 hashes)
in my case, the pieces field has 41700 bytes
bencode strings are length-prefixed, so the source looks like 41700:xxx...

currently scanner.c tries to loop over the 41700 bytes
but 1543 bytes before the token end, lexer->eof(lexer) returns true
meaning: scanner.c has hit the end of this parse chunk (not end of file)
so the scan function has to return false
meaning: it failed to parse the STRING_VALUE token

the problem is that the following call to the scan function
tries to parse the STRING_LENGTH_PREFIX token
when it should continue parsing the STRING_VALUE token

#include <tree_sitter/parser.h>

// external token types; the order must match the grammar's "externals" list
enum TokenType {
  STRING_LENGTH_PREFIX,
  STRING_VALUE,
};

bool tree_sitter_bencode_external_scanner_scan(
  void *payload,
  TSLexer *lexer,
  const bool *valid_symbols
) {
    if (valid_symbols[STRING_LENGTH_PREFIX]) {
        if (lexer->lookahead < '0' || lexer->lookahead > '9') {
            // this also covers EOF, because lookahead is 0 at EOF
            return false;
        }
        // accumulate the decimal length prefix, e.g. "41700"
        int len = lexer->lookahead - '0';
        lexer->advance(lexer, false);
        while (lexer->lookahead >= '0' && lexer->lookahead <= '9') {
            int digit = lexer->lookahead - '0';
            len *= 10;
            len += digit;
            lexer->advance(lexer, false);
        }
        *(int*)payload = len;
        lexer->result_symbol = STRING_LENGTH_PREFIX;
        return true;
    } else if (valid_symbols[STRING_VALUE]) {
        int len = *(int*)payload;
        if (len >= 0) {
            while (len > 0) {
                if (lexer->eof(lexer)) {
                    // end of the current parse chunk, not necessarily
                    // end of file: this is where long strings fail
                    return false;
                }
                lexer->advance(lexer, false);
                --len;
            }
            lexer->result_symbol = STRING_VALUE;
            return true;
        } else {
            // invalid length; we should never reach this case
            return false;
        }
    } else {
        return false;
    }
}

minimal reproduction

https://github.com/milahu/tree-sitter-longstringscanner

git clone https://github.com/milahu/tree-sitter-longstringscanner
cd tree-sitter-longstringscanner
./test.py | grep -C5 -F 'STRING_VALUE 78: len = 6000, lexer->eof(lexer) == true'

actual output

$ ./test.py | grep -C5 -F 'STRING_VALUE 78: len = 6000, lexer->eof(lexer) == true'
STRING_VALUE 79: len = 6005, lexer->lookahead = \x3f
STRING_VALUE 79: len = 6004, lexer->lookahead = \xff
STRING_VALUE 79: len = 6003, lexer->lookahead = \x11
STRING_VALUE 79: len = 6002, lexer->lookahead = \x5f
STRING_VALUE 79: len = 6001, lexer->lookahead = \xff
STRING_VALUE 78: len = 6000, lexer->eof(lexer) == true
tree_sitter_longstringscanner_external_scanner_deserialize: len = 6000
STRING_LENGTH_PREFIX 51: lexer->lookahead = \x3a
tree_sitter_longstringscanner_external_scanner_deserialize: len = 100000
STRING_LENGTH_PREFIX 51: lexer->lookahead = \xff
tree_sitter_longstringscanner_external_scanner_deserialize: len = 100000

the test input is this string

randbytes_50 = b'\xd8\xba\xa5A\xc6\xa6Z1m\x94\xc8\xe0"_\x83\xcd\xd7\xb1fY\xf6>Q6LH<,\x9d\xb1\x8f\xb4\xe8\xa2/NX\x19#,\xdbvp\xab\x19?\xfe\x11_\x9b'
src = b"100000:" + (randbytes_50 * 2000)

so the parser fails after 100000 - 6000 = 94000 bytes

surprisingly, this only fails above some level of randomness (entropy)
(in my example 50 random bytes are repeated)
and this works on non-random strings like

src = b"100000:" + (b"\x00" * 100_000)
src = b"100000:" + (random.randbytes(40) * 2500)

also, such non-random strings can be much longer, for example 10 MB

src = b"10000000:" + (b"\x00" * 10_000_000)

so apparently, tree-sitter uses some compression on the input buffer
which makes long strings work below some level of entropy

Expected behavior

possible workaround:
have only one external token in scanner.c
and continue parsing the token in consecutive calls to the scan function

but having only one token is too limiting...
that would mean i get only one token for my bencode string 12345:xxx...
but i want 3 tokens: 12345 and : and xxx...

i asked ChatGPT about this problem, and it suggested 2 possible solutions

resumable tokens

so this is a limitation of the tree-sitter parsing system. how could i modify the tree-sitter core to allow "something like this" in my scanner.c? i would need a way to tell tree-sitter "i am currently processing token type X, let me store this in the parser state, and on the next call to scan, i want to continue parsing this token"

Exactly — you’ve nailed it.

Tree-sitter’s design assumes external scanners produce whole tokens in a single call. There’s no “yield and resume inside a token.” That’s a conscious trade-off: it keeps the parsing algorithm deterministic and incremental, but it means your scanner.c can’t say “I’ve read 15k bytes of this 40k token, please call me back later to finish it.”


What would it take to support resumable tokens?

You’d need to modify the Tree-sitter core so that:

  1. A scanner can declare it’s “in progress.”
    That means introducing a new return type (not just true/false) to say (sketched after this list):

    • ACCEPT (token fully matched)
    • REJECT (no token here)
    • INCOMPLETE (partial match, continue later) ← new
  2. The parser stores that partial state.

    • Right now, Tree-sitter serializes only the scanner payload between parse attempts.

    • To resume, you’d also need to store:

      • Which external token was being matched (STRING_VALUE)
      • The fact that the scanner is mid-match and must be re-invoked at the same input position.
  3. The parser calls back into the scanner on resume.

    • Instead of starting a fresh token match, Tree-sitter would need to re-enter the scan function, flagged as “continuation mode,” so you can keep consuming bytes until the token is done.
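
To make the new return type from item 1 concrete, here is a minimal sketch of what it could look like; TSScanResult and its variants are hypothetical names, not part of tree-sitter's actual API:

typedef enum {
  TS_SCAN_REJECT,      // no token here (today's "return false")
  TS_SCAN_ACCEPT,      // token fully matched (today's "return true")
  TS_SCAN_INCOMPLETE,  // partial match; re-invoke at the same position (new)
} TSScanResult;

// the scan function would then return TSScanResult instead of bool:
// TSScanResult scan(void *payload, TSLexer *lexer, const bool *valid_symbols);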

Why Tree-sitter doesn’t do this

  • Incremental reparsing: Tree-sitter often discards subtrees and re-lexes parts of the buffer. If a token can span 40k bytes and be partially consumed, then reparse boundaries inside that token become extremely hard to reason about.
  • Performance: The scanner would need to be serialized mid-token and deserialized later, possibly many times, for large strings. That means serializing huge state objects or maintaining internal pointers into the source buffer.
  • Determinism: The GLR-style parsing algorithm depends on knowing where tokens begin and end at each step. Allowing “floating” incomplete tokens complicates the parse table logic.

In short: it’s not just missing, it’s intentionally excluded from the design.


What you could do if you wanted to hack Tree-sitter

If you really wanted resumable tokens, you’d need to:

  • Extend TSLexer with something like lexer->yield() to signal “pause here.”
  • Extend the external scanner API so scan can return an INCOMPLETE result.
  • Teach the parser to treat “incomplete external token” as a valid stack element, serialize it, and ensure that only that token is requested on resume.
  • Make sure incremental reparsing knows how to re-enter mid-token (the hardest part, since Tree-sitter is built to assume tokens are atomic).

That’s essentially a redesign: you’re asking Tree-sitter to act like a streaming lexer for certain tokens, which cuts against its “tokens are atomic” philosophy.


Realistic alternative

Rather than forking Tree-sitter’s core, most people solve this by:

  • Keeping external tokens small and bounded (prefixes, keywords, punctuation).
  • Letting the grammar describe large sequences (token.immediate(/.*/) style).
  • Validating cross-field constraints (like “string length matches prefix”) in a post-parse validator, not in the parser itself (see the sketch below).

That’s the Tree-sitter way: parse structure fast, check semantics later.
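
To illustrate the last bullet, here is a hedged sketch of such a post-parse validator built on tree-sitter's C API; the node shape (a string node whose first named child is the length prefix and whose second is the value) is an assumption for illustration, not the actual tree-sitter-bencode grammar:

#include <stdbool.h>
#include <stdlib.h>
#include <tree_sitter/api.h>

// check that a length-prefixed string's declared length matches the
// actual byte length of its value node
static bool string_length_matches(TSNode string_node, const char *src) {
  TSNode prefix = ts_node_named_child(string_node, 0);  // assumed: length prefix
  TSNode value  = ts_node_named_child(string_node, 1);  // assumed: string value
  // read the decimal prefix straight from the source; strtol stops at ':'
  long declared = strtol(src + ts_node_start_byte(prefix), NULL, 10);
  long actual = (long)(ts_node_end_byte(value) - ts_node_start_byte(value));
  return declared == actual;
}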


⚖️ So the direct answer:
Yes, you could modify Tree-sitter core to allow resumable tokens by introducing a third return state (INCOMPLETE) and extending parser state serialization. But it would be a deep, risky change, because Tree-sitter’s performance and incremental guarantees assume tokens are indivisible.

multi-token emission

what would be necessary in tree-sitter core to allow an external scanner function to yield multiple token types from one scan call, possibly across multiple input buffer chunks?

i assume this would be simpler to implement than the "resumable tokens" idea

You’re thinking in exactly the right direction:
it helps to contrast what exists now with the extension you could imagine adding to Tree-sitter core:


1. Current model (what exists now)

  • scan() is called with valid_symbols[].
  • You can return exactly one token (lexer->result_symbol = …; return true;).
  • If you can’t complete a token, you return false.
  • There is no way to emit “token A, then immediately token B” (a distilled example of this single-token contract follows below).
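
For reference, the distilled example mentioned above — a scanner under the current contract, with a hypothetical token type:

#include <tree_sitter/parser.h>

enum TokenType { COLON };  // hypothetical external token

bool tree_sitter_example_external_scanner_scan(
  void *payload, TSLexer *lexer, const bool *valid_symbols
) {
  if (valid_symbols[COLON] && lexer->lookahead == ':') {
    lexer->advance(lexer, false);
    lexer->result_symbol = COLON;  // exactly one token per scan call
    return true;
  }
  return false;  // no token; there is no third option
}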

2. What you’re asking for: multi-token emission

That would mean letting scan() emit a sequence of tokens in one pass, e.g.:

STRING_LENGTH_PREFIX
":"
STRING_VALUE

while still exposing them separately in the grammar.


What would be necessary in Tree-sitter core?

  1. Extend TSLexer API
    Currently you set:

    lexer->result_symbol = SOME_TOKEN;
    return true;

    That’s a single output.
    To support multiple tokens, you’d need something like:

    lexer->emit(SOME_TOKEN);
    lexer->emit(OTHER_TOKEN);
    return true;

    where emit pushes multiple results onto a queue.

  2. Change parser expectations
    The parser is designed to receive one token per external scanner call.
    It would need to handle a batch of tokens emitted by the scanner and feed them back into the parse stack one by one.

  3. Serialization changes
    If you stop mid-batch because of eof(), the parser would need to serialize:

    • Which token in the batch was last emitted, and
    • The scanner’s state.
      So on resume it knows “I already gave the prefix, colon is next.”
  4. Incremental reparsing
    Multi-token batches complicate incremental reparse:

    • If a subtree is invalidated, Tree-sitter must decide whether to re-ask the scanner for the whole batch, or resume partway through.
    • That means the parser’s notion of “token boundaries” would no longer be strictly one-to-one with scan() calls.

Why this is simpler than resumable tokens

  • With resumable tokens, the parser has to treat “half-consumed tokens” as valid parse stack elements → very invasive.
  • With multi-token emission, you still have atomic tokens (prefix, colon, value) in the syntax tree. The only change is: the scanner can provide more than one at a time.
  • Parser logic changes are narrower: instead of “exactly one token from scanner,” allow “queue of tokens from scanner.”

Minimal core changes (conceptual)

  • Add a per-scanner token queue to TSLexer.

  • Change scan() signature so it can fill that queue.

  • Parser loop:

    • If queue non-empty, pop token from queue.
    • Else, call scan() to refill.
  • Extend serialization so the “queue head index” is saved/restored.


Trade-offs

  • Pros:

    • You can keep prefix, colon, value as distinct syntax nodes.
    • You don’t need resumable mid-token state.
    • Cleaner mental model: scanner acts like a mini-lexer that can yield a run of tokens.
  • Cons:

    • Still a non-trivial patch to Tree-sitter core.
    • Every consumer of TSLexer (bindings, parser runtime, language grammars) needs updating to understand multi-token output.
    • Incremental reparsing logic gets more complex around partial batches.

✅ So, to answer your question:
Yes, supporting multi-token emission would be simpler than resumable tokens, but it still requires core modifications:

  • extend TSLexer to allow a queue of tokens,
  • adjust the parser to consume from that queue,
  • serialize the queue state between parse chunks.

Every consumer of TSLexer (bindings, parser runtime, language grammars) needs updating to understand multi-token output.

it should be possible to create a backward-compatible API by supporting both single-token emission via lexer->result_symbol = SOME_TOKEN; and multi-token emission via lexer->emit(SOME_TOKEN); lexer->emit(OTHER_TOKEN);

Yes — exactly 👍

If you wanted to prototype this, the cleanest path is to layer multi-token emission on top of the existing result_symbol API, so that old scanners keep working unchanged.

Here’s what such a backward-compatible design could look like:


1. Extend TSLexer with a small queue

#include <stdint.h>

#define TS_MAX_MULTI_TOKENS 8  // fixed queue capacity; 8 is an arbitrary choice

typedef struct {
  uint16_t buffer[TS_MAX_MULTI_TOKENS];
  uint8_t count;  // number of queued symbols
  uint8_t head;   // index of the next symbol to pop
} TokenQueue;

struct TSLexer {
  ...
  uint16_t result_symbol;  // legacy single-token mode
  TokenQueue token_queue;  // new multi-token mode
};

2. New API for scanners

// push a symbol onto the lexer's ring-buffer token queue
void ts_lexer_emit(TSLexer *lexer, uint16_t symbol) {
  if (lexer->token_queue.count < TS_MAX_MULTI_TOKENS) {
    uint8_t pos = (lexer->token_queue.head + lexer->token_queue.count) % TS_MAX_MULTI_TOKENS;
    lexer->token_queue.buffer[pos] = symbol;
    lexer->token_queue.count++;
  }
  // else: queue full — a real implementation would have to report this
}
  • Old code still does lexer->result_symbol = TOKEN; return true;.

  • New code can do:

    ts_lexer_emit(lexer, STRING_LENGTH_PREFIX);
    ts_lexer_emit(lexer, COLON);
    ts_lexer_emit(lexer, STRING_VALUE);
    return true;

3. Parser changes

  • Before calling scan(), the parser first checks if token_queue.count > 0.
  • If yes, pop one symbol and hand it to the parse table.
  • If no, call scan() like today.
  • If scan() fills the queue, tokens are fed out one by one in subsequent parser steps (sketched below).
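
As a small sketch of that consume step, reusing TokenQueue and TS_MAX_MULTI_TOKENS from step 1 (token_queue_pop and take_next_symbol are hypothetical names for parser internals):

// pop the next queued symbol; returns false if the queue is empty
static bool token_queue_pop(TokenQueue *q, uint16_t *symbol) {
  if (q->count == 0) return false;
  *symbol = q->buffer[q->head];
  q->head = (q->head + 1) % TS_MAX_MULTI_TOKENS;
  q->count--;
  return true;
}

// after scan() returned true: prefer queued tokens over the legacy
// result_symbol, so old single-token scanners keep working unchanged
static uint16_t take_next_symbol(TSLexer *lexer) {
  uint16_t symbol;
  if (token_queue_pop(&lexer->token_queue, &symbol)) return symbol;
  return lexer->result_symbol;
}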

4. Serialization / deserialization

  • In addition to serializing the scanner payload, you serialize:

    • token_queue.head
    • token_queue.count
    • The contents of the queue (typically a few uint16_ts).
  • That way, if parsing stops mid-batch, Tree-sitter can resume with the remaining tokens (see the sketch below).
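
A minimal sketch of that, assuming the queue state is appended to the byte buffer that tree-sitter already passes to the external scanner's serialize/deserialize functions (the byte layout here is an arbitrary choice):

#include <stdint.h>
#include <string.h>

// write head, count and the queue contents; returns bytes written
static unsigned token_queue_serialize(const TokenQueue *q, char *buffer) {
  buffer[0] = (char)q->head;
  buffer[1] = (char)q->count;
  memcpy(buffer + 2, q->buffer, sizeof q->buffer);
  return 2 + sizeof q->buffer;
}

// restore the queue; a short or empty buffer yields an empty queue
static void token_queue_deserialize(TokenQueue *q, const char *buffer, unsigned length) {
  if (length < 2 + sizeof q->buffer) {
    q->head = 0;
    q->count = 0;
    return;
  }
  q->head = (uint8_t)buffer[0];
  q->count = (uint8_t)buffer[1];
  memcpy(q->buffer, buffer + 2, sizeof q->buffer);
}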


5. Backward compatibility

  • If a scanner only sets result_symbol, everything works as before.
  • If it calls ts_lexer_emit(), the parser ignores result_symbol and uses the queue.
  • That way no existing grammar/scanner breaks.

6. Incremental parsing considerations

  • From the parser’s perspective, there’s still no such thing as a “half-token”: each symbol in the queue is atomic.
  • If a parse is invalidated in the middle of a batch, the queue state gets reset when Tree-sitter re-invokes the scanner — just like today, but now with a queue to refill.

✅ With this design:

  • You get to emit multiple logical tokens (prefix, colon, value).
  • You don’t have to merge them into one external token.
  • Old scanners keep working.
  • Core changes are localized: add a token queue to TSLexer, teach the parser loop to consume it, add serialization for it.
