parse multiple long tokens in scanner.c #4857

@milahu

Description

Problem

i found an interesting limitation of tree-sitter while parsing long strings
"long" as in "longer than tree-sitter's chunk size"

i noticed this while parsing torrent files with tree-sitter-bencode
large torrent files can contain a large pieces field
(a bytestring of concatenated 20-byte SHA-1 hashes)
in my case, the pieces field has 41700 bytes
bencode strings are length-prefixed, so the source looks like 41700:xxx...

currently scanner.c tries to loop over the 41700 bytes
but 1543 bytes before the token end, lexer->eof(lexer) returns true
meaning: scanner.c has hit the end of this parse chunk (not end of file)
so the scan function has to return false
meaning: it failed to parse the STRING_VALUE token

the problem is that the following call to the scan function
tries to parse the STRING_LENGTH_PREFIX token
when it should continue parsing the STRING_VALUE token

#include <tree_sitter/parser.h>

// external token types; the order must match the grammar's "externals" list
enum TokenType {
  STRING_LENGTH_PREFIX,
  STRING_VALUE,
};

bool tree_sitter_bencode_external_scanner_scan(
  void *payload,
  TSLexer *lexer,
  const bool *valid_symbols
) {
    if (valid_symbols[STRING_LENGTH_PREFIX]) {
        if (lexer->lookahead < '0' || lexer->lookahead > '9') {
            // this also covers EOF, because lookahead is 0 at EOF
            return false;
        }
        // accumulate the decimal length prefix, e.g. "41700"
        int len = lexer->lookahead - '0';
        lexer->advance(lexer, false);
        while (lexer->lookahead >= '0' && lexer->lookahead <= '9') {
            int digit = lexer->lookahead - '0';
            len *= 10;
            len += digit;
            lexer->advance(lexer, false);
        }
        *(int*)payload = len;
        lexer->result_symbol = STRING_LENGTH_PREFIX;
        return true;
    } else if (valid_symbols[STRING_VALUE]) {
        int len = *(int*)payload;
        if (len >= 0) {
            while (len > 0) {
                if (lexer->eof(lexer)) {
                    // end of the current parse chunk, not necessarily
                    // end of file: this is where long strings fail
                    return false;
                }
                lexer->advance(lexer, false);
                --len;
            }
            lexer->result_symbol = STRING_VALUE;
            return true;
        } else {
            // invalid length; we should never reach this case
            return false;
        }
    } else {
        return false;
    }
}

minimal reproduction

https://github.com/milahu/tree-sitter-longstringscanner

git clone https://github.com/milahu/tree-sitter-longstringscanner
cd tree-sitter-longstringscanner
./test.py | grep -C5 -F 'STRING_VALUE 78: len = 6000, lexer->eof(lexer) == true'

actual output

$ ./test.py | grep -C5 -F 'STRING_VALUE 78: len = 6000, lexer->eof(lexer) == true'
STRING_VALUE 79: len = 6005, lexer->lookahead = \x3f
STRING_VALUE 79: len = 6004, lexer->lookahead = \xff
STRING_VALUE 79: len = 6003, lexer->lookahead = \x11
STRING_VALUE 79: len = 6002, lexer->lookahead = \x5f
STRING_VALUE 79: len = 6001, lexer->lookahead = \xff
STRING_VALUE 78: len = 6000, lexer->eof(lexer) == true
tree_sitter_longstringscanner_external_scanner_deserialize: len = 6000
STRING_LENGTH_PREFIX 51: lexer->lookahead = \x3a
tree_sitter_longstringscanner_external_scanner_deserialize: len = 100000
STRING_LENGTH_PREFIX 51: lexer->lookahead = \xff
tree_sitter_longstringscanner_external_scanner_deserialize: len = 100000

the test input is this string

randbytes_50 = b'\xd8\xba\xa5A\xc6\xa6Z1m\x94\xc8\xe0"_\x83\xcd\xd7\xb1fY\xf6>Q6LH<,\x9d\xb1\x8f\xb4\xe8\xa2/NX\x19#,\xdbvp\xab\x19?\xfe\x11_\x9b'
src = b"100000:" + (randbytes_50 * 2000)

so the parser fails after 100000 - 6000 = 94000 bytes

surprisingly, this only fails above some level of randomness (entropy)
(in my example 50 random bytes are repeated)
and this works on non-random strings like

src = b"100000:" + (b"\x00" * 100_000)
src = b"100000:" + (random.randbytes(40) * 2500)

also, such non-random strings can be much longer, for example 10 MB

src = b"10000000:" + (b"\x00" * 10_000_000)

so apparently, tree-sitter uses some compression on the input buffer
which makes long strings work below some level of entropy

Expected behavior

possible workaround:
have only one external token in scanner.c
and continue parsing the token in consecutive calls to the scan function

but having only one token is too limiting...
that would mean i get only one token for my bencode string 12345:xxx...
but i want 3 tokens: 12345 and : and xxx...

i asked ChatGPT about this problem, and it suggested 2 possible solutions

resumable tokens

so this is a limitation of the tree-sitter parsing system. how could i modify the tree-sitter core to allow "something like this" in my scanner.c? i would need a way to tell tree-sitter "i am currently processing token type X, let me store this in the parser state, and on the next call to scan, i want to continue parsing this token"

Exactly — you’ve nailed it.

Tree-sitter’s design assumes external scanners produce whole tokens in a single call. There’s no “yield and resume inside a token.” That’s a conscious trade-off: it keeps the parsing algorithm deterministic and incremental, but it means your scanner.c can’t say “I’ve read 15k bytes of this 40k token, please call me back later to finish it.”


What would it take to support resumable tokens?

You’d need to modify the Tree-sitter core so that:

  1. A scanner can declare it’s “in progress.”
    That means introducing a new return type (not just true/false) to say (sketched after this list):

    • ACCEPT (token fully matched)
    • REJECT (no token here)
    • INCOMPLETE (partial match, continue later) ← new
  2. The parser stores that partial state.

    • Right now, Tree-sitter serializes only the scanner payload between parse attempts.

    • To resume, you’d also need to store:

      • Which external token was being matched (STRING_VALUE)
      • The fact that the scanner is mid-match and must be re-invoked at the same input position.
  3. The parser calls back into the scanner on resume.

    • Instead of starting a fresh token match, Tree-sitter would need to re-enter the scan function, flagged as “continuation mode,” so you can keep consuming bytes until the token is done.
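
To make the new return type from item 1 concrete, here is a minimal sketch of what it could look like; TSScanResult and its variants are hypothetical names, not part of tree-sitter's actual API:

typedef enum {
  TS_SCAN_REJECT,      // no token here (today's "return false")
  TS_SCAN_ACCEPT,      // token fully matched (today's "return true")
  TS_SCAN_INCOMPLETE,  // partial match; re-invoke at the same position (new)
} TSScanResult;

// the scan function would then return TSScanResult instead of bool:
// TSScanResult scan(void *payload, TSLexer *lexer, const bool *valid_symbols);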

Why Tree-sitter doesn’t do this

  • Incremental reparsing: Tree-sitter often discards subtrees and re-lexes parts of the buffer. If a token can span 40k bytes and be partially consumed, then reparse boundaries inside that token become extremely hard to reason about.
  • Performance: The scanner would need to be serialized mid-token and deserialized later, possibly many times, for large strings. That means serializing huge state objects or maintaining internal pointers into the source buffer.
  • Determinism: The GLR-style parsing algorithm depends on knowing where tokens begin and end at each step. Allowing “floating” incomplete tokens complicates the parse table logic.

In short: it’s not just missing, it’s intentionally excluded from the design.


What you could do if you wanted to hack Tree-sitter

If you really wanted resumable tokens, you’d need to:

  • Extend TSLexer with something like lexer->yield() to signal “pause here.”
  • Extend the external scanner API so scan can return an INCOMPLETE result.
  • Teach the parser to treat “incomplete external token” as a valid stack element, serialize it, and ensure that only that token is requested on resume.
  • Make sure incremental reparsing knows how to re-enter mid-token (the hardest part, since Tree-sitter is built to assume tokens are atomic).

That’s essentially a redesign: you’re asking Tree-sitter to act like a streaming lexer for certain tokens, which cuts against its “tokens are atomic” philosophy.


Realistic alternative

Rather than forking Tree-sitter’s core, most people solve this by:

  • Keeping external tokens small and bounded (prefixes, keywords, punctuation).
  • Letting the grammar describe large sequences (token.immediate(/.*/) style).
  • Validating cross-field constraints (like “string length matches prefix”) in a post-parse validator, not in the parser itself (see the sketch below).

That’s the Tree-sitter way: parse structure fast, check semantics later.
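
To illustrate the last bullet, here is a hedged sketch of such a post-parse validator built on tree-sitter's C API; the node shape (a string node whose first named child is the length prefix and whose second is the value) is an assumption for illustration, not the actual tree-sitter-bencode grammar:

#include <stdbool.h>
#include <stdlib.h>
#include <tree_sitter/api.h>

// check that a length-prefixed string's declared length matches the
// actual byte length of its value node
static bool string_length_matches(TSNode string_node, const char *src) {
  TSNode prefix = ts_node_named_child(string_node, 0);  // assumed: length prefix
  TSNode value  = ts_node_named_child(string_node, 1);  // assumed: string value
  // read the decimal prefix straight from the source; strtol stops at ':'
  long declared = strtol(src + ts_node_start_byte(prefix), NULL, 10);
  long actual = (long)(ts_node_end_byte(value) - ts_node_start_byte(value));
  return declared == actual;
}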


⚖️ So the direct answer:
Yes, you could modify Tree-sitter core to allow resumable tokens by introducing a third return state (INCOMPLETE) and extending parser state serialization. But it would be a deep, risky change, because Tree-sitter’s performance and incremental guarantees assume tokens are indivisible.

multi-token emission

what would be necessary in tree-sitter core to allow an external scanner function to yield multiple token types from one scan call, possibly across multiple input buffer chunks?

i assume this would be simpler to implement than the "resumable tokens" idea

You’re thinking in exactly the right direction:
it helps to contrast what exists now with the extension you could imagine adding to Tree-sitter core:


1. Current model (what exists now)

  • scan() is called with valid_symbols[].
  • You can return exactly one token (lexer->result_symbol = …; return true;).
  • If you can’t complete a token, you return false.
  • There is no way to emit “token A, then immediately token B” (a distilled example of this single-token contract follows below).
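
For reference, the distilled example mentioned above — a scanner under the current contract, with a hypothetical token type:

#include <tree_sitter/parser.h>

enum TokenType { COLON };  // hypothetical external token

bool tree_sitter_example_external_scanner_scan(
  void *payload, TSLexer *lexer, const bool *valid_symbols
) {
  if (valid_symbols[COLON] && lexer->lookahead == ':') {
    lexer->advance(lexer, false);
    lexer->result_symbol = COLON;  // exactly one token per scan call
    return true;
  }
  return false;  // no token; there is no third option
}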

2. What you’re asking for: multi-token emission

That would mean letting scan() emit a sequence of tokens in one pass, e.g.:

STRING_LENGTH_PREFIX
":"
STRING_VALUE

while still exposing them separately in the grammar.


What would be necessary in Tree-sitter core?

  1. Extend TSLexer API
    Currently you set:

    lexer->result_symbol = SOME_TOKEN;
    return true;

    That’s a single output.
    To support multiple tokens, you’d need something like:

    lexer->emit(SOME_TOKEN);
    lexer->emit(OTHER_TOKEN);
    return true;

    where emit pushes multiple results onto a queue.

  2. Change parser expectations
    The parser is designed to receive one token per external scanner call.
    It would need to handle a batch of tokens emitted by the scanner and feed them back into the parse stack one by one.

  3. Serialization changes
    If you stop mid-batch because of eof(), the parser would need to serialize:

    • Which token in the batch was last emitted, and
    • The scanner’s state.
      So on resume it knows “I already gave the prefix, colon is next.”
  4. Incremental reparsing
    Multi-token batches complicate incremental reparse:

    • If a subtree is invalidated, Tree-sitter must decide whether to re-ask the scanner for the whole batch, or resume partway through.
    • That means the parser’s notion of “token boundaries” would no longer be strictly one-to-one with scan() calls.

Why this is simpler than resumable tokens

  • With resumable tokens, the parser has to treat “half-consumed tokens” as valid parse stack elements → very invasive.
  • With multi-token emission, you still have atomic tokens (prefix, colon, value) in the syntax tree. The only change is: the scanner can provide more than one at a time.
  • Parser logic changes are narrower: instead of “exactly one token from scanner,” allow “queue of tokens from scanner.”

Minimal core changes (conceptual)

  • Add a per-scanner token queue to TSLexer.

  • Change scan() signature so it can fill that queue.

  • Parser loop:

    • If queue non-empty, pop token from queue.
    • Else, call scan() to refill.
  • Extend serialization so the “queue head index” is saved/restored.


Trade-offs

  • Pros:

    • You can keep prefix, colon, value as distinct syntax nodes.
    • You don’t need resumable mid-token state.
    • Cleaner mental model: scanner acts like a mini-lexer that can yield a run of tokens.
  • Cons:

    • Still a non-trivial patch to Tree-sitter core.
    • Every consumer of TSLexer (bindings, parser runtime, language grammars) needs updating to understand multi-token output.
    • Incremental reparsing logic gets more complex around partial batches.

✅ So, to answer your question:
Yes, supporting multi-token emission would be simpler than resumable tokens, but it still requires core modifications:

  • extend TSLexer to allow a queue of tokens,
  • adjust the parser to consume from that queue,
  • serialize the queue state between parse chunks.

Every consumer of TSLexer (bindings, parser runtime, language grammars) needs updating to understand multi-token output.

it should be possible to create a backward-compatible API by supporting both single-token emission via lexer->result_symbol = SOME_TOKEN; and multi-token emission via lexer->emit(SOME_TOKEN); lexer->emit(OTHER_TOKEN);

Yes — exactly 👍

If you wanted to prototype this, the cleanest path is to layer multi-token emission on top of the existing result_symbol API, so that old scanners keep working unchanged.

Here’s what such a backward-compatible design could look like:


1. Extend TSLexer with a small queue

#include <stdint.h>

#define TS_MAX_MULTI_TOKENS 8  // fixed queue capacity; 8 is an arbitrary choice

typedef struct {
  uint16_t buffer[TS_MAX_MULTI_TOKENS];
  uint8_t count;  // number of queued symbols
  uint8_t head;   // index of the next symbol to pop
} TokenQueue;

struct TSLexer {
  ...
  uint16_t result_symbol;  // legacy single-token mode
  TokenQueue token_queue;  // new multi-token mode
};

2. New API for scanners

// push a symbol onto the lexer's ring-buffer token queue
void ts_lexer_emit(TSLexer *lexer, uint16_t symbol) {
  if (lexer->token_queue.count < TS_MAX_MULTI_TOKENS) {
    uint8_t pos = (lexer->token_queue.head + lexer->token_queue.count) % TS_MAX_MULTI_TOKENS;
    lexer->token_queue.buffer[pos] = symbol;
    lexer->token_queue.count++;
  }
  // else: queue full — a real implementation would have to report this
}
  • Old code still does lexer->result_symbol = TOKEN; return true;.

  • New code can do:

    ts_lexer_emit(lexer, STRING_LENGTH_PREFIX);
    ts_lexer_emit(lexer, COLON);
    ts_lexer_emit(lexer, STRING_VALUE);
    return true;

3. Parser changes

  • Before calling scan(), the parser first checks if token_queue.count > 0.
  • If yes, pop one symbol and hand it to the parse table.
  • If no, call scan() like today.
  • If scan() fills the queue, tokens are fed out one by one in subsequent parser steps (sketched below).
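
As a small sketch of that consume step, reusing TokenQueue and TS_MAX_MULTI_TOKENS from step 1 (token_queue_pop and take_next_symbol are hypothetical names for parser internals):

// pop the next queued symbol; returns false if the queue is empty
static bool token_queue_pop(TokenQueue *q, uint16_t *symbol) {
  if (q->count == 0) return false;
  *symbol = q->buffer[q->head];
  q->head = (q->head + 1) % TS_MAX_MULTI_TOKENS;
  q->count--;
  return true;
}

// after scan() returned true: prefer queued tokens over the legacy
// result_symbol, so old single-token scanners keep working unchanged
static uint16_t take_next_symbol(TSLexer *lexer) {
  uint16_t symbol;
  if (token_queue_pop(&lexer->token_queue, &symbol)) return symbol;
  return lexer->result_symbol;
}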

4. Serialization / deserialization

  • In addition to serializing the scanner payload, you serialize:

    • token_queue.head
    • token_queue.count
    • The contents of the queue (typically a few uint16_ts).
  • That way, if parsing stops mid-batch, Tree-sitter can resume with the remaining tokens (see the sketch below).
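
A minimal sketch of that, assuming the queue state is appended to the byte buffer that tree-sitter already passes to the external scanner's serialize/deserialize functions (the byte layout here is an arbitrary choice):

#include <stdint.h>
#include <string.h>

// write head, count and the queue contents; returns bytes written
static unsigned token_queue_serialize(const TokenQueue *q, char *buffer) {
  buffer[0] = (char)q->head;
  buffer[1] = (char)q->count;
  memcpy(buffer + 2, q->buffer, sizeof q->buffer);
  return 2 + sizeof q->buffer;
}

// restore the queue; a short or empty buffer yields an empty queue
static void token_queue_deserialize(TokenQueue *q, const char *buffer, unsigned length) {
  if (length < 2 + sizeof q->buffer) {
    q->head = 0;
    q->count = 0;
    return;
  }
  q->head = (uint8_t)buffer[0];
  q->count = (uint8_t)buffer[1];
  memcpy(q->buffer, buffer + 2, sizeof q->buffer);
}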


5. Backward compatibility

  • If a scanner only sets result_symbol, everything works as before.
  • If it calls ts_lexer_emit(), the parser ignores result_symbol and uses the queue.
  • That way no existing grammar/scanner breaks.

6. Incremental parsing considerations

  • From the parser’s perspective, there’s still no such thing as a “half-token”: each symbol in the queue is atomic.
  • If a parse is invalidated in the middle of a batch, the queue state gets reset when Tree-sitter re-invokes the scanner — just like today, but now with a queue to refill.

✅ With this design:

  • You get to emit multiple logical tokens (prefix, colon, value).
  • You don’t have to merge them into one external token.
  • Old scanners keep working.
  • Core changes are localized: add a token queue to TSLexer, teach the parser loop to consume it, add serialization for it.
