Bug: Broken error messages in token normalization—{token!r} shown literally instead of actual token value #579

@markknoffler

Description

In _normalize_token, when a stop_token or forbidden_token string maps to multiple token IDs, the code raises a ValueError. The error message uses {token!r} in a regular string instead of an f-string, so users see the literal {token!r} instead of the actual invalid token value. This makes it harder to debug misconfigured stop/forbidden tokens.

Example error shown:

ValueError: Invalid token: {token!r}. `stop_token`s and `forbidden_token`s must map to single token ids in the vocab.

Instead of:

ValueError: Invalid token: 'hello world'. `stop_token`s and `forbidden_token`s must map to single token ids in the vocab.
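The mechanism is a missing `f` prefix on the string literal. A minimal sketch of the difference, using a made-up `token` value and a shortened message rather than the library's actual code:

```python
token = "hello world"  # illustrative value, not taken from the library

# Plain string literal: the braces are kept as-is, nothing is interpolated.
plain = "Invalid token: {token!r}."

# f-string: {token!r} is replaced with repr(token).
formatted = f"Invalid token: {token!r}."

print(plain)      # Invalid token: {token!r}.
print(formatted)  # Invalid token: 'hello world'.
```

If the message spans several adjacent string literals, only the literal that contains `{token!r}` needs the `f` prefix.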

Locations

  1. gemma/gm/text/_sampler.py

  2. gemma/research/t5gemma/sampling.py

Steps to reproduce

  1. Create a tokenizer (e.g. Tokenizer.from_version(3)).
  2. Call _normalize_token(tokenizer, "hello world") (or any string that tokenizes to more than one token).
  3. The ValueError message will contain the literal {token!r} instead of the actual token string.
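
For readers without the package installed, the steps above can be approximated with a stand-in. This is only a sketch: `FakeTokenizer`, `normalize_token_sketch`, and `error_message_for` are hypothetical names, the whitespace-splitting `encode` is an assumption, and the body is not the real `_normalize_token`. It shows the message as it would read once the `f` prefix is applied, and could serve as the shape of a regression check.

```python
class FakeTokenizer:
    """Hypothetical stand-in: yields one id per whitespace-separated word."""

    def encode(self, text):
        # Assumption for illustration only; real tokenizers split differently.
        return [len(word) for word in text.split()]


def normalize_token_sketch(tokenizer, token):
    """Hypothetical re-creation of the single-id check, with the f-prefix applied."""
    if isinstance(token, int):
        return token
    token_ids = tokenizer.encode(token)
    if len(token_ids) != 1:
        raise ValueError(
            f"Invalid token: {token!r}. `stop_token`s and `forbidden_token`s "
            "must map to single token ids in the vocab."
        )
    return token_ids[0]


def error_message_for(token):
    """Return the ValueError text raised for `token`, or '' if none is raised."""
    try:
        normalize_token_sketch(FakeTokenizer(), token)
    except ValueError as err:
        return str(err)
    return ""


print(error_message_for("hello world"))
```

With the fix in place, the printed message names the offending string (`'hello world'`) instead of the placeholder.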