Skip to content

Conversation

pavel-esir
Copy link
Contributor

  • Currently for gguf model we cannot get string representation of special tokens.
  • In order to infer special tokens representation, we should run decode with skip_special_tokens=False.
  • But because GGUF models don't have 'openvino_tokenizers_version' rt_info genai::Tokenizers assumes it's and old model and doesn't set states which are required in order to run decode with skip_special_tokens=False.
  • We fixed getting model version, if it's a gguf then it's definitely newer than 2024.5.

Ticket: CVS-172426

@pavel-esir pavel-esir added the bug Something isn't working label Aug 25, 2025
@pavel-esir pavel-esir requested review from Wovchena and Copilot and removed request for Copilot August 25, 2025 11:09
@github-actions github-actions bot added category: tokenizers Tokenizer class or submodule update category: GGUF GGUF file reader labels Aug 25, 2025
@pavel-esir pavel-esir requested a review from mzegla August 25, 2025 11:09
@Copilot Copilot AI review requested due to automatic review settings August 27, 2025 09:53
Copilot

This comment was marked as outdated.

@pavel-esir pavel-esir added this to the 2025.4 milestone Aug 27, 2025
@pavel-esir pavel-esir requested a review from Copilot August 27, 2025 11:14
Copilot

This comment was marked as outdated.

@pavel-esir pavel-esir requested a review from as-suvorov August 27, 2025 12:01
@as-suvorov
Copy link
Collaborator

GGUF tests are failing: https://github.com/openvinotoolkit/openvino.genai/actions/runs/17265102491/job/48997729147?pr=2667#step:7:1243

@pavel-esir pavel-esir requested a review from rasapala September 5, 2025 10:08
@pavel-esir pavel-esir requested a review from apaniukov September 5, 2025 12:01
@as-suvorov as-suvorov enabled auto-merge September 5, 2025 12:15
@github-actions github-actions bot added the category: GHA CI based on Github actions label Sep 9, 2025
@Wovchena Wovchena requested a review from Copilot September 9, 2025 09:01
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the issue where GGUF models cannot get string representations of special tokens by properly setting version information for tokenizer models. The fix ensures that GGUF models are correctly identified as newer than version 24.5, enabling proper decode operations with special tokens.

Key changes:

  • Added version information to GGUF tokenizer models to indicate they are newer than 24.5
  • Fixed version checking logic to properly handle GGUF models
  • Enhanced test coverage for special token handling in GGUF models

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/cpp/src/tokenizer/tokenizer_impl.cpp Sets openvino_genai_version for GGUF models and fixes version checking logic
src/cpp/src/gguf_utils/gguf_tokenizer.cpp Adds quote_meta function to properly escape special tokens in regex patterns
tests/python_tests/test_gguf_reader.py Expands test coverage with special token prompts and token validation
.github/workflows/windows.yml Increases timeout for GGUF reader tests from 60 to 100 minutes

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +125 to +137
std::string quote_meta(const std::string& str) {
std::string result = "(";

// todo: add also utf validate
for (char c : str) {
if (!std::isalnum(c) && c != '_') {
result += '\\';
}
result += c;
}
result += ")";
return result;
}
Copy link

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error in comment on line 128: 'utf validate' should be 'UTF validation' or 'UTF-8 validation'.

Copilot uses AI. Check for mistakes.

@as-suvorov as-suvorov added this pull request to the merge queue Sep 11, 2025
Merged via the queue into openvinotoolkit:master with commit 9d350f8 Sep 11, 2025
175 of 185 checks passed
@pavel-esir pavel-esir deleted the fix_getting_ver branch September 11, 2025 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working category: GGUF GGUF file reader category: GHA CI based on Github actions category: tokenizers Tokenizer class or submodule update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants