Add cross-tokenizer distillation support for GKD and MiniLLM trainers #4561
Problem Statement
The Issue
In on-policy distillation (GKD and MiniLLM trainers), when student and teacher models use different tokenizers, the training process produces incorrect results. Specifically:
- The teacher receives token IDs produced by the student's tokenizer, so it interprets them against the wrong vocabulary
- Mismatched special tokens (Llama's `<|start_header_id|>` vs Qwen's `<|im_start|>`) further exacerbate the issue

Root Cause Analysis
What We Found
Through careful code inspection and testing, we identified that both `GKDTrainer` and `MiniLLMTrainer` had the same fundamental issue: the student-tokenized `input_ids` were passed directly to the teacher model, which is only correct when both models share a tokenizer.
Example Scenario
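A minimal sketch of the kind of mismatch described above, using a Qwen student and a Llama teacher as in the tests below (model names are placeholders):

```python
from transformers import AutoTokenizer

student_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")        # ~151K vocab
teacher_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # ~128K vocab

# The same chat-formatted text yields different token IDs and lengths under the
# two tokenizers, so the student's input_ids are meaningless to the teacher.
text = "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n"
student_ids = student_tok(text, add_special_tokens=False)["input_ids"]
teacher_ids = teacher_tok(text, add_special_tokens=False)["input_ids"]

print(len(student_ids), len(teacher_ids))  # different sequence lengths for the same text
print(student_tok.decode(teacher_ids))     # reading one tokenizer's IDs with the other gives unrelated text
```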
Proposed Solution
High-Level Approach
We implemented a text-based re-tokenization approach:

- Preserve the original prompt and completion texts alongside the student-tokenized batch
- Re-tokenize those texts with the teacher's own tokenizer before the teacher forward pass (see the sketch of `build_teacher_inputs_from_texts()` under Implementation Details)
- Compute the distillation loss with the student and teacher sequences aligned at the text level
Key Design Decisions
- Cross-tokenizer handling is opt-in: it only activates when `teacher_tokenizer_name_or_path` is specified in the config, so existing same-tokenizer setups are untouched

Implementation Details
Components Implemented
1. Configuration Updates
- Added a `teacher_tokenizer_name_or_path` parameter to both `GKDConfig` and `MiniLLMConfig`; a sketch of the field follows below
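A sketch of what the new field could look like on the config side; the field name comes from this PR, but the dataclass shape, default, and help text here are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GKDConfig:  # only the relevant fragment; MiniLLMConfig gains the same field
    # New in this PR: tokenizer to load for the teacher model. When None, the
    # student's tokenizer (and its token IDs) are reused, preserving the
    # previous same-tokenizer behaviour.
    teacher_tokenizer_name_or_path: Optional[str] = field(
        default=None,
        metadata={"help": "Tokenizer of the teacher model, if it differs from the student's."},
    )
```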
2. Teacher Tokenizer Loading

3. Text Preservation Pipeline
4. Re-tokenization Utility
- Added a shared `build_teacher_inputs_from_texts()` helper used by both trainers; see the sketch below
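A sketch of what such a helper could look like; the function name appears in this PR, but the exact signature, padding handling, and return values are assumptions:

```python
import torch
from transformers import PreTrainedTokenizerBase

def build_teacher_inputs_from_texts(
    texts: list[str],
    teacher_tokenizer: PreTrainedTokenizerBase,
    max_length: int | None = None,
    device: torch.device | str | None = None,
):
    """Re-tokenize the preserved prompt/completion texts with the teacher's tokenizer."""
    # Some tokenizers (e.g. Llama) ship without a pad token; fall back to EOS for batching.
    if teacher_tokenizer.pad_token is None:
        teacher_tokenizer.pad_token = teacher_tokenizer.eos_token

    enc = teacher_tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=max_length is not None,
        max_length=max_length,
        add_special_tokens=False,  # texts are assumed to already carry chat-template markup
    )
    if device is not None:
        enc = {k: v.to(device) for k, v in enc.items()}
    return enc["input_ids"], enc["attention_mask"]
```

In the trainers this would be called per batch, e.g. `teacher_ids, teacher_mask = build_teacher_inputs_from_texts(batch_texts, teacher_tokenizer, device=model.device)` (hypothetical call site).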
5. Cross-Tokenizer Loss Computation

6. Safety and Validation
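One way the warnings and error handling described here (and under Verification Results) could look; `resolve_teacher_tokenizer` is a hypothetical helper, not necessarily the PR's code:

```python
import warnings
from transformers import AutoTokenizer

def resolve_teacher_tokenizer(student_tokenizer, teacher_model_name_or_path,
                              teacher_tokenizer_name_or_path=None):
    # Explicitly configured: load it, letting a missing or incompatible path fail loudly.
    if teacher_tokenizer_name_or_path is not None:
        return AutoTokenizer.from_pretrained(teacher_tokenizer_name_or_path)

    # Not configured: keep the previous behaviour (reuse the student tokenizer), but
    # warn if the teacher checkpoint's own tokenizer clearly differs from the student's.
    teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_model_name_or_path)
    if teacher_tokenizer.get_vocab() != student_tokenizer.get_vocab():
        warnings.warn(
            "Student and teacher tokenizers differ but `teacher_tokenizer_name_or_path` "
            "is not set; the teacher will be fed the student's token IDs."
        )
    return student_tokenizer
```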
Testing Strategy
Unit Tests
We verified each component independently:
- `teacher_tokenizer_name_or_path` loads correctly (a sketch of this style of test follows below)
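A sketch of the flavour of unit test meant here (pytest-style; the import path and exact assertions are assumptions):

```python
from transformers import AutoTokenizer
from trl import GKDConfig  # assuming a TRL-style config is what this PR extends

def test_teacher_tokenizer_name_or_path_loads():
    args = GKDConfig(
        output_dir="tmp-gkd",
        teacher_tokenizer_name_or_path="Qwen/Qwen2.5-0.5B-Instruct",
    )
    assert args.teacher_tokenizer_name_or_path == "Qwen/Qwen2.5-0.5B-Instruct"

    tokenizer = AutoTokenizer.from_pretrained(args.teacher_tokenizer_name_or_path)
    assert len(tokenizer("hello world")["input_ids"]) > 0
```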
End-to-End Tests

We validated the complete pipeline with real scenarios:
Different Tokenizers Test (Qwen ↔ Llama):
Same Tokenizer Test (Qwen ↔ Qwen):
Edge Cases Verified
We systematically tested all critical edge cases:
Verification Results
What We Validated
✅ Original Bug Fixed: Teacher now receives correctly tokenized inputs
✅ GKD Implementation: Full cross-tokenizer support with text-aligned loss
✅ MiniLLM Implementation: Full cross-tokenizer support reusing GKD utilities
✅ Backward Compatible: Same-tokenizer scenarios unchanged
✅ Different Vocab Sizes: Successfully handles 151K ↔ 128K token vocabularies
✅ Warning System: Alerts users when configuration may be incorrect
✅ Error Handling: Clear errors for incompatible configurations
✅ All Edge Cases: 12 edge cases systematically verified
Test Coverage Summary
Usage Example
Before (Would Fail or Give Wrong Results)
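Roughly the setup that used to go wrong, sketched against a TRL-style `GKDTrainer` API (model names and dataset are placeholders, not the PR's verbatim example):

```python
from datasets import load_dataset
from trl import GKDConfig, GKDTrainer

train_dataset = load_dataset("trl-lib/chatbot_arena_completions", split="train")  # placeholder dataset

trainer = GKDTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",                # student: Qwen tokenizer (~151K vocab)
    teacher_model="meta-llama/Llama-3.1-8B-Instruct",  # teacher: Llama tokenizer (~128K vocab)
    args=GKDConfig(output_dir="gkd-before"),
    train_dataset=train_dataset,
)
# The teacher silently receives the student's token IDs, so its logits are
# computed over the wrong text and the distillation signal is meaningless.
trainer.train()
```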
After (Works Correctly)
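And the corresponding fixed setup using the new option; again a hedged sketch under the same API assumptions:

```python
from datasets import load_dataset
from trl import GKDConfig, GKDTrainer

train_dataset = load_dataset("trl-lib/chatbot_arena_completions", split="train")  # placeholder dataset

args = GKDConfig(
    output_dir="gkd-cross-tokenizer",
    # New in this PR: rebuild the teacher's inputs from text with its own tokenizer
    # instead of reusing the student's token IDs.
    teacher_tokenizer_name_or_path="meta-llama/Llama-3.1-8B-Instruct",
)

trainer = GKDTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    teacher_model="meta-llama/Llama-3.1-8B-Instruct",
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```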
Screenshots
Test Results
1. Existing tests
2. GKD with Different Tokenizers (Qwen ↔ Llama)
3. GKD Same Tokenizer (Backward Compatibility)
4. MiniLLM Sanity Check
Related Issues