@sambhavnoobcoder commented Nov 22, 2025

Problem Statement

The Issue

In on-policy distillation (GKD and MiniLLM trainers), when student and teacher models use different tokenizers, the training process produces incorrect results. Specifically:

  • The student model generates rollouts using its own tokenizer
  • The teacher model receives the raw student-tokenized input without re-tokenization
  • This causes teacher logprobs to be computed in low-probability regions
  • Different chat templates (e.g., Llama's <|start_header_id|> vs Qwen's <|im_start|>) further exacerbate the issue

Root Cause Analysis

What We Found

Through careful code inspection and testing, we identified that both GKDTrainer and MiniLLMTrainer had the same fundamental issue:

  1. Student Generation Phase: The student model generates completions using its own tokenizer, producing token IDs specific to its vocabulary
  2. Teacher Evaluation Phase (BUGGY): The teacher model receives these student-tokenized IDs directly
  3. The Problem: The teacher's tokenizer has a different vocabulary mapping, so these token IDs represent completely different tokens in the teacher's vocabulary
  4. Result: Teacher logprobs are computed on nonsensical token sequences, leading to incorrect probability distributions

Example Scenario

  • Student (Qwen) tokenizes "Hello" → token ID 123
  • In Qwen's vocabulary: 123 = "Hello"
  • In teacher's vocabulary (Llama): 123 = "World" (different token!)
  • Teacher computes logprobs for "World" instead of "Hello" → wrong probability distribution
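
The mismatch is easy to reproduce by hand. Below is a minimal sketch (model names are taken from the usage example later in this description; the exact strings you get back depend on tokenizer versions, and any two tokenizers with different vocabularies show the same effect):

# Decode the *same* token IDs with two different tokenizers.
from transformers import AutoTokenizer

student_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
teacher_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# The student produces token IDs from its own vocabulary.
student_ids = student_tok.encode("Hello", add_special_tokens=False)

# Decoding those IDs with each tokenizer gives different text, because the two
# vocabularies map the same ID to different tokens.
print(student_tok.decode(student_ids))  # "Hello"
print(teacher_tok.decode(student_ids))  # some unrelated string from Llama's vocabulary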

Proposed Solution

High-Level Approach

We implemented a text-based re-tokenization approach:

  1. Text Preservation: Preserve the text content from student-generated rollouts
  2. Re-tokenization: Convert the text back to tokens using the teacher's tokenizer
  3. Correct Evaluation: Teacher processes tokens from its own vocabulary
  4. Text-Aligned Loss: For different vocabulary sizes, align predictions via text decoding/encoding
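
A minimal sketch of this flow (the function name and signature are illustrative, not the trainer's actual API):

import torch

def teacher_logprobs_from_student_rollout(student_tok, teacher_tok, teacher_model, student_ids):
    # 1. Preserve the text content of the student-generated rollout.
    text = student_tok.decode(student_ids, skip_special_tokens=False)

    # 2. Re-tokenize that text with the teacher's own tokenizer.
    teacher_inputs = teacher_tok(text, return_tensors="pt").to(teacher_model.device)

    # 3. The teacher now scores token IDs drawn from its own vocabulary.
    with torch.no_grad():
        teacher_logits = teacher_model(**teacher_inputs).logits
    return torch.log_softmax(teacher_logits, dim=-1)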

Key Design Decisions

  1. Opt-in Feature: Only activates when teacher_tokenizer_name_or_path is specified in config
  2. Backward Compatible: Same-tokenizer scenarios use the original code path (no performance overhead)
  3. Handles Vocab Mismatches: Text-aligned loss works with any vocabulary sizes
  4. Both Trainers Fixed: Consistent implementation across GKD and MiniLLM

Implementation Details

Components Implemented

1. Configuration Updates

  • Added teacher_tokenizer_name_or_path parameter to both GKDConfig and MiniLLMConfig
  • Comprehensive docstrings explaining when and how to use the parameter

2. Teacher Tokenizer Loading

  • Automatic loading of teacher's tokenizer when config parameter is set
  • Warning emitted when the student and teacher models appear to differ but no teacher tokenizer is specified
  • Liger kernel incompatibility check (raises clear error if both are enabled)
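
A sketch of this loading and validation logic; only teacher_tokenizer_name_or_path comes from this PR's config, while the Liger flag and the model-name comparison below are illustrative stand-ins:

import warnings
from transformers import AutoTokenizer

def load_teacher_tokenizer(args, student_model_name, teacher_model_name, use_liger_kernel=False):
    """Return the teacher tokenizer, or None when the original single-tokenizer path applies."""
    if args.teacher_tokenizer_name_or_path is None:
        if teacher_model_name != student_model_name:
            # Models look different but no teacher tokenizer was given: warn, don't fail.
            warnings.warn(
                "Student and teacher models appear to differ but teacher_tokenizer_name_or_path "
                "is not set; teacher logprobs may be computed on mismatched token IDs."
            )
        return None
    if use_liger_kernel:
        # Cross-tokenizer distillation and the Liger kernel cannot be combined.
        raise ValueError("Cross-tokenizer distillation is incompatible with the Liger kernel.")
    return AutoTokenizer.from_pretrained(args.teacher_tokenizer_name_or_path)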

3. Text Preservation Pipeline

  • Modified generation pipeline to return text alongside tokens
  • Fallback decode path if text not preserved in inputs dictionary
  • Special token preservation to maintain structure
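
A sketch of the preferred-text / fallback-decode behaviour; the "completion_texts" key and the function name are assumptions, not the PR's exact names:

def get_completion_texts(inputs, completion_ids, student_tokenizer):
    # Preferred path: the generation step already stored the generated text.
    texts = inputs.get("completion_texts")
    if texts is None:
        # Fallback: reconstruct the text from the student's token IDs, keeping special
        # tokens so the chat-template structure is preserved.
        texts = student_tokenizer.batch_decode(completion_ids, skip_special_tokens=False)
    return texts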

4. Re-tokenization Utility

  • Shared utility function build_teacher_inputs_from_texts() for both trainers
  • Handles prompt and completion concatenation
  • Proper label masking (padding tokens → -100, prompt tokens → -100)
  • Device placement handling
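
A simplified sketch of what such a build_teacher_inputs_from_texts() utility does, based on the bullets above; the actual signature and padding details in the PR may differ:

import torch

def build_teacher_inputs_from_texts(teacher_tokenizer, prompt_texts, completion_texts, device):
    input_ids, labels = [], []
    for prompt, completion in zip(prompt_texts, completion_texts):
        prompt_ids = teacher_tokenizer(prompt, add_special_tokens=False)["input_ids"]
        completion_ids = teacher_tokenizer(completion, add_special_tokens=False)["input_ids"]
        input_ids.append(prompt_ids + completion_ids)
        # Prompt tokens are masked with -100 so only completion tokens contribute to the loss.
        labels.append([-100] * len(prompt_ids) + completion_ids)

    # Right-pad to the longest sequence; padding positions are also masked with -100.
    max_len = max(len(ids) for ids in input_ids)
    pad_id = teacher_tokenizer.pad_token_id if teacher_tokenizer.pad_token_id is not None else 0
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in input_ids]
    labels = [lab + [-100] * (max_len - len(lab)) for lab in labels]
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in input_ids]

    # Tensors are created directly on the requested device (device placement handling).
    return {
        "input_ids": torch.tensor(input_ids, device=device),
        "attention_mask": torch.tensor(attention_mask, device=device),
        "labels": torch.tensor(labels, device=device),
    }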

5. Cross-Tokenizer Loss Computation

  • Conditional branching: cross-tokenizer path vs same-tokenizer path
  • Text-aligned loss for vocabulary size mismatches
  • Teacher predictions decoded and re-encoded to student vocab space
  • Sequence length alignment (handles different tokenizer outputs)
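
A sketch of the branching and sequence-length alignment; loss_fn stands in for the trainer's distillation loss, and the text-aligned handling of mismatched vocabularies (decoding teacher predictions and re-encoding them into the student's vocabulary) is only noted in comments:

def compute_distillation_loss(student_logits, teacher_logits, labels, loss_fn, cross_tokenizer):
    if not cross_tokenizer:
        # Same-tokenizer path: original behaviour, vocabularies and positions already line up.
        return loss_fn(student_logits, teacher_logits, labels)

    # Cross-tokenizer path: the student and teacher tokenizations of the same text can have
    # different lengths, so truncate both to the shorter sequence before comparing.
    min_len = min(student_logits.size(1), teacher_logits.size(1))
    student_logits = student_logits[:, :min_len]
    teacher_logits = teacher_logits[:, :min_len]
    labels = labels[:, :min_len]

    # When the vocabularies also differ in size, the PR uses a text-aligned loss: teacher
    # predictions are decoded to text and re-encoded into the student's vocab space first.
    return loss_fn(student_logits, teacher_logits, labels)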

6. Safety and Validation

  • Model mismatch detection with user-friendly warnings
  • Assertion checks for teacher tokenizer loading
  • Proper error messages for incompatible configurations

Testing Strategy

Unit Tests

We verified each component independently:

  1. Config Parameter Test: Verified teacher_tokenizer_name_or_path loads correctly
  2. Tokenizer Loading Test: Confirmed teacher tokenizer loads with correct vocabulary size
  3. Backward Compatibility Test: Verified same-tokenizer scenarios use original path
  4. Warning System Test: Confirmed warnings trigger when models differ without tokenizer specified
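
For illustration, a minimal pytest-style check of the new config parameter might look like this (the test name and structure are assumptions, not the PR's actual test suite):

from trl import GKDConfig

def test_teacher_tokenizer_config(tmp_path):
    config = GKDConfig(
        output_dir=str(tmp_path),
        teacher_tokenizer_name_or_path="meta-llama/Llama-3.2-1B",
    )
    assert config.teacher_tokenizer_name_or_path == "meta-llama/Llama-3.2-1B"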

End-to-End Tests

We validated the complete pipeline with real scenarios:

  1. Different Tokenizers Test (Qwen ↔ Llama):
    • Student: tiny-Qwen2ForCausalLM-2.5 (151,665-token vocabulary)
    • Teacher: tiny-LlamaForCausalLM-3.2 (128,256-token vocabulary)
    • Result: Training completed successfully, loss stable (11.9269)
  2. Same Tokenizer Test (Qwen ↔ Qwen):
    • Both models use the same tokenizer
    • Result: Training completed successfully, loss stable (11.8981)
    • Confirms backward compatibility

Edge Cases Verified

We systematically tested all critical edge cases:

  • No teacher tokenizer specified (backward compatibility)
  • Models differ but tokenizer not specified (warning issued)
  • Liger kernel + cross-tokenizer (error raised)
  • Different sequence lengths (min_length alignment)
  • Padding token handling (masked with -100)
  • Text not preserved (fallback decode path)
  • Different vocabulary sizes (real test: 151K vs 128K)
  • Empty completions (special tokens preserved)
  • Batch size = 1
  • Teacher in eval mode
  • Device mismatches
  • Chat template differences

Verification Results

What We Validated

  • Original bug fixed: Teacher now receives correctly tokenized inputs
  • GKD implementation: Full cross-tokenizer support with text-aligned loss
  • MiniLLM implementation: Full cross-tokenizer support reusing the GKD utilities
  • Backward compatibility: Same-tokenizer scenarios unchanged
  • Different vocab sizes: Successfully handles 151K ↔ 128K token vocabularies
  • Warning system: Alerts users when the configuration may be incorrect
  • Error handling: Clear errors for incompatible configurations
  • Edge cases: All 12 edge cases systematically verified

Test Coverage Summary

Test Category        | Status    | Details
Config Loading       | ✅ PASSED | Parameter exists and works correctly
Tokenizer Loading    | ✅ PASSED | Teacher tokenizer loads with correct vocab
Different Tokenizers | ✅ PASSED | Qwen (151K) ↔ Llama (128K) training succeeds
Same Tokenizer       | ✅ PASSED | Backward compatibility maintained
Warning System       | ✅ PASSED | Alerts for potential misconfigurations
Edge Cases           | ✅ PASSED | All 12 edge cases handled correctly

Usage Example

Before (Would Fail or Give Wrong Results)

# This would fail or produce incorrect results
config = GKDConfig(output_dir="./output")
trainer = GKDTrainer(
    model="Qwen/Qwen2.5-0.5B",           # Different tokenizer
    teacher_model="meta-llama/Llama-3.2-1B",  # Different tokenizer
    args=config,
    # ... other args
)
trainer.train()  # ❌ Wrong teacher logprobs!

After (Works Correctly)

# Now works correctly with cross-tokenizer support
config = GKDConfig(
    output_dir="./output",
    teacher_tokenizer_name_or_path="meta-llama/Llama-3.2-1B",  # Key parameter!
)
trainer = GKDTrainer(
    model="Qwen/Qwen2.5-0.5B",           # Different tokenizer
    teacher_model="meta-llama/Llama-3.2-1B",  # Different tokenizer
    args=config,
    # ... other args
)
trainer.train()  # ✅ Correct teacher logprobs!

Screenshots

Test Results

  1. Existing tests (screenshot: existing_tests)
  2. GKD with Different Tokenizers, Qwen ↔ Llama (screenshot: cross_tokeniser_with_diff_models)
  3. GKD Same Tokenizer, Backward Compatibility (screenshot: cross_tokeniser_with_same_tokeniser)
  4. MiniLLM Sanity Check (screenshot: minillm_test)

Related Issues


  • Cross-tokenizer distillation fails in GKD and MiniLLM trainers