-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Deprecate unused dataset_formatting module #4242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Per reviewer feedback, adding FutureWarning to all functions in dataset_formatting module instead of immediate removal. Functions will be removed in TRL 0.27. - Added warnings.warn() to get_formatting_func_from_dataset() - Added warnings.warn() to conversations_formatting_function() - Added warnings.warn() to instructions_formatting_function() - Updated docstrings with deprecation notices - Added pytest.mark.filterwarnings to test class to suppress expected warnings
1baa79b
to
eda2f04
Compare
✅ Added deprecation warnings instead of immediate removal. Changes made:
All functions now warn users to use |
Resolve qgallouedec's review comment by: - Deleting docs/source/detoxifying_a_lm.md (obsolete toxicity documentation) - Removing reference from _toctree.yml - Removing research_projects references from example_overview.md - Removing stack_llama script references from peft_integration.md All removed documentation referenced deleted research_projects scripts that no longer exist in the repository.
This reverts commit 24cb61a.
Can you please rename the PR "deprecate" |
Done 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
Analysis:
trl.extras.dataset_formatting
📁 Current State
File:
trl/extras/dataset_formatting.py
Functions:
get_formatting_func_from_dataset()
- Auto-detects dataset formatconversations_formatting_function()
- Formats ChatML-style conversationsinstructions_formatting_function()
- Formats instruction-completion pairs🔍 Usage Analysis
Internal Usage:
trl/extras/__init__.py
trl/__init__.py
(public API)SFTTrainer
,GKDTrainer
, etc.)tests/test_dataset_formatting.py
External Usage Risk:
from trl.extras.dataset_formatting import get_formatting_func_from_dataset
📜 History
Added: PR #1208 (ChatML support)
Last Modified: July 2025, PR #3704 - removed
ConstantLengthDataset
dependencyDeprecation Status: ❌ No warnings currently in place
🤔 Why It's Obsolete
tokenizer.apply_chat_template()
SFTTrainer
accepts user-providedformatting_func
, not auto-detected onesChanges
trl/extras/dataset_formatting.py
tests/test_dataset_formatting.py