Skip to content

Conversation

behroozazarkhalili
Copy link
Collaborator

Analysis: trl.extras.dataset_formatting

📁 Current State

File: trl/extras/dataset_formatting.py

Functions:

  • get_formatting_func_from_dataset() - Auto-detects dataset format
  • conversations_formatting_function() - Formats ChatML-style conversations
  • instructions_formatting_function() - Formats instruction-completion pairs

🔍 Usage Analysis

Internal Usage:

  • ❌ NOT exported in trl/extras/__init__.py
  • ❌ NOT exported in trl/__init__.py (public API)
  • ❌ NOT used by any trainer (SFTTrainer, GKDTrainer, etc.)
  • ❌ NOT used in any examples directory
  • ❌ NOT mentioned in documentation
  • ONLY used in tests/test_dataset_formatting.py

External Usage Risk:

  • ⚠️ Users could directly import: from trl.extras.dataset_formatting import get_formatting_func_from_dataset
  • ⚠️ Possible usage in external blogs/tutorials (not in this repo)

📜 History

Added: PR #1208 (ChatML support)
Last Modified: July 2025, PR #3704 - removed ConstantLengthDataset dependency
Deprecation Status: ❌ No warnings currently in place

🤔 Why It's Obsolete

  1. Modern Approach: Users now directly use tokenizer.apply_chat_template()
  2. Limited Functionality: Only supports 2 formats (ChatML, instruction)
  3. Brittle Detection: Schema-based detection is fragile
  4. Superseded: SFTTrainer accepts user-provided formatting_func, not auto-detected ones

Changes

  • Removed trl/extras/dataset_formatting.py
  • Removed tests/test_dataset_formatting.py

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Per reviewer feedback, adding FutureWarning to all functions in dataset_formatting module instead of immediate removal. Functions will be removed in TRL 0.27.

- Added warnings.warn() to get_formatting_func_from_dataset()
- Added warnings.warn() to conversations_formatting_function()
- Added warnings.warn() to instructions_formatting_function()
- Updated docstrings with deprecation notices
- Added pytest.mark.filterwarnings to test class to suppress expected warnings
@behroozazarkhalili behroozazarkhalili force-pushed the remove-dataset-formatting-module branch from 1baa79b to eda2f04 Compare October 9, 2025 17:37
@behroozazarkhalili
Copy link
Collaborator Author

✅ Added deprecation warnings instead of immediate removal.

Changes made:

  • Added FutureWarning to all three functions (get_formatting_func_from_dataset, conversations_formatting_function, instructions_formatting_function)
  • Functions will be removed in TRL 0.27 (3 versions from now)
  • Updated docstrings with deprecation notices
  • Added pytest filter to suppress expected warnings in tests

All functions now warn users to use tokenizer.apply_chat_template() directly instead.

behroozazarkhalili and others added 4 commits October 9, 2025 10:39
Resolve qgallouedec's review comment by:
- Deleting docs/source/detoxifying_a_lm.md (obsolete toxicity documentation)
- Removing reference from _toctree.yml
- Removing research_projects references from example_overview.md
- Removing stack_llama script references from peft_integration.md

All removed documentation referenced deleted research_projects scripts
that no longer exist in the repository.
@qgallouedec
Copy link
Member

Can you please rename the PR "deprecate"

@behroozazarkhalili behroozazarkhalili changed the title Remove unused dataset_formatting module Deprecate unused dataset_formatting module Oct 10, 2025
@behroozazarkhalili
Copy link
Collaborator Author

Can you please rename the PR "deprecate"

Done 👍

Copy link
Member

@qgallouedec qgallouedec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@qgallouedec qgallouedec merged commit 039d526 into main Oct 10, 2025
10 of 12 checks passed
@qgallouedec qgallouedec deleted the remove-dataset-formatting-module branch October 10, 2025 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants