
Conversation


@Vangmay Vangmay commented Nov 18, 2025

Fixes #14

Enables training directly on raw text files without requiring structured datasets. Adds RawTextDataLoader class with intelligent token-aware chunking, support for multiple formats (.txt, .md, .json, .jsonl, .csv), and CLI integration with --raw_text_file flag.

Usage: python unsloth-cli.py --raw_text_file book.txt --chunk_size 1024

Test: python tests/test_raw_text.py
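
The relationship between --chunk_size and --stride can be illustrated with a small standalone sketch (chunk_tokens is a hypothetical helper operating on a plain list of token ids, not the actual RawTextDataLoader API):

```python
def chunk_tokens(tokens, chunk_size, stride):
    """Split tokens into windows of chunk_size, overlapping by stride tokens."""
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        start += chunk_size - stride  # step forward, keeping stride tokens of overlap
    return chunks

# Each chunk shares its last `stride` token(s) with the start of the next chunk.
print(chunk_tokens(list(range(10)), chunk_size=4, stride=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```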

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Vangmay, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the data input capabilities of the training pipeline by enabling direct training on raw text files. It introduces a robust RawTextDataLoader that intelligently handles various file formats and performs token-aware chunking, along with a TextPreprocessor for data hygiene and validation. This change allows users to leverage unstructured text data more flexibly, streamlining the process of preparing diverse datasets for model training.

Highlights

  • Raw Text Training: Introduces the ability to train directly on raw text files, removing the dependency on structured datasets.
  • RawTextDataLoader: Adds a new class for loading and processing raw text, supporting intelligent token-aware chunking with configurable chunk_size and stride.
  • Multi-format Support: The loader can automatically detect and process .txt, .md, .json, .jsonl, and .csv file formats.
  • CLI Integration: New command-line arguments (--raw_text_file, --chunk_size, --stride, --training_mode) are added to unsloth-cli.py for easy access to this feature.
  • Text Preprocessing & Validation: Includes a TextPreprocessor class with utilities for cleaning text, extracting structured sections, adding special tokens, and validating dataset quality.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature for training on raw text files by adding a RawTextDataLoader. The implementation is well-structured, supporting various file formats and integrating with the CLI. However, I've identified a few critical issues that prevent the feature from working as intended. The most significant problem is that the new data loading logic in unsloth-cli.py is defined but never actually called, and the new RawTextDataLoader class is not properly exported, which will lead to an import error. Additionally, there's a new CLI argument that is unused and a notable performance inefficiency in the text chunking process. My review includes detailed feedback and suggestions to address these points.

Comment on lines 81 to 83
# First pass: tokenize the entire text to get accurate token counts
tokenized = self.tokenizer(text, return_tensors="pt", add_special_tokens=False)
tokens = tokenized["input_ids"]
Severity: medium

The current implementation reads and tokenizes the entire file at once. This approach can lead to very high memory consumption for large files (e.g., several gigabytes), potentially causing out-of-memory errors. For better scalability, consider implementing a streaming approach where the file is read and processed in smaller chunks instead of loading everything into memory.
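
A minimal sketch of that streaming idea, using a hypothetical helper with whitespace splitting as a stand-in for a real tokenizer call; memory then scales with chunk_size rather than with file size:

```python
import io

def stream_token_chunks(fileobj, chunk_size, block_chars=65536):
    """Yield fixed-size token chunks without reading the whole file first."""
    buffer = []
    for block in iter(lambda: fileobj.read(block_chars), ""):
        # Stand-in for tokenizer(block); a real implementation must also
        # handle tokens that span block boundaries.
        buffer.extend(block.split())
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:  # flush the final partial chunk
        yield buffer

text = io.StringIO("one two three four five six seven")
print(list(stream_token_chunks(text, chunk_size=3)))
# [['one', 'two', 'three'], ['four', 'five', 'six'], ['seven']]
```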

@danielhanchen (Contributor)

@Vangmay Thanks for the PR, much appreciated! Would it be possible for you to address some of Gemini's comments? Also @djsaunde, could you see if this impacts your CLI changes as well?

@danielhanchen (Contributor)

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import logging
from unsloth import RawTextDataLoader


P1: Importing RawTextDataLoader from package fails

The CLI now imports RawTextDataLoader directly from unsloth, but unsloth/__init__.py never re-exports that class (it only exists in unsloth/dataprep/raw_text.py). Running the script will therefore raise ImportError: cannot import name 'RawTextDataLoader' before any training begins, and the new --raw_text_file flow is unusable until the package exposes the symbol or the import is made module-qualified.


# limitations under the License.

from .synthetic import *
from raw_text import *


P1: Incorrect non-relative import in dataprep package

In the new dataprep package __init__, from raw_text import * attempts to import a top-level module rather than the sibling unsloth.dataprep.raw_text. In a normal install this raises ModuleNotFoundError when importing unsloth.dataprep, so RawTextDataLoader and TextPreprocessor cannot be reached or re-exported.
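
Both P1 findings come down to import wiring. A sketch of the fix, based only on the file paths named in the review (not the merged code):

```python
# unsloth/dataprep/__init__.py -- import the sibling module relatively
from .synthetic import *
from .raw_text import *

# unsloth/__init__.py -- re-export so `from unsloth import RawTextDataLoader`
# in unsloth-cli.py resolves
from .dataprep.raw_text import RawTextDataLoader, TextPreprocessor
```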


Comment on lines +180 to +183
# Move to next chunk with stride overlap
if end_idx == len(tokens):
break
start_idx += chunk_size - stride


P2: Chunking loop can hang when stride ≥ chunk_size

The chunking loop advances start_idx by chunk_size - stride without guarding against a stride equal to or larger than the chunk size. If a caller passes such values (which the CLI flags allow), start_idx never increases, and while start_idx < len(tokens) loops indefinitely on multi-chunk inputs, hanging tokenization.
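
The failure mode is easy to see from the step arithmetic: the advance is chunk_size - stride, so any stride ≥ chunk_size gives a non-positive step. A guard at the top of the chunking routine (names here are illustrative) rules it out:

```python
def safe_step(chunk_size, stride):
    """Validate the stride/chunk_size relationship and return the loop step."""
    step = chunk_size - stride
    if step <= 0:
        raise ValueError(
            f"stride ({stride}) must be smaller than chunk_size ({chunk_size}); "
            f"otherwise start_idx would advance by {step} and never terminate"
        )
    return step

print(safe_step(1024, 128))  # 896
try:
    safe_step(512, 512)
except ValueError as e:
    print("rejected:", e)
```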


@Vangmay (Author) commented Dec 10, 2025

Hello @danielhanchen, apologies for the delay. I have made the edits mentioned by Codex!



Development

Successfully merging this pull request may close this issue: [Feature Request] Raw txt file training (#14)
