
Conversation


@Vangmay Vangmay commented Nov 18, 2025

Fixes #14

Enables training directly on raw text files without requiring structured datasets. Adds RawTextDataLoader class with intelligent token-aware chunking, support for multiple formats (.txt, .md, .json, .jsonl, .csv), and CLI integration with --raw_text_file flag.

Usage: python unsloth-cli.py --raw_text_file book.txt --chunk_size 1024

Test: python tests/test_raw_text.py
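
The relationship between --chunk_size and --stride can be illustrated with a small standalone sketch (chunk_tokens is a hypothetical helper operating on a plain list of token ids, not the actual RawTextDataLoader API):

```python
def chunk_tokens(tokens, chunk_size, stride):
    """Split tokens into windows of chunk_size, overlapping by stride tokens."""
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        start += chunk_size - stride  # step forward, keeping stride tokens of overlap
    return chunks

# Each chunk shares its last `stride` token(s) with the start of the next chunk.
print(chunk_tokens(list(range(10)), chunk_size=4, stride=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```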

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Vangmay, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the data input capabilities of the training pipeline by enabling direct training on raw text files. It introduces a robust RawTextDataLoader that intelligently handles various file formats and performs token-aware chunking, along with a TextPreprocessor for data hygiene and validation. This change allows users to leverage unstructured text data more flexibly, streamlining the process of preparing diverse datasets for model training.

Highlights

  • Raw Text Training: Introduces the ability to train directly on raw text files, removing the dependency on structured datasets.
  • RawTextDataLoader: Adds a new class for loading and processing raw text, supporting intelligent token-aware chunking with configurable chunk_size and stride.
  • Multi-format Support: The loader can automatically detect and process .txt, .md, .json, .jsonl, and .csv file formats.
  • CLI Integration: New command-line arguments (--raw_text_file, --chunk_size, --stride, --training_mode) are added to unsloth-cli.py for easy access to this feature.
  • Text Preprocessing & Validation: Includes a TextPreprocessor class with utilities for cleaning text, extracting structured sections, adding special tokens, and validating dataset quality.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature for training on raw text files by adding a RawTextDataLoader. The implementation is well-structured, supporting various file formats and integrating with the CLI. However, I've identified a few critical issues that prevent the feature from working as intended. The most significant problem is that the new data loading logic in unsloth-cli.py is defined but never actually called, and the new RawTextDataLoader class is not properly exported, which will lead to an import error. Additionally, there's a new CLI argument that is unused and a notable performance inefficiency in the text chunking process. My review includes detailed feedback and suggestions to address these points.

Comment on lines 81 to 83
# First pass: tokenize the entire text to get accurate token counts
tokenized = self.tokenizer(text, return_tensors="pt", add_special_tokens=False)
tokens = tokenized["input_ids"]
Severity: medium

The current implementation reads and tokenizes the entire file at once. This approach can lead to very high memory consumption for large files (e.g., several gigabytes), potentially causing out-of-memory errors. For better scalability, consider implementing a streaming approach where the file is read and processed in smaller chunks instead of loading everything into memory.
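
A minimal sketch of that streaming idea, using a hypothetical helper with whitespace splitting as a stand-in for a real tokenizer call; memory then scales with chunk_size rather than with file size:

```python
import io

def stream_token_chunks(fileobj, chunk_size, block_chars=65536):
    """Yield fixed-size token chunks without reading the whole file first."""
    buffer = []
    for block in iter(lambda: fileobj.read(block_chars), ""):
        # Stand-in for tokenizer(block); a real implementation must also
        # handle tokens that span block boundaries.
        buffer.extend(block.split())
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:  # flush the final partial chunk
        yield buffer

text = io.StringIO("one two three four five six seven")
print(list(stream_token_chunks(text, chunk_size=3)))
# [['one', 'two', 'three'], ['four', 'five', 'six'], ['seven']]
```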

@danielhanchen (Contributor)

@Vangmay Thanks for the PR, much appreciated! Would it be possible for you to address some of Gemini's comments? Also @djsaunde, could you see if this impacts your CLI changes as well?

@danielhanchen (Contributor)

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import logging
from unsloth import RawTextDataLoader


P1: Importing RawTextDataLoader from package fails

The CLI now imports RawTextDataLoader directly from unsloth, but unsloth/__init__.py never re-exports that class (it only exists in unsloth/dataprep/raw_text.py). Running the script will therefore raise ImportError: cannot import name 'RawTextDataLoader' before any training begins, and the new --raw_text_file flow is unusable until the package exposes the symbol or the import is made module-qualified.


# limitations under the License.

from .synthetic import *
from raw_text import *


P1: Incorrect non-relative import in dataprep package

In the new dataprep package __init__, from raw_text import * attempts to import a top-level module rather than the sibling unsloth.dataprep.raw_text. In a normal install this raises ModuleNotFoundError when importing unsloth.dataprep, so RawTextDataLoader and TextPreprocessor cannot be reached or re-exported.
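
Both P1 findings come down to import wiring. A sketch of the fix, based only on the file paths named in the review (not the merged code):

```python
# unsloth/dataprep/__init__.py -- import the sibling module relatively
from .synthetic import *
from .raw_text import *

# unsloth/__init__.py -- re-export so `from unsloth import RawTextDataLoader`
# in unsloth-cli.py resolves
from .dataprep.raw_text import RawTextDataLoader, TextPreprocessor
```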


Comment on lines +180 to +183
# Move to next chunk with stride overlap
if end_idx == len(tokens):
break
start_idx += chunk_size - stride


P2: Chunking loop can hang when stride ≥ chunk_size

The chunking loop advances start_idx by chunk_size - stride without guarding against a stride equal to or larger than the chunk size. If a caller passes such values (which the CLI flags allow), start_idx never increases, and while start_idx < len(tokens) loops indefinitely on multi-chunk inputs, hanging tokenization.
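
The failure mode is easy to see from the step arithmetic: the advance is chunk_size - stride, so any stride ≥ chunk_size gives a non-positive step. A guard at the top of the chunking routine (names here are illustrative) rules it out:

```python
def safe_step(chunk_size, stride):
    """Validate the stride/chunk_size relationship and return the loop step."""
    step = chunk_size - stride
    if step <= 0:
        raise ValueError(
            f"stride ({stride}) must be smaller than chunk_size ({chunk_size}); "
            f"otherwise start_idx would advance by {step} and never terminate"
        )
    return step

print(safe_step(1024, 128))  # 896
try:
    safe_step(512, 512)
except ValueError as e:
    print("rejected:", e)
```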


@Vangmay (Author) commented Dec 10, 2025

Hello @danielhanchen, apologies for the delay. I have made the edits mentioned by Codex!



Development

Successfully merging this pull request may close this issue: [Feature Request] Raw txt file training (#14)
