Exposes API for processing pretraining data #672
Conversation
Walkthrough
Added continual pretraining support: a new PretrainingConfig and related TrainingArgs field, a document-to-JSONL pretraining processor, PretrainingBlockDataset and a loader branch for fixed-size block sampling, CLI/training plumbing for block-size and document-column-name, a public export of PretrainingConfig, and unit tests for the pretraining flow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant CLI as CLI / Orchestrator
    participant Main as main_ds.train
    participant Proc as data_process.process_documents_for_pretraining
    participant Loader as sampler.get_data_loader
    participant Trainer as DistributedTrainer
    Note over Main: start training orchestration
    CLI->>Main: parse args (may include --block-size, --document-column-name)
    Main->>Main: set train_args.pretraining_config (optional)
    alt pretraining_config provided
        Main->>Proc: process_documents_for_pretraining(data_path, out_path, model_path, ...)
        Proc-->>Main: writes data.jsonl (tokenized documents)
        Main->>Loader: get_data_loader(..., pretraining_config)
        Loader-->>Main: returns DataLoader(PretrainingBlockDataset)
        Main->>Trainer: assemble training command with --block-size & --document-column-name
    else standard training
        Main->>Proc: process_data(...) (existing pipeline)
        Main->>Loader: get_data_loader(..., pretraining_config=None)
        Loader-->>Main: returns DataLoader(TokenDataset)
        Main->>Trainer: assemble standard training command
    end
    Main->>Trainer: launch distributed training
    Trainer-->>Main: monitor/exit
```
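For readers wiring this up from Python rather than the CLI, a minimal usage sketch of the flow above might look like the following. It assumes only the public API summarized in this walkthrough (`PretrainingConfig` and the `TrainingArgs.pretraining_config` field); the remaining `TrainingArgs` fields and all paths are illustrative, not taken from this PR.

```python
# Hedged sketch based on the walkthrough above; paths and the non-pretraining
# TrainingArgs fields are illustrative, not copied from this PR.
from instructlab.training import PretrainingConfig, TrainingArgs

train_args = TrainingArgs(
    model_path="/models/my-base-model",       # tokenizer + weights to continue pretraining
    data_path="/data/documents.jsonl",        # rows shaped like {"document": "raw text"}
    data_output_dir="/data/processed",        # tokenized data.jsonl lands here
    ckpt_output_dir="/checkpoints",
    # Opting in to continual pretraining: documents are tokenized, concatenated,
    # and served as fixed-size blocks instead of chat-templated samples.
    pretraining_config=PretrainingConfig(
        block_size=4096,                      # required: tokens per training block
        document_column_name="document",      # optional: defaults to "document"
    ),
)
# run_training(torchrun_args, train_args) would then follow the
# "pretraining_config provided" branch in the diagram above.
```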
```mermaid
sequenceDiagram
    autonumber
    participant Caller as process_documents_for_pretraining()
    participant FS as FileSystem
    participant Tok as AutoTokenizer
    Note over Caller: tokenization -> jsonl output per document
    Caller->>FS: open/read JSON dataset
    FS-->>Caller: dataset
    Caller->>Tok: AutoTokenizer.from_pretrained(model_path)
    Tok-->>Caller: tokenizer (must provide eos_token_id)
    loop per document
        Caller->>Tok: tokenizer(doc, add_special_tokens=True, return_dict=False)
        Tok-->>Caller: input_ids
        Caller->>Caller: append EOS if missing
        Caller->>FS: append {"input_ids": [...], "len": N} to data.jsonl
    end
    Caller->>FS: close/write complete
    Caller-->>Caller: log totals and average token length
```
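In plain Python, the per-document loop in this diagram reduces to roughly the following. This is a simplified sketch using the Hugging Face AutoTokenizer, not the PR's actual implementation; the helper name and file handling are illustrative.

```python
# Simplified sketch of the diagrammed flow; helper name and I/O details are illustrative.
import json
from transformers import AutoTokenizer

def tokenize_documents_to_jsonl(documents: list[str], model_path: str, out_file: str) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if tokenizer.eos_token_id is None:
        raise ValueError("tokenizer must define an EOS token for pretraining")

    total_tokens = 0
    with open(out_file, "w", encoding="utf-8") as f:
        for doc in documents:
            input_ids = tokenizer.encode(doc, add_special_tokens=True)
            # append EOS only if the tokenizer did not already add one
            if not input_ids or input_ids[-1] != tokenizer.eos_token_id:
                input_ids.append(tokenizer.eos_token_id)
            f.write(json.dumps({"input_ids": input_ids, "len": len(input_ids)}) + "\n")
            total_tokens += len(input_ids)

    print(f"wrote {len(documents)} documents, {total_tokens} tokens, "
          f"avg {total_tokens / max(len(documents), 1):.1f} tokens/doc")
```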
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/instructlab/training/data_process.py (2)
1202-1204: Prefer lazy formatting in logger calls. Using f-strings in `logger.info()` evaluates the string even when the log level is disabled. Use lazy formatting for better performance.

```diff
- logger.info(f"Processed {len(tokenized_data):,} documents")
- logger.info(f"Total tokens: {total_tokens:,}")
- logger.info(f"Average tokens per document: {avg_tokens:.1f}")
+ logger.info("Processed %s documents", f"{len(tokenized_data):,}")
+ logger.info("Total tokens: %s", f"{total_tokens:,}")
+ logger.info("Average tokens per document: %.1f", avg_tokens)
```
1217-1218: Same lazy formatting suggestion for these logger calls.

```diff
- logger.info(f"Saved tokenized documents to {output_file}")
+ logger.info("Saved tokenized documents to %s", output_file)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/instructlab/training/config.py (1 hunks)
- src/instructlab/training/data_process.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (4)
src/instructlab/training/config.py (2)
66-70: LGTM! The `is_pretraining` field is well-defined with a clear description that explains its purpose.
75-80: LGTM! The validator correctly enforces mutual exclusivity between `is_pretraining` mode and `chat_tmpl_path`, which aligns with the pretraining workflow that tokenizes raw documents without chat templates.
src/instructlab/training/data_process.py (2)
1156-1179: LGTM! Good defensive programming with proper validation for write permissions, dataset format, field requirements, and tokenizer EOS token.
1136-1141: Verify integration: `is_pretraining` flag not wired to this function. The `DataProcessArgs.is_pretraining` flag was added in `config.py`, but neither `main()` nor `process_data()` in this file dispatches to `process_documents_for_pretraining` based on that flag. Users would need to call this function directly. If this is intentional (exposing API only), the approach is fine. Otherwise, consider adding integration in `process_data()`:

```python
def process_data(..., is_pretraining: bool = False):
    if is_pretraining:
        process_documents_for_pretraining(
            data_path=data_path,
            data_output_path=data_output_path,
            model_path=model_path,
            num_cpu_procs=num_cpu_procs,
        )
    elif chat_tmpl_path:
        # existing legacy path
        ...
```
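If the API-only approach is kept, calling the new function directly would look roughly like this; the paths below are placeholders, and the parameter names follow the signature and docstring quoted in this review.

```python
# Sketch of calling the exposed API directly; paths are illustrative.
from instructlab.training.data_process import process_documents_for_pretraining

process_documents_for_pretraining(
    data_path="/data/documents.jsonl",     # rows shaped like {"document": "raw text"}
    data_output_path="/data/processed",    # tokenized data.jsonl is written here
    model_path="/models/my-base-model",    # source of the tokenizer
    num_cpu_procs=8,
    document_column_name="document",       # default per the docstring below
)
```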
```python
    Pattern: Each document → [BOS][tokens][EOS]

    Args:
        data_path: Path to input JSONL with {"documents": "text"} format
```
Fix documentation: field name should be document (singular).
The docstring says {"documents": "text"} but Line 1170 validates for 'document' (singular). This inconsistency will confuse API users.
```diff
- data_path: Path to input JSONL with {"documents": "text"} format
+ data_path: Path to input JSONL with {"document": "text"} format
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In src/instructlab/training/data_process.py around line 1151 (and nearby
docstring area), the docstring incorrectly documents the input JSONL as
{"documents": "text"} while the code validates for the key 'document'
(singular); update the docstring to {"document": "text"} to match the validation
(or alternatively change the validation to accept 'documents' if you intend
plural), and ensure both the docstring and validation use the same field name so
API users are not confused.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/instructlab/training/data_process.py (1)
1152-1152: Fix documentation: field name should be `document` (singular). The docstring says `{"documents": "text"}` but the code validates for `'document'` (singular) at Line 1171 and the parameter defaults to `"document"`. This inconsistency will confuse API users.

```diff
- data_path: Path to input JSONL with {"documents": "text"} format
+ data_path: Path to input JSONL with {"document": "text"} format
```
🧹 Nitpick comments (1)
src/instructlab/training/data_process.py (1)
1176-1180: Consider adding error handling for tokenizer loading and validating BOS token. For consistency with the dataset loading pattern (lines 1161-1166), consider wrapping the tokenizer loading in a try/except. Additionally, since the docstring at Line 1149 documents that documents follow the `[BOS][tokens][EOS]` pattern, consider validating that `tokenizer.bos_token_id` exists, similar to the EOS validation.

```diff
 logger.info("Loading tokenizer from %s", model_path)
-tokenizer = AutoTokenizer.from_pretrained(model_path)
+try:
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+except Exception as e:
+    raise ValueError(
+        f"Failed to load tokenizer from {model_path}"
+    ) from e
+
+if tokenizer.bos_token_id is None:
+    logger.warning("Tokenizer does not have a BOS token defined")
+
 if tokenizer.eos_token_id is None:
     raise ValueError("Tokenizer must have an EOS token defined for pretraining")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/instructlab/training/config.py (1 hunks)
- src/instructlab/training/data_process.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/instructlab/training/config.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/data_process.py (1)
1190-1192: Good defensive guard against duplicate EOS tokens. The guard correctly prevents duplicate EOS tokens, addressing the concern from the previous review. This handles tokenizers that may or may not automatically append EOS with `add_special_tokens=True`.
```python
def tokenize_document(sample):
    input_ids = tokenizer.encode(
        sample[document_column_name], add_special_tokens=True
    )

    # ensures eos token is present without double-adding it.
    if input_ids[-1] != tokenizer.eos_token_id:
        input_ids.append(tokenizer.eos_token_id)

    return {
        "input_ids": input_ids,
        "len": len(input_ids),
    }
```
🛠️ Refactor suggestion | 🟠 Major
Add validation for empty or None documents.
The code doesn't validate that sample[document_column_name] is non-empty before tokenization. Empty strings would produce only [BOS][EOS] tokens, and None values would cause tokenizer.encode() to fail.
Apply this diff to add validation:
```diff
 def tokenize_document(sample):
+    doc_text = sample[document_column_name]
+    if not doc_text or not doc_text.strip():
+        raise ValueError(f"Document at column '{document_column_name}' is empty or None")
+
     input_ids = tokenizer.encode(
-        sample[document_column_name], add_special_tokens=True
+        doc_text, add_special_tokens=True
     )
```

🤖 Prompt for AI Agents
In src/instructlab/training/data_process.py around lines 1185 to 1197, the
tokenize_document function does not validate sample[document_column_name] before
calling tokenizer.encode which will fail on None and produce invalid output for
empty strings; update the function to first retrieve the field into a local
variable, check that it is not None and that str(value).strip() is non-empty,
and if the check fails raise a descriptive ValueError (or return/skip per
surrounding flow) indicating the missing/empty document and include an
identifier if available; only then call tokenizer.encode on the validated string
and proceed with the existing EOS-token handling and returned dict.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/instructlab/training/config.py (1)
66-75: Pretraining config fields look sound; consider minor naming and doc tightening. The new `is_pretraining` and `pretraining_column_name` fields are backward‑compatible and clearly scoped to pretraining, which is good. A couple of small polish ideas:

- The naming between config (`pretraining_column_name`) and the pretraining helper (`document_column_name` in `process_documents_for_pretraining(...)` per the summary) is slightly inconsistent. Consider aligning them (e.g., `document_column_name` everywhere) to reduce cognitive overhead when wiring configs to function parameters.
- You may want the `pretraining_column_name` description to explicitly mention the default and mode dependency, e.g., "Name of the dataset column containing raw documents to pretrain on (default: `document`). Used only when `is_pretraining` is True."

These are non‑blocking but would make the API a bit clearer for users and future maintainers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/instructlab/training/config.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/instructlab/training/data_process.py (2)
1163-1177: Validate document contents (None/empty/non‑string) before tokenization.
`tokenize_document` assumes `sample[document_column_name]` is a non‑empty string. If a row has `None`, a non‑string, or an empty/whitespace‑only value, `tokenizer.encode(...)` and especially `input_ids[-1]` can fail or produce meaningless sequences. Consider validating and failing fast with a clear error:

```diff
-    def tokenize_document(sample):
-        input_ids = tokenizer.encode(
-            sample[document_column_name], add_special_tokens=True
-        )
+    def tokenize_document(sample):
+        doc_text = sample[document_column_name]
+        if doc_text is None or not isinstance(doc_text, str) or not doc_text.strip():
+            raise ValueError(
+                f"Document in column '{document_column_name}' is missing, non-string, or empty"
+            )
+
+        input_ids = tokenizer.encode(doc_text, add_special_tokens=True)
@@
-        # ensures eos token is present without double-adding it.
-        if input_ids[-1] != tokenizer.eos_token_id:
+        # ensures eos token is present without double-adding it.
+        if input_ids and input_ids[-1] != tokenizer.eos_token_id:
             input_ids.append(tokenizer.eos_token_id)
```

This makes the API more robust to dirty data and avoids opaque tokenizer/indexing errors during large runs.
Also applies to: 1186-1199
1145-1159: Fix docstring to match `document_column_name` and actual schema. The docstring currently documents `{"documents": "text"}`, but the function defaults `document_column_name="document"` and later validates that this exact field exists. Recommend aligning the docstring with the actual API:

```diff
 Args:
-    data_path: Path to input JSONL with {"documents": "text"} format
+    data_path: Path to input JSONL with {"document": "text"} format
     data_output_path: Directory for processed data output
     model_path: Path to model/tokenizer
     num_cpu_procs: Number of parallel processes
     document_column_name: Name of the column containing the documents
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/instructlab/training/data_process.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: pylint
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/data_process.py (1)
691-693: Return type verification confirms the code change is correct. The web search confirms that `tokenizer.apply_chat_template(..., return_dict=False)` with the default `tokenize=True` returns a `list[int]` (or `list[list[int]]` for batched inputs) rather than a dictionary. This flat token sequence is exactly what's needed for the subsequent indexing and region-masking logic that follows in the unmasking implementation.
```diff
 data_with_input_ids = data.map(
     lambda x: {
-        "input_ids": tokenizer.apply_chat_template(x["messages"], tokenize=True),
+        # newer versions of transformers have `return_dict=True` by default
+        "input_ids": tokenizer.apply_chat_template(x["messages"], tokenize=True, return_dict=False),
         "unmask": bool(x["unmask"]) if "unmask" in x else False,
     },
```
🧩 Analysis chain
transformers 4.55.0 PreTrainedTokenizer apply_chat_template default return_dict behavior
<web_search_result>
Short answer: In transformers 4.55.0, PreTrainedTokenizer.apply_chat_template defaults return_dict=False. When you set return_dict=True the method returns the usual tokenizer output dict (input_ids, attention_mask, pixel_values, etc. where applicable); with return_dict=False it returns the token id list (or token array) ready for the model. [1][2][3]
Sources
- Transformers tokenizer docs (apply_chat_template signature & param default return_dict=False). [1]
- Chat-template docs / internal tokenizer docs describing return_dict behavior and pixel_values handling. [2]
- v4.55.0 examples showing usage with return_dict=True. [3]
</web_search_result>
Script executed:
```bash
#!/bin/bash
# Search for other usages of apply_chat_template in the codebase to understand context
rg -n "apply_chat_template" --type=py -A2 -B2
```

Repository: instructlab/training
<shell_result>
Repository: instructlab/training
Script execution failed with unrecoverable error: [CAPABILITY:RUN_SHELL_SCRIPT] Failed to clone repository instructlab/training: Clone operation failed
</shell_result>
transformers apply_chat_template return_dict default history earlier versions before 4.55.0
<web_search_result>
Short answer: Historically apply_chat_template defaulted to tokenize=True and return_dict=False (so it usually returned token ids or a string). Over a series of changes (notably PR/commit referenced as #36149) callers from Processor/ProcessorMixin started invoking tokenizer.apply_chat_template with tokenize=False and return_dict=False, which effectively forced non-tokenized string output and made options that require tokenization/return_dict (e.g. return_assistant_tokens_mask=True) unusable. See the changelog/issues and docs for details. [1][2][3]
Sources
- Issue describing the behavior change and how #36149 hard-codes tokenize=False and return_dict=False (blocks return_assistant_tokens_mask). [1]
- apply_chat_template signature/docs showing return_dict default and behavior (return_dict=False by default; return_assistant_tokens_mask requires return_dict=True/tokenize). [2]
- Hugging Face chat-templating docs / examples describing apply_chat_template usage and generation-mask behavior. [3]
</web_search_result>
Clarify the rationale: return_dict=False is already the default, not a future-proofing measure.
The inline comment suggests that "newer versions of transformers have return_dict=True by default," but apply_chat_template defaults to return_dict=False in transformers 4.55.0 and historically has done so. The explicit parameter here is redundant but harmless.
If this is intended as defensive coding against hypothetical future changes, consider clarifying or removing the comment to avoid misleading future readers. If the intent is to document current behavior, the comment should state that the parameter preserves the existing default rather than implying a version-specific workaround.
🤖 Prompt for AI Agents
In src/instructlab/training/data_process.py around lines 413 to 418, the inline
comment claiming "newer versions of transformers have `return_dict=True` by
default" is misleading; update the comment to reflect the actual intent: either
remove the comment entirely (since `return_dict=False` is the current default
for apply_chat_template) or replace it with a concise note like "explicitly set
return_dict=False to preserve current behavior" so readers understand this is
preserving existing behavior rather than referencing a specific version change.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/instructlab/training/config.py (1)
80-88: Add "public API" comment for consistency.Other public config classes in this file (e.g.,
DataProcessArgs,TorchrunArgs,LoraOptions,TrainingArgs) are marked with# public APIcomments. Consider adding this marker toPretrainingConfigfor consistency since it's exposed viaTrainingArgs.+# public API class PretrainingConfig(BaseModel): """ Configuration for pretraining mode. """src/instructlab/training/sampler.py (1)
382-387: Clarify `get_lengths()` vs `__getitem__["len"]` semantics.
`get_lengths()` returns the actual token count for the last block (`last_block_len`), while `__getitem__` always returns `"len": self.block_size` (the padded length). This inconsistency may be intentional:

- `"len"` in samples → padded size for batch allocation
- `get_lengths()` → actual tokens for statistics

Consider adding a docstring to
`get_lengths()` clarifying it returns actual token counts, not padded lengths, to prevent confusion.

```diff
 def get_lengths(self) -> np.ndarray:
+    """Return actual token counts per block (not padded lengths)."""
     lengths = np.full(self.num_blocks, self.block_size, dtype=np.int64)
     if self.num_blocks and self.last_block_len != self.block_size:
         lengths[-1] = self.last_block_len
     return lengths
```

tests/unit/test_pretraining_mode.py (1)
102-153: Consider adding edge case tests. The test validates the happy path but could benefit from additional edge cases (sketched below):

- Empty document handling
- Documents with existing EOS tokens (to verify no double-adding)
- Custom
  `document_column_name` parameter

These can be added in a follow-up PR if desired.
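If they are added later, they might look roughly like this. The stub tokenizer fixture and monkeypatched `AutoTokenizer.from_pretrained` mirror the approach this test file already uses, but the fixture names, the `out/data.jsonl` layout, and the empty-document `ValueError` are assumptions based on suggestions earlier in this review rather than current behavior.

```python
# Hypothetical follow-up tests; fixture names, the out/data.jsonl layout, and the
# empty-document ValueError are assumptions based on this review, not current code.
import json
import pytest

from instructlab.training.data_process import process_documents_for_pretraining


def _patch_tokenizer(monkeypatch, stub_tokenizer):
    monkeypatch.setattr(
        "instructlab.training.data_process.AutoTokenizer.from_pretrained",
        lambda *_a, **_k: stub_tokenizer,
    )


def test_empty_document_is_rejected(tmp_path, monkeypatch, stub_tokenizer):
    _patch_tokenizer(monkeypatch, stub_tokenizer)
    data_file = tmp_path / "docs.jsonl"
    data_file.write_text(json.dumps({"document": ""}) + "\n")
    with pytest.raises(ValueError):  # assumes the empty-document validation is adopted
        process_documents_for_pretraining(
            data_path=str(data_file),
            data_output_path=str(tmp_path / "out"),
            model_path="stub-model",
            num_cpu_procs=1,
        )


def test_custom_document_column_name(tmp_path, monkeypatch, stub_tokenizer):
    _patch_tokenizer(monkeypatch, stub_tokenizer)
    data_file = tmp_path / "docs.jsonl"
    data_file.write_text(json.dumps({"text": "hello world"}) + "\n")
    process_documents_for_pretraining(
        data_path=str(data_file),
        data_output_path=str(tmp_path / "out"),
        model_path="stub-model",
        num_cpu_procs=1,
        document_column_name="text",
    )
    rows = [json.loads(line) for line in (tmp_path / "out" / "data.jsonl").read_text().splitlines()]
    assert rows and rows[0]["input_ids"][-1] == stub_tokenizer.eos_token_id
```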
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/instructlab/training/config.py (2 hunks)
- src/instructlab/training/main_ds.py (7 hunks)
- src/instructlab/training/sampler.py (5 hunks)
- tests/unit/test_pretraining_mode.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/unit/test_pretraining_mode.py (3)
src/instructlab/training/config.py (1)
PretrainingConfig(80-87)src/instructlab/training/data_process.py (1)
process_documents_for_pretraining(1138-1224)src/instructlab/training/sampler.py (4)
PretrainingBlockDataset(298-386)get_data_loader(433-496)get_lengths(382-386)get_lengths(429-430)
src/instructlab/training/sampler.py (1)
src/instructlab/training/config.py (1)
PretrainingConfig(80-87)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/config.py (1)
PretrainingConfig(80-87)src/instructlab/training/data_process.py (2)
process_documents_for_pretraining(1138-1224)process_data(1452-1495)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (15)
src/instructlab/training/config.py (2)
66-75: LGTM!The new pretraining fields are well-defined with sensible defaults and clear descriptions. The
is_pretrainingflag and configurable column name provide good flexibility for the pretraining workflow.
277-283: LGTM!The
pretraining_configfield is correctly defined asOptionalwith aNonedefault, allowing pretraining to be an opt-in feature. The description clearly explains the purpose.src/instructlab/training/sampler.py (5)
4-22: LGTM!The new imports are appropriately organized and the logger is correctly initialized at module level.
340-370: Verify labels alignment for causal LM training.For full blocks (lines 361-363),
labelsis set identical toinput_ids. This pattern relies on the model internally shifting labels during loss computation (common in HuggingFace causal LM models). Confirm this is the expected behavior for the models used with this training library.Additionally,
loss_tokens = self.block_size - 1correctly accounts for the first token having no prediction target.
372-380: LGTM!The
from_jsonl_filefactory method correctly loads the dataset and delegates to the constructor.
444-474: LGTM!The
get_data_loaderfunction correctly branches betweenPretrainingBlockDatasetandTokenDatasetbased on the presence ofpretraining_config. The logging provides helpful context about the pretraining configuration.
298-338: I encountered a repository clone failure, which prevents me from verifying the codebase patterns. However, I can analyze the review comment based on the provided code snippet and context.Review comment analysis:
The review comment raises a valid technical concern:
- The code does concatenate all
input_idsinto a single Python list (lines 312-317)- This is indeed memory-intensive for large datasets (billions of tokens)
- The comment appropriately acknowledges this is acceptable for manageable sizes
- It flags this as an optional refactor for future iteration
The concern is technically sound and the suggestion is reasonable. However, without access to the full codebase, I cannot verify:
- Whether existing patterns for memory-mapped arrays or lazy loading are already implemented elsewhere
- What the expected dataset size ranges are in practice
- Whether this concern has already been addressed in other parts of the codebase
Since verification cannot be completed through automated checks, the review comment should be flagged for manual verification.
Memory usage consideration for large datasets.
The implementation concatenates all
`input_ids` into a single Python list in memory (lines 312-317). For very large pretraining corpora (billions of tokens), this could lead to high memory consumption during dataset initialization. For now, this is acceptable if the expected dataset sizes are manageable. If memory becomes an issue, consider lazy loading with memory-mapped arrays or chunked processing in a future iteration (an illustrative sketch follows).
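One possible direction, should that ever become necessary, is to materialize the concatenated token stream once as a memory-mapped array and index blocks out of it lazily. The class below is only an illustration of the idea, with an assumed `.npy` token file; it is not code from the PR and drops the ragged final block for simplicity.

```python
# Illustrative only: lazy block access over a memory-mapped token stream.
import numpy as np
from torch.utils.data import Dataset


class MemmapBlockDataset(Dataset):
    """Serve fixed-size blocks from tokens stored in a .npy file without loading them all."""

    def __init__(self, token_file: str, block_size: int):
        self.tokens = np.load(token_file, mmap_mode="r")  # stays on disk
        self.block_size = block_size
        self.num_blocks = len(self.tokens) // block_size  # drop the ragged tail for simplicity

    def __len__(self) -> int:
        return self.num_blocks

    def __getitem__(self, idx: int) -> dict:
        start = idx * self.block_size
        block = np.asarray(self.tokens[start : start + self.block_size], dtype=np.int64)
        return {"input_ids": block, "labels": block.copy(), "len": self.block_size}
```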
tests/unit/test_pretraining_mode.py (2)
24-57: LGTM!Good test coverage for
PretrainingBlockDataset. The test correctly validates:
- Block creation with concatenated tokens
- Padding behavior for the partial last block
- Loss token counting for full vs partial blocks
get_lengths()returning actual token counts
59-100: LGTM!Comprehensive integration test for the pretraining data loader. The test correctly validates the packed batch structure, sample counts, and loss token aggregation.
src/instructlab/training/main_ds.py (6)
51-52: LGTM!The
PretrainingConfigimport is correctly added alongside related config imports.
366-378: LGTM!The
pretraining_configis correctly passed toget_data_loaderusing defensivegetattraccess, which handles both CLI-invoked and programmatic calls gracefully.
542-545: LGTM!The
--block-sizeflag is correctly appended to the subprocess command when pretraining mode is enabled.
788-793: LGTM!The
--block-sizeCLI argument is well-defined with a clear help message explaining its purpose.
842-845: LGTM!The CLI argument is correctly parsed into a
PretrainingConfiginstance when provided, maintaining the same pattern as other config objects.
470-489: I apologize for the technical difficulties with repository access. I'm unable to execute verification scripts due to consistent cloning failures. However, based on the information provided in the review comment itself, I can provide a rewritten version:Document column name parameter is not exposed through
TrainingArgs.The call to
process_documents_for_pretrainingdoesn't pass thedocument_column_nameparameter (which defaults to"document"), and this parameter is not configurable throughTrainingArgs. This prevents users ofrun_training()from customizing the column name for pretraining workflows.Consider either:
- Adding
pretraining_document_column_nametoPretrainingConfigorTrainingArgsto allow customization, or- Documenting this as an intentional limitation if simplicity of the API is preferred.
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/instructlab/training/data_process.py (3)
693-694: Same misleading comment aboutreturn_dictdefault.This is the same issue as at line 415. Consider updating or removing the comment.
1155-1156: Docstring field name mismatch: should bedocument(singular).The docstring says
{"documents": "text"}but the code validates for'document'(thedocument_column_namedefault). This was flagged in a prior review and remains unfixed.- data_path: Path to input JSONL with {"documents": "text"} format + data_path: Path to input JSONL with {"document": "text"} format
1189-1201: Missing validation for empty or None documents.The
tokenize_documentfunction doesn't validate thatsample[document_column_name]is non-empty before tokenization. Empty strings produce minimal[BOS][EOS]tokens, andNonevalues will causetokenizer.encode()to fail. This was flagged in a prior review.def tokenize_document(sample): + doc_text = sample[document_column_name] + if not doc_text or not isinstance(doc_text, str) or not doc_text.strip(): + raise ValueError( + f"Document at column '{document_column_name}' is empty or invalid" + ) + input_ids = tokenizer.encode( - sample[document_column_name], add_special_tokens=True + doc_text, add_special_tokens=True )
🧹 Nitpick comments (5)
src/instructlab/training/data_process.py (3)
415-418: Comment is misleading:return_dict=Falseis already the default.The comment states newer transformers versions have
return_dict=Trueby default, butapply_chat_templatehas always defaulted toreturn_dict=False. The explicit parameter is harmless but the comment is incorrect.- # newer versions of transformers have `return_dict=True` by default + # explicitly set return_dict=False to ensure we get a list of token ids "input_ids": tokenizer.apply_chat_template( x["messages"], tokenize=True, return_dict=False ),
1210-1215: Use lazy string formatting for logger calls.The f-string formatting happens unconditionally even if the log level would suppress the message. Use
%-style formatting for performance.- logger.info(f"Processed {len(tokenized_data):,} documents") - logger.info(f"Total tokens: {total_tokens:,}") - logger.info(f"Average tokens per document: {avg_tokens:.1f}") + logger.info("Processed %s documents", f"{len(tokenized_data):,}") + logger.info("Total tokens: %s", f"{total_tokens:,}") + logger.info("Average tokens per document: %.1f", avg_tokens)
1225-1226: Same f-string formatting in logger calls.- logger.info(f"Saved tokenized documents to {output_file}") + logger.info("Saved tokenized documents to %s", output_file)tests/unit/test_pretraining_mode.py (1)
112-115: StubTokenizer.encode logic may be inverted.The stub returns
basewhenadd_special_tokens=Trueandbase[1:]whenFalse. This implies BOS is always at index 0, but for a stub that doesn't actually prepend a BOS token (it just uses char ordinals), this is semantically inconsistent. However, since the test just validates the pipeline flow rather than tokenizer correctness, this is acceptable for test purposes.Consider adding a comment clarifying the stub behavior:
def encode(self, text, add_special_tokens=True): + # Stub: simulate BOS by including first char only when add_special_tokens=True base = [ord(ch) % 50 + 1 for ch in text] return base if add_special_tokens else base[1:]src/instructlab/training/sampler.py (1)
375-383: Consider adding error handling for better user experience.The
from_jsonl_filemethod lacks explicit error handling. If the file is missing or malformed, users will receive raw exceptions fromload_dataset. Consider adding try/except blocks with clearer error messages.Example:
@classmethod def from_jsonl_file( cls, data_path: str, block_size: int, pad_token_id: int, ) -> "PretrainingBlockDataset": + try: dataset = load_dataset("json", data_files=data_path, split="train") + except FileNotFoundError: + raise FileNotFoundError(f"Pretraining data file not found: {data_path}") + except Exception as e: + raise ValueError(f"Failed to load pretraining data from {data_path}: {e}") return cls(dataset, block_size, pad_token_id)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/instructlab/training/__init__.py(2 hunks)src/instructlab/training/config.py(2 hunks)src/instructlab/training/data_process.py(3 hunks)src/instructlab/training/main_ds.py(7 hunks)src/instructlab/training/sampler.py(5 hunks)tests/unit/test_pretraining_mode.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
src/instructlab/training/__init__.py (1)
src/instructlab/training/config.py (1)
PretrainingConfig(80-91)
tests/unit/test_pretraining_mode.py (3)
src/instructlab/training/config.py (1)
PretrainingConfig(80-91)src/instructlab/training/data_process.py (2)
process_documents_for_pretraining(1140-1226)main(1500-1517)src/instructlab/training/sampler.py (4)
PretrainingBlockDataset(299-389)get_data_loader(436-499)get_lengths(385-389)get_lengths(432-433)
src/instructlab/training/sampler.py (3)
src/instructlab/training/config.py (1)
PretrainingConfig(80-91)src/instructlab/training/padded_batch_packer.py (1)
batch_lengths_to_minibatches_padded(200-244)src/instructlab/training/type_definitions.py (1)
CollatedItem(49-61)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/config.py (1)
PretrainingConfig(80-91)src/instructlab/training/data_process.py (2)
process_documents_for_pretraining(1140-1226)process_data(1454-1497)
src/instructlab/training/data_process.py (1)
tests/unit/test_pretraining_mode.py (1)
encode(113-115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (19)
src/instructlab/training/data_process.py (1)
1194-1196: Good: EOS guard now prevents duplicate EOS tokens.The guard
if input_ids[-1] != tokenizer.eos_token_idcorrectly prevents double-appending EOS. This addresses the previous review feedback.src/instructlab/training/__init__.py (1)
13-14: LGTM: PretrainingConfig correctly exposed in public API.The new
PretrainingConfigis properly added to__all__and imported from.config, making it available for external users.Also applies to: 27-28
src/instructlab/training/config.py (3)
80-92: LGTM: PretrainingConfig is well-defined.The
PretrainingConfigclass has a requiredblock_sizefield (no default forces explicit specification) and a sensible default fordocument_column_name. Good design for a configuration that requires explicit opt-in.
281-287: LGTM: Optional pretraining_config integration in TrainingArgs.The
pretraining_configfield is correctly typed asOptional[PretrainingConfig]withNonedefault, allowing backward compatibility while enabling pretraining mode when explicitly provided.
66-75: I'll verify whether theis_pretrainingandpretraining_column_namefields are actually used in the codebase or represent dead code.
<function_calls>
#!/bin/bashSearch for all usages of is_pretraining and pretraining_column_name
rg -n "is_pretraining|pretraining_column_name" --type=py -C3
</function_calls>
<function_calls>
#!/bin/bashCheck for process_documents_for_pretraining function and its usage
rg -n "process_documents_for_pretraining|def process_" --type=py -C5
</function_calls>tests/unit/test_pretraining_mode.py (3)
27-60: LGTM: Comprehensive test for PretrainingBlockDataset. The test correctly validates (a worked example of the arithmetic follows below):

- Block count calculation (7 tokens → 2 blocks of size 4)
- First block: full block with `num_loss_counted_tokens = block_size - 1 = 3`
- Second block: partial with padding, correct labels (-100 for pad), and
  `num_loss_counted_tokens = 2`
- `get_lengths()` returns actual token counts per block
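To make the arithmetic concrete, here is the same 7-token, block-size-4 case worked by hand. The snippet is a stand-alone illustration of the blocking rule described above, not the dataset class itself.

```python
# Stand-alone illustration of the 7-token, block_size=4 case from the test above.
import math

tokens = [11, 12, 13, 14, 15, 16, 17]   # 7 tokens after concatenation
block_size, pad_id = 4, 0

num_blocks = math.ceil(len(tokens) / block_size)          # 2 blocks
blocks = [tokens[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

for block in blocks:
    actual_len = len(block)                               # 4, then 3
    padded = block + [pad_id] * (block_size - actual_len)
    labels = block + [-100] * (block_size - actual_len)   # padding never counted in the loss
    loss_tokens = max(actual_len - 1, 0)                  # causal LM: N tokens -> N-1 predictions
    print(actual_len, padded, labels, loss_tokens)

# prints: 4 [11, 12, 13, 14] [11, 12, 13, 14] 3
#         3 [15, 16, 17, 0] [15, 16, 17, -100] 2
```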
63-104: LGTM: Good integration test for pretraining data loader.The test validates:
PretrainingBlockDatasetis correctly instantiated whenpretraining_configis provided- Batch structure includes expected keys and shapes
- Loss token counting is propagated correctly
106-157: LGTM: End-to-end test for document processing pipeline.The test properly:
- Mocks
AutoTokenizer.from_pretrainedto isolate the tokenization logic- Validates output JSONL structure (input_ids, len fields)
- Confirms EOS token is appended correctly
src/instructlab/training/main_ds.py (6)
49-52: LGTM: PretrainingConfig import added correctly.
366-378: LGTM: Data loader integration with pretraining config.The
get_data_loadercall correctly passespretraining_configusinggetattrwith aNonefallback, ensuring backward compatibility when the attribute is not present.
470-490: LGTM: Correct branching for pretraining data processing.When
pretraining_configis provided, the code correctly callsprocess_documents_for_pretrainingwith the appropriate parameters includingdocument_column_namefrom the config. Otherwise, it falls back to the standardprocess_datapath.
543-547: LGTM: Command construction propagates pretraining flags.The CLI arguments
--block-sizeand--document-column-nameare correctly appended whenpretraining_configis provided.
790-801: LGTM: CLI arguments for pretraining mode.New
--block-sizeand--document-column-namearguments are correctly defined with appropriate help text.
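For readers skimming the review, the shape of those arguments is roughly the following. This is a paraphrase of what the PR adds as summarized in this review; the help strings, defaults, and surrounding flow are illustrative rather than copied from main_ds.py.

```python
# Illustrative paraphrase of the new CLI plumbing summarized above; help strings,
# defaults, and flow are not copied from main_ds.py.
import argparse

from instructlab.training import PretrainingConfig

parser = argparse.ArgumentParser()
parser.add_argument(
    "--block-size",
    type=int,
    default=None,
    help="Enable pretraining mode; pack tokenized documents into blocks of this many tokens.",
)
parser.add_argument(
    "--document-column-name",
    type=str,
    default=None,
    help="Dataset column holding raw documents (only meaningful with --block-size).",
)
args = parser.parse_args()

# mirrors the validation/construction described in the 850-861 comment below
if args.document_column_name and args.block_size is None:
    parser.error("--document-column-name requires --block-size")

pretraining_config = None
if args.block_size is not None:
    kwargs = {"document_column_name": args.document_column_name} if args.document_column_name else {}
    pretraining_config = PretrainingConfig(block_size=args.block_size, **kwargs)
```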
850-861: LGTM: Validation and PretrainingConfig construction from CLI args.Good validation that
--document-column-namerequires--block-size. ThePretrainingConfigis correctly constructed only whenblock_sizeis provided, anddocument_column_nameis included via kwargs when specified.src/instructlab/training/sampler.py (5)
5-5: LGTM: Clean imports for pretraining support.The added imports and logger initialization are appropriate for the new pretraining functionality.
Also applies to: 8-8, 17-17, 23-24
385-389: LGTM: Correct length calculation.The
get_lengthsmethod correctly handles both uniform blocks and the partial last block.
447-447: LGTM: Clean integration of pretraining support.The conditional logic cleanly separates pretraining and fine-tuning dataset construction. The logging is helpful, and backward compatibility is maintained with the optional parameter.
Also applies to: 462-462, 467-478
362-362: I'm unable to complete the verification task as the repository failed to clone. Without direct access to the codebase, I cannot examine the referenced code sections to verify the loss token counting logic.However, based on the review comment itself, which is well-reasoned and identifies a specific concern about consistency in loss token counting between
PretrainingBlockDataset(lines 362, 366) andTokenDataset(lines 421-423), this requires manual verification by someone with access to the repository.Loss token counting consistency requires verification against model loss computation.
The comment correctly identifies that for causal language modeling, N tokens should produce N-1 predictions. However, the consistency between
`PretrainingBlockDataset` (which uses N-1 via `max(actual_len - 1, 0)` and `self.block_size - 1`) and `TokenDataset` (which counts `labels != -100`) needs to be verified by reviewing how the model actually computes loss and whether the -1 is applied consistently across both pretraining and fine-tuning modes; a generic sanity check is sketched below.
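The N-tokens-to-N-1-predictions point can be checked with a few lines of PyTorch. This is a generic causal-LM illustration of the counting rule, not the project's loss code.

```python
# Generic illustration: with HF-style causal LM loss, N labeled tokens yield N-1 predictions.
import torch

labels = torch.tensor([11, 12, 13, -100])        # one padded position, as in PretrainingBlockDataset
shift_labels = labels[1:]                        # models shift internally: predict token t+1 from t
num_loss_counted = (shift_labels != -100).sum()  # -> 2, i.e. (actual_len=3) - 1
print(int(num_loss_counted))
```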
313-318: I encountered technical difficulties accessing the repository to complete the verification. However, I can provide an assessment based on the code snippet provided:Memory loading pattern confirmed, but verification incomplete.
The code snippet does exhibit the memory concern described—it loads the entire dataset's
input_idsinto a single list during initialization. While this pattern is technically valid for smaller datasets, I cannot verify:
- Target dataset sizes for this project
- Whether this is a documented limitation or known issue
- If alternative lazy-loading patterns exist elsewhere in the codebase
- Real-world usage constraints and typical dataset sizes
Given that I cannot access the repository to validate the severity and context of this concern, manual verification from the development team is needed to confirm whether this is a blocker for your intended use cases.
This PR exposes a new API which allows the ingestion of datasets containing a `document` field, which then gets tokenized as `input_ids`.
Summary by CodeRabbit
New Features
Tests
Documentation