Exposes API for processing pretraining data #672
Conversation
Walkthrough
Added continual pretraining support: a new PretrainingConfig and related TrainingArgs field, a document-to-JSONL pretraining processor, PretrainingBlockDataset and a loader branch for fixed-size block sampling, CLI/training plumbing for block-size and document-column-name, a public export of PretrainingConfig, and unit tests for the pretraining flow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant CLI as CLI / Orchestrator
    participant Main as main_ds.train
    participant Proc as data_process.process_documents_for_pretraining
    participant Loader as sampler.get_data_loader
    participant Trainer as DistributedTrainer
    Note over Main: start training orchestration
    CLI->>Main: parse args (may include --block-size, --document-column-name)
    Main->>Main: set train_args.pretraining_config (optional)
    alt pretraining_config provided
        Main->>Proc: process_documents_for_pretraining(data_path, out_path, model_path, ...)
        Proc-->>Main: writes data.jsonl (tokenized documents)
        Main->>Loader: get_data_loader(..., pretraining_config)
        Loader-->>Main: returns DataLoader(PretrainingBlockDataset)
        Main->>Trainer: assemble training command with --block-size & --document-column-name
    else standard training
        Main->>Proc: process_data(...) (existing pipeline)
        Main->>Loader: get_data_loader(..., pretraining_config=None)
        Loader-->>Main: returns DataLoader(TokenDataset)
        Main->>Trainer: assemble standard training command
    end
    Main->>Trainer: launch distributed training
    Trainer-->>Main: monitor/exit
```
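For readers wiring this up from Python rather than the CLI, a minimal usage sketch of the flow above might look like the following. It assumes only the public API summarized in this walkthrough (`PretrainingConfig` and the `TrainingArgs.pretraining_config` field); the remaining `TrainingArgs` fields and all paths are illustrative, not taken from this PR.

```python
# Hedged sketch based on the walkthrough above; paths and the non-pretraining
# TrainingArgs fields are illustrative, not copied from this PR.
from instructlab.training import PretrainingConfig, TrainingArgs

train_args = TrainingArgs(
    model_path="/models/my-base-model",       # tokenizer + weights to continue pretraining
    data_path="/data/documents.jsonl",        # rows shaped like {"document": "raw text"}
    data_output_dir="/data/processed",        # tokenized data.jsonl lands here
    ckpt_output_dir="/checkpoints",
    # Opting in to continual pretraining: documents are tokenized, concatenated,
    # and served as fixed-size blocks instead of chat-templated samples.
    pretraining_config=PretrainingConfig(
        block_size=4096,                      # required: tokens per training block
        document_column_name="document",      # optional: defaults to "document"
    ),
)
# run_training(torchrun_args, train_args) would then follow the
# "pretraining_config provided" branch in the diagram above.
```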
```mermaid
sequenceDiagram
    autonumber
    participant Caller as process_documents_for_pretraining()
    participant FS as FileSystem
    participant Tok as AutoTokenizer
    Note over Caller: tokenization -> jsonl output per document
    Caller->>FS: open/read JSON dataset
    FS-->>Caller: dataset
    Caller->>Tok: AutoTokenizer.from_pretrained(model_path)
    Tok-->>Caller: tokenizer (must provide eos_token_id)
    loop per document
        Caller->>Tok: tokenizer(doc, add_special_tokens=True, return_dict=False)
        Tok-->>Caller: input_ids
        Caller->>Caller: append EOS if missing
        Caller->>FS: append {"input_ids": [...], "len": N} to data.jsonl
    end
    Caller->>FS: close/write complete
    Caller-->>Caller: log totals and average token length
```
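In plain Python, the per-document loop in this diagram reduces to roughly the following. This is a simplified sketch using the Hugging Face AutoTokenizer, not the PR's actual implementation; the helper name and file handling are illustrative.

```python
# Simplified sketch of the diagrammed flow; helper name and I/O details are illustrative.
import json
from transformers import AutoTokenizer

def tokenize_documents_to_jsonl(documents: list[str], model_path: str, out_file: str) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if tokenizer.eos_token_id is None:
        raise ValueError("tokenizer must define an EOS token for pretraining")

    total_tokens = 0
    with open(out_file, "w", encoding="utf-8") as f:
        for doc in documents:
            input_ids = tokenizer.encode(doc, add_special_tokens=True)
            # append EOS only if the tokenizer did not already add one
            if not input_ids or input_ids[-1] != tokenizer.eos_token_id:
                input_ids.append(tokenizer.eos_token_id)
            f.write(json.dumps({"input_ids": input_ids, "len": len(input_ids)}) + "\n")
            total_tokens += len(input_ids)

    print(f"wrote {len(documents)} documents, {total_tokens} tokens, "
          f"avg {total_tokens / max(len(documents), 1):.1f} tokens/doc")
```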
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/instructlab/training/data_process.py (2)
1202-1204: Prefer lazy formatting in logger calls. Using f-strings in `logger.info()` evaluates the string even when the log level is disabled. Use lazy formatting for better performance.

```diff
- logger.info(f"Processed {len(tokenized_data):,} documents")
- logger.info(f"Total tokens: {total_tokens:,}")
- logger.info(f"Average tokens per document: {avg_tokens:.1f}")
+ logger.info("Processed %s documents", f"{len(tokenized_data):,}")
+ logger.info("Total tokens: %s", f"{total_tokens:,}")
+ logger.info("Average tokens per document: %.1f", avg_tokens)
```
1217-1218: Same lazy formatting suggestion for these logger calls.

```diff
- logger.info(f"Saved tokenized documents to {output_file}")
+ logger.info("Saved tokenized documents to %s", output_file)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/instructlab/training/config.py (1 hunks)
- src/instructlab/training/data_process.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (4)
src/instructlab/training/config.py (2)
66-70: LGTM! The `is_pretraining` field is well-defined with a clear description that explains its purpose.
75-80: LGTM! The validator correctly enforces mutual exclusivity between `is_pretraining` mode and `chat_tmpl_path`, which aligns with the pretraining workflow that tokenizes raw documents without chat templates.
src/instructlab/training/data_process.py (2)
1156-1179: LGTM! Good defensive programming with proper validation for write permissions, dataset format, field requirements, and tokenizer EOS token.
1136-1141: Verify integration: `is_pretraining` flag not wired to this function. The `DataProcessArgs.is_pretraining` flag was added in `config.py`, but neither `main()` nor `process_data()` in this file dispatches to `process_documents_for_pretraining` based on that flag. Users would need to call this function directly. If this is intentional (exposing API only), the approach is fine. Otherwise, consider adding integration in `process_data()`:

```python
def process_data(..., is_pretraining: bool = False):
    if is_pretraining:
        process_documents_for_pretraining(
            data_path=data_path,
            data_output_path=data_output_path,
            model_path=model_path,
            num_cpu_procs=num_cpu_procs,
        )
    elif chat_tmpl_path:
        # existing legacy path
        ...
```
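If the API-only approach is kept, calling the new function directly would look roughly like this; the paths below are placeholders, and the parameter names follow the signature and docstring quoted in this review.

```python
# Sketch of calling the exposed API directly; paths are illustrative.
from instructlab.training.data_process import process_documents_for_pretraining

process_documents_for_pretraining(
    data_path="/data/documents.jsonl",     # rows shaped like {"document": "raw text"}
    data_output_path="/data/processed",    # tokenized data.jsonl is written here
    model_path="/models/my-base-model",    # source of the tokenizer
    num_cpu_procs=8,
    document_column_name="document",       # default per the docstring below
)
```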
```python
    Pattern: Each document → [BOS][tokens][EOS]

    Args:
        data_path: Path to input JSONL with {"documents": "text"} format
```
Fix documentation: field name should be document (singular).
The docstring says {"documents": "text"} but Line 1170 validates for 'document' (singular). This inconsistency will confuse API users.
```diff
- data_path: Path to input JSONL with {"documents": "text"} format
+ data_path: Path to input JSONL with {"document": "text"} format
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In src/instructlab/training/data_process.py around line 1151 (and nearby
docstring area), the docstring incorrectly documents the input JSONL as
{"documents": "text"} while the code validates for the key 'document'
(singular); update the docstring to {"document": "text"} to match the validation
(or alternatively change the validation to accept 'documents' if you intend
plural), and ensure both the docstring and validation use the same field name so
API users are not confused.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/instructlab/training/data_process.py (1)
1152-1152: Fix documentation: field name should be `document` (singular). The docstring says `{"documents": "text"}` but the code validates for `'document'` (singular) at Line 1171 and the parameter defaults to `"document"`. This inconsistency will confuse API users.

```diff
- data_path: Path to input JSONL with {"documents": "text"} format
+ data_path: Path to input JSONL with {"document": "text"} format
```
🧹 Nitpick comments (1)
src/instructlab/training/data_process.py (1)
1176-1180: Consider adding error handling for tokenizer loading and validating BOS token. For consistency with the dataset loading pattern (lines 1161-1166), consider wrapping the tokenizer loading in a try/except. Additionally, since the docstring at Line 1149 documents that documents follow the `[BOS][tokens][EOS]` pattern, consider validating that `tokenizer.bos_token_id` exists, similar to the EOS validation.

```diff
 logger.info("Loading tokenizer from %s", model_path)
-tokenizer = AutoTokenizer.from_pretrained(model_path)
+try:
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+except Exception as e:
+    raise ValueError(
+        f"Failed to load tokenizer from {model_path}"
+    ) from e
+
+if tokenizer.bos_token_id is None:
+    logger.warning("Tokenizer does not have a BOS token defined")
+
 if tokenizer.eos_token_id is None:
     raise ValueError("Tokenizer must have an EOS token defined for pretraining")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/instructlab/training/config.py (1 hunks)
- src/instructlab/training/data_process.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/instructlab/training/config.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/data_process.py (1)
1190-1192: Good defensive guard against duplicate EOS tokens. The guard correctly prevents duplicate EOS tokens, addressing the concern from the previous review. This handles tokenizers that may or may not automatically append EOS with `add_special_tokens=True`.
```python
def tokenize_document(sample):
    input_ids = tokenizer.encode(
        sample[document_column_name], add_special_tokens=True
    )

    # ensures eos token is present without double-adding it.
    if input_ids[-1] != tokenizer.eos_token_id:
        input_ids.append(tokenizer.eos_token_id)

    return {
        "input_ids": input_ids,
        "len": len(input_ids),
    }
```
🛠️ Refactor suggestion | 🟠 Major
Add validation for empty or None documents.
The code doesn't validate that sample[document_column_name] is non-empty before tokenization. Empty strings would produce only [BOS][EOS] tokens, and None values would cause tokenizer.encode() to fail.
Apply this diff to add validation:
```diff
 def tokenize_document(sample):
+    doc_text = sample[document_column_name]
+    if not doc_text or not doc_text.strip():
+        raise ValueError(f"Document at column '{document_column_name}' is empty or None")
+
     input_ids = tokenizer.encode(
-        sample[document_column_name], add_special_tokens=True
+        doc_text, add_special_tokens=True
     )
```

🤖 Prompt for AI Agents
In src/instructlab/training/data_process.py around lines 1185 to 1197, the
tokenize_document function does not validate sample[document_column_name] before
calling tokenizer.encode which will fail on None and produce invalid output for
empty strings; update the function to first retrieve the field into a local
variable, check that it is not None and that str(value).strip() is non-empty,
and if the check fails raise a descriptive ValueError (or return/skip per
surrounding flow) indicating the missing/empty document and include an
identifier if available; only then call tokenizer.encode on the validated string
and proceed with the existing EOS-token handling and returned dict.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/instructlab/training/config.py (1)
66-75: Pretraining config fields look sound; consider minor naming and doc tightening. The new `is_pretraining` and `pretraining_column_name` fields are backward‑compatible and clearly scoped to pretraining, which is good. A couple of small polish ideas:

- The naming between config (`pretraining_column_name`) and the pretraining helper (`document_column_name` in `process_documents_for_pretraining(...)` per the summary) is slightly inconsistent. Consider aligning them (e.g., `document_column_name` everywhere) to reduce cognitive overhead when wiring configs to function parameters.
- You may want the `pretraining_column_name` description to explicitly mention the default and mode dependency, e.g., "Name of the dataset column containing raw documents to pretrain on (default: `document`). Used only when `is_pretraining` is True."

These are non‑blocking but would make the API a bit clearer for users and future maintainers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/instructlab/training/config.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/instructlab/training/data_process.py (2)
1163-1177: Validate document contents (None/empty/non‑string) before tokenization.
`tokenize_document` assumes `sample[document_column_name]` is a non‑empty string. If a row has `None`, a non‑string, or an empty/whitespace‑only value, `tokenizer.encode(...)` and especially `input_ids[-1]` can fail or produce meaningless sequences. Consider validating and failing fast with a clear error:

```diff
-    def tokenize_document(sample):
-        input_ids = tokenizer.encode(
-            sample[document_column_name], add_special_tokens=True
-        )
+    def tokenize_document(sample):
+        doc_text = sample[document_column_name]
+        if doc_text is None or not isinstance(doc_text, str) or not doc_text.strip():
+            raise ValueError(
+                f"Document in column '{document_column_name}' is missing, non-string, or empty"
+            )
+
+        input_ids = tokenizer.encode(doc_text, add_special_tokens=True)
@@
-        # ensures eos token is present without double-adding it.
-        if input_ids[-1] != tokenizer.eos_token_id:
+        # ensures eos token is present without double-adding it.
+        if input_ids and input_ids[-1] != tokenizer.eos_token_id:
             input_ids.append(tokenizer.eos_token_id)
```

This makes the API more robust to dirty data and avoids opaque tokenizer/indexing errors during large runs.
Also applies to: 1186-1199
1145-1159: Fix docstring to match `document_column_name` and actual schema. The docstring currently documents `{"documents": "text"}`, but the function defaults `document_column_name="document"` and later validates that this exact field exists. Recommend aligning the docstring with the actual API:

```diff
 Args:
-    data_path: Path to input JSONL with {"documents": "text"} format
+    data_path: Path to input JSONL with {"document": "text"} format
     data_output_path: Directory for processed data output
     model_path: Path to model/tokenizer
     num_cpu_procs: Number of parallel processes
     document_column_name: Name of the column containing the documents
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/instructlab/training/data_process.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: pylint
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/data_process.py (1)
691-693: Return type verification confirms the code change is correct. The web search confirms that `tokenizer.apply_chat_template(..., return_dict=False)` with the default `tokenize=True` returns a `list[int]` (or `list[list[int]]` for batched inputs) rather than a dictionary. This flat token sequence is exactly what's needed for the subsequent indexing and region-masking logic that follows in the unmasking implementation.
```diff
 data_with_input_ids = data.map(
     lambda x: {
-        "input_ids": tokenizer.apply_chat_template(x["messages"], tokenize=True),
+        # newer versions of transformers have `return_dict=True` by default
+        "input_ids": tokenizer.apply_chat_template(x["messages"], tokenize=True, return_dict=False),
         "unmask": bool(x["unmask"]) if "unmask" in x else False,
     },
```
🧩 Analysis chain
transformers 4.55.0 PreTrainedTokenizer apply_chat_template default return_dict behavior
<web_search_result>
Short answer: In transformers 4.55.0, PreTrainedTokenizer.apply_chat_template defaults return_dict=False. When you set return_dict=True the method returns the usual tokenizer output dict (input_ids, attention_mask, pixel_values, etc. where applicable); with return_dict=False it returns the token id list (or token array) ready for the model. [1][2][3]
Sources
- Transformers tokenizer docs (apply_chat_template signature & param default return_dict=False). [1]
- Chat-template docs / internal tokenizer docs describing return_dict behavior and pixel_values handling. [2]
- v4.55.0 examples showing usage with return_dict=True. [3]
</web_search_result>
Script executed:
```bash
#!/bin/bash
# Search for other usages of apply_chat_template in the codebase to understand context
rg -n "apply_chat_template" --type=py -A2 -B2
```

Repository: instructlab/training
<shell_result>
Repository: instructlab/training
Script execution failed with unrecoverable error: [CAPABILITY:RUN_SHELL_SCRIPT] Failed to clone repository instructlab/training: Clone operation failed
</shell_result>
transformers apply_chat_template return_dict default history earlier versions before 4.55.0
<web_search_result>
Short answer: Historically apply_chat_template defaulted to tokenize=True and return_dict=False (so it usually returned token ids or a string). Over a series of changes (notably PR/commit referenced as #36149) callers from Processor/ProcessorMixin started invoking tokenizer.apply_chat_template with tokenize=False and return_dict=False, which effectively forced non-tokenized string output and made options that require tokenization/return_dict (e.g. return_assistant_tokens_mask=True) unusable. See the changelog/issues and docs for details. [1][2][3]
Sources
- Issue describing the behavior change and how #36149 hard-codes tokenize=False and return_dict=False (blocks return_assistant_tokens_mask). [1]
- apply_chat_template signature/docs showing return_dict default and behavior (return_dict=False by default; return_assistant_tokens_mask requires return_dict=True/tokenize). [2]
- Hugging Face chat-templating docs / examples describing apply_chat_template usage and generation-mask behavior. [3]
</web_search_result>
Clarify the rationale: return_dict=False is already the default, not a future-proofing measure.
The inline comment suggests that "newer versions of transformers have return_dict=True by default," but apply_chat_template defaults to return_dict=False in transformers 4.55.0 and historically has done so. The explicit parameter here is redundant but harmless.
If this is intended as defensive coding against hypothetical future changes, consider clarifying or removing the comment to avoid misleading future readers. If the intent is to document current behavior, the comment should state that the parameter preserves the existing default rather than implying a version-specific workaround.
🤖 Prompt for AI Agents
In src/instructlab/training/data_process.py around lines 413 to 418, the inline
comment claiming "newer versions of transformers have `return_dict=True` by
default" is misleading; update the comment to reflect the actual intent: either
remove the comment entirely (since `return_dict=False` is the current default
for apply_chat_template) or replace it with a concise note like "explicitly set
return_dict=False to preserve current behavior" so readers understand this is
preserving existing behavior rather than referencing a specific version change.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/instructlab/training/config.py (1)
80-88: Add "public API" comment for consistency.Other public config classes in this file (e.g.,
DataProcessArgs,TorchrunArgs,LoraOptions,TrainingArgs) are marked with# public APIcomments. Consider adding this marker toPretrainingConfigfor consistency since it's exposed viaTrainingArgs.+# public API class PretrainingConfig(BaseModel): """ Configuration for pretraining mode. """src/instructlab/training/sampler.py (1)
382-387: Clarify `get_lengths()` vs `__getitem__["len"]` semantics.
`get_lengths()` returns the actual token count for the last block (`last_block_len`), while `__getitem__` always returns `"len": self.block_size` (the padded length). This inconsistency may be intentional:

- `"len"` in samples → padded size for batch allocation
- `get_lengths()` → actual tokens for statistics

Consider adding a docstring to
`get_lengths()` clarifying it returns actual token counts, not padded lengths, to prevent confusion.

```diff
 def get_lengths(self) -> np.ndarray:
+    """Return actual token counts per block (not padded lengths)."""
     lengths = np.full(self.num_blocks, self.block_size, dtype=np.int64)
     if self.num_blocks and self.last_block_len != self.block_size:
         lengths[-1] = self.last_block_len
     return lengths
```

tests/unit/test_pretraining_mode.py (1)
102-153: Consider adding edge case tests. The test validates the happy path but could benefit from additional edge cases (sketched below):

- Empty document handling
- Documents with existing EOS tokens (to verify no double-adding)
- Custom
  `document_column_name` parameter

These can be added in a follow-up PR if desired.
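If they are added later, they might look roughly like this. The stub tokenizer fixture and monkeypatched `AutoTokenizer.from_pretrained` mirror the approach this test file already uses, but the fixture names, the `out/data.jsonl` layout, and the empty-document `ValueError` are assumptions based on suggestions earlier in this review rather than current behavior.

```python
# Hypothetical follow-up tests; fixture names, the out/data.jsonl layout, and the
# empty-document ValueError are assumptions based on this review, not current code.
import json
import pytest

from instructlab.training.data_process import process_documents_for_pretraining


def _patch_tokenizer(monkeypatch, stub_tokenizer):
    monkeypatch.setattr(
        "instructlab.training.data_process.AutoTokenizer.from_pretrained",
        lambda *_a, **_k: stub_tokenizer,
    )


def test_empty_document_is_rejected(tmp_path, monkeypatch, stub_tokenizer):
    _patch_tokenizer(monkeypatch, stub_tokenizer)
    data_file = tmp_path / "docs.jsonl"
    data_file.write_text(json.dumps({"document": ""}) + "\n")
    with pytest.raises(ValueError):  # assumes the empty-document validation is adopted
        process_documents_for_pretraining(
            data_path=str(data_file),
            data_output_path=str(tmp_path / "out"),
            model_path="stub-model",
            num_cpu_procs=1,
        )


def test_custom_document_column_name(tmp_path, monkeypatch, stub_tokenizer):
    _patch_tokenizer(monkeypatch, stub_tokenizer)
    data_file = tmp_path / "docs.jsonl"
    data_file.write_text(json.dumps({"text": "hello world"}) + "\n")
    process_documents_for_pretraining(
        data_path=str(data_file),
        data_output_path=str(tmp_path / "out"),
        model_path="stub-model",
        num_cpu_procs=1,
        document_column_name="text",
    )
    rows = [json.loads(line) for line in (tmp_path / "out" / "data.jsonl").read_text().splitlines()]
    assert rows and rows[0]["input_ids"][-1] == stub_tokenizer.eos_token_id
```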
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/instructlab/training/config.py (2 hunks)
- src/instructlab/training/main_ds.py (7 hunks)
- src/instructlab/training/sampler.py (5 hunks)
- tests/unit/test_pretraining_mode.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/unit/test_pretraining_mode.py (3)
src/instructlab/training/config.py (1)
PretrainingConfig(80-87)src/instructlab/training/data_process.py (1)
process_documents_for_pretraining(1138-1224)src/instructlab/training/sampler.py (4)
PretrainingBlockDataset(298-386)get_data_loader(433-496)get_lengths(382-386)get_lengths(429-430)
src/instructlab/training/sampler.py (1)
src/instructlab/training/config.py (1)
PretrainingConfig(80-87)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/config.py (1)
PretrainingConfig(80-87)src/instructlab/training/data_process.py (2)
process_documents_for_pretraining(1138-1224)process_data(1452-1495)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (15)
src/instructlab/training/config.py (2)
66-75: LGTM!The new pretraining fields are well-defined with sensible defaults and clear descriptions. The
is_pretrainingflag and configurable column name provide good flexibility for the pretraining workflow.
277-283: LGTM!The
pretraining_configfield is correctly defined asOptionalwith aNonedefault, allowing pretraining to be an opt-in feature. The description clearly explains the purpose.src/instructlab/training/sampler.py (5)
4-22: LGTM!The new imports are appropriately organized and the logger is correctly initialized at module level.
340-370: Verify labels alignment for causal LM training.For full blocks (lines 361-363),
labelsis set identical toinput_ids. This pattern relies on the model internally shifting labels during loss computation (common in HuggingFace causal LM models). Confirm this is the expected behavior for the models used with this training library.Additionally,
loss_tokens = self.block_size - 1correctly accounts for the first token having no prediction target.
372-380: LGTM!The
from_jsonl_filefactory method correctly loads the dataset and delegates to the constructor.
444-474: LGTM!The
get_data_loaderfunction correctly branches betweenPretrainingBlockDatasetandTokenDatasetbased on the presence ofpretraining_config. The logging provides helpful context about the pretraining configuration.
298-338: I encountered a repository clone failure, which prevents me from verifying the codebase patterns. However, I can analyze the review comment based on the provided code snippet and context.Review comment analysis:
The review comment raises a valid technical concern:
- The code does concatenate all
input_idsinto a single Python list (lines 312-317)- This is indeed memory-intensive for large datasets (billions of tokens)
- The comment appropriately acknowledges this is acceptable for manageable sizes
- It flags this as an optional refactor for future iteration
The concern is technically sound and the suggestion is reasonable. However, without access to the full codebase, I cannot verify:
- Whether existing patterns for memory-mapped arrays or lazy loading are already implemented elsewhere
- What the expected dataset size ranges are in practice
- Whether this concern has already been addressed in other parts of the codebase
Since verification cannot be completed through automated checks, the review comment should be flagged for manual verification.
Memory usage consideration for large datasets.
The implementation concatenates all
`input_ids` into a single Python list in memory (lines 312-317). For very large pretraining corpora (billions of tokens), this could lead to high memory consumption during dataset initialization. For now, this is acceptable if the expected dataset sizes are manageable. If memory becomes an issue, consider lazy loading with memory-mapped arrays or chunked processing in a future iteration (an illustrative sketch follows).
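One possible direction, should that ever become necessary, is to materialize the concatenated token stream once as a memory-mapped array and index blocks out of it lazily. The class below is only an illustration of the idea, with an assumed `.npy` token file; it is not code from the PR and drops the ragged final block for simplicity.

```python
# Illustrative only: lazy block access over a memory-mapped token stream.
import numpy as np
from torch.utils.data import Dataset


class MemmapBlockDataset(Dataset):
    """Serve fixed-size blocks from tokens stored in a .npy file without loading them all."""

    def __init__(self, token_file: str, block_size: int):
        self.tokens = np.load(token_file, mmap_mode="r")  # stays on disk
        self.block_size = block_size
        self.num_blocks = len(self.tokens) // block_size  # drop the ragged tail for simplicity

    def __len__(self) -> int:
        return self.num_blocks

    def __getitem__(self, idx: int) -> dict:
        start = idx * self.block_size
        block = np.asarray(self.tokens[start : start + self.block_size], dtype=np.int64)
        return {"input_ids": block, "labels": block.copy(), "len": self.block_size}
```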
tests/unit/test_pretraining_mode.py (2)
24-57: LGTM!Good test coverage for
PretrainingBlockDataset. The test correctly validates:
- Block creation with concatenated tokens
- Padding behavior for the partial last block
- Loss token counting for full vs partial blocks
get_lengths()returning actual token counts
59-100: LGTM!Comprehensive integration test for the pretraining data loader. The test correctly validates the packed batch structure, sample counts, and loss token aggregation.
src/instructlab/training/main_ds.py (6)
51-52: LGTM!The
PretrainingConfigimport is correctly added alongside related config imports.
366-378: LGTM!The
pretraining_configis correctly passed toget_data_loaderusing defensivegetattraccess, which handles both CLI-invoked and programmatic calls gracefully.
542-545: LGTM!The
--block-sizeflag is correctly appended to the subprocess command when pretraining mode is enabled.
788-793: LGTM!The
--block-sizeCLI argument is well-defined with a clear help message explaining its purpose.
842-845: LGTM!The CLI argument is correctly parsed into a
PretrainingConfiginstance when provided, maintaining the same pattern as other config objects.
470-489: I apologize for the technical difficulties with repository access. I'm unable to execute verification scripts due to consistent cloning failures. However, based on the information provided in the review comment itself, I can provide a rewritten version:Document column name parameter is not exposed through
TrainingArgs.The call to
process_documents_for_pretrainingdoesn't pass thedocument_column_nameparameter (which defaults to"document"), and this parameter is not configurable throughTrainingArgs. This prevents users ofrun_training()from customizing the column name for pretraining workflows.Consider either:
- Adding
pretraining_document_column_nametoPretrainingConfigorTrainingArgsto allow customization, or- Documenting this as an intentional limitation if simplicity of the API is preferred.
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/instructlab/training/data_process.py (3)
693-694: Same misleading comment aboutreturn_dictdefault.This is the same issue as at line 415. Consider updating or removing the comment.
1155-1156: Docstring field name mismatch: should bedocument(singular).The docstring says
{"documents": "text"}but the code validates for'document'(thedocument_column_namedefault). This was flagged in a prior review and remains unfixed.- data_path: Path to input JSONL with {"documents": "text"} format + data_path: Path to input JSONL with {"document": "text"} format
1189-1201: Missing validation for empty or None documents.The
tokenize_documentfunction doesn't validate thatsample[document_column_name]is non-empty before tokenization. Empty strings produce minimal[BOS][EOS]tokens, andNonevalues will causetokenizer.encode()to fail. This was flagged in a prior review.def tokenize_document(sample): + doc_text = sample[document_column_name] + if not doc_text or not isinstance(doc_text, str) or not doc_text.strip(): + raise ValueError( + f"Document at column '{document_column_name}' is empty or invalid" + ) + input_ids = tokenizer.encode( - sample[document_column_name], add_special_tokens=True + doc_text, add_special_tokens=True )
🧹 Nitpick comments (5)
src/instructlab/training/data_process.py (3)
415-418: Comment is misleading:return_dict=Falseis already the default.The comment states newer transformers versions have
return_dict=Trueby default, butapply_chat_templatehas always defaulted toreturn_dict=False. The explicit parameter is harmless but the comment is incorrect.- # newer versions of transformers have `return_dict=True` by default + # explicitly set return_dict=False to ensure we get a list of token ids "input_ids": tokenizer.apply_chat_template( x["messages"], tokenize=True, return_dict=False ),
1210-1215: Use lazy string formatting for logger calls.The f-string formatting happens unconditionally even if the log level would suppress the message. Use
%-style formatting for performance.- logger.info(f"Processed {len(tokenized_data):,} documents") - logger.info(f"Total tokens: {total_tokens:,}") - logger.info(f"Average tokens per document: {avg_tokens:.1f}") + logger.info("Processed %s documents", f"{len(tokenized_data):,}") + logger.info("Total tokens: %s", f"{total_tokens:,}") + logger.info("Average tokens per document: %.1f", avg_tokens)
1225-1226: Same f-string formatting in logger calls.- logger.info(f"Saved tokenized documents to {output_file}") + logger.info("Saved tokenized documents to %s", output_file)tests/unit/test_pretraining_mode.py (1)
112-115: StubTokenizer.encode logic may be inverted.The stub returns
basewhenadd_special_tokens=Trueandbase[1:]whenFalse. This implies BOS is always at index 0, but for a stub that doesn't actually prepend a BOS token (it just uses char ordinals), this is semantically inconsistent. However, since the test just validates the pipeline flow rather than tokenizer correctness, this is acceptable for test purposes.Consider adding a comment clarifying the stub behavior:
def encode(self, text, add_special_tokens=True): + # Stub: simulate BOS by including first char only when add_special_tokens=True base = [ord(ch) % 50 + 1 for ch in text] return base if add_special_tokens else base[1:]src/instructlab/training/sampler.py (1)
375-383: Consider adding error handling for better user experience.The
from_jsonl_filemethod lacks explicit error handling. If the file is missing or malformed, users will receive raw exceptions fromload_dataset. Consider adding try/except blocks with clearer error messages.Example:
@classmethod def from_jsonl_file( cls, data_path: str, block_size: int, pad_token_id: int, ) -> "PretrainingBlockDataset": + try: dataset = load_dataset("json", data_files=data_path, split="train") + except FileNotFoundError: + raise FileNotFoundError(f"Pretraining data file not found: {data_path}") + except Exception as e: + raise ValueError(f"Failed to load pretraining data from {data_path}: {e}") return cls(dataset, block_size, pad_token_id)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/instructlab/training/__init__.py(2 hunks)src/instructlab/training/config.py(2 hunks)src/instructlab/training/data_process.py(3 hunks)src/instructlab/training/main_ds.py(7 hunks)src/instructlab/training/sampler.py(5 hunks)tests/unit/test_pretraining_mode.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
src/instructlab/training/__init__.py (1)
src/instructlab/training/config.py (1)
PretrainingConfig(80-91)
tests/unit/test_pretraining_mode.py (3)
src/instructlab/training/config.py (1)
PretrainingConfig(80-91)src/instructlab/training/data_process.py (2)
process_documents_for_pretraining(1140-1226)main(1500-1517)src/instructlab/training/sampler.py (4)
PretrainingBlockDataset(299-389)get_data_loader(436-499)get_lengths(385-389)get_lengths(432-433)
src/instructlab/training/sampler.py (3)
src/instructlab/training/config.py (1)
PretrainingConfig(80-91)src/instructlab/training/padded_batch_packer.py (1)
batch_lengths_to_minibatches_padded(200-244)src/instructlab/training/type_definitions.py (1)
CollatedItem(49-61)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/config.py (1)
PretrainingConfig(80-91)src/instructlab/training/data_process.py (2)
process_documents_for_pretraining(1140-1226)process_data(1454-1497)
src/instructlab/training/data_process.py (1)
tests/unit/test_pretraining_mode.py (1)
encode(113-115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (19)
src/instructlab/training/data_process.py (1)
1194-1196: Good: EOS guard now prevents duplicate EOS tokens.The guard
if input_ids[-1] != tokenizer.eos_token_idcorrectly prevents double-appending EOS. This addresses the previous review feedback.src/instructlab/training/__init__.py (1)
13-14: LGTM: PretrainingConfig correctly exposed in public API.The new
PretrainingConfigis properly added to__all__and imported from.config, making it available for external users.Also applies to: 27-28
src/instructlab/training/config.py (3)
80-92: LGTM: PretrainingConfig is well-defined.The
PretrainingConfigclass has a requiredblock_sizefield (no default forces explicit specification) and a sensible default fordocument_column_name. Good design for a configuration that requires explicit opt-in.
281-287: LGTM: Optional pretraining_config integration in TrainingArgs.The
pretraining_configfield is correctly typed asOptional[PretrainingConfig]withNonedefault, allowing backward compatibility while enabling pretraining mode when explicitly provided.
66-75: I'll verify whether theis_pretrainingandpretraining_column_namefields are actually used in the codebase or represent dead code.
<function_calls>
#!/bin/bashSearch for all usages of is_pretraining and pretraining_column_name
rg -n "is_pretraining|pretraining_column_name" --type=py -C3
</function_calls>
<function_calls>
#!/bin/bashCheck for process_documents_for_pretraining function and its usage
rg -n "process_documents_for_pretraining|def process_" --type=py -C5
</function_calls>tests/unit/test_pretraining_mode.py (3)
27-60: LGTM: Comprehensive test for PretrainingBlockDataset. The test correctly validates (a worked example of the arithmetic follows below):

- Block count calculation (7 tokens → 2 blocks of size 4)
- First block: full block with `num_loss_counted_tokens = block_size - 1 = 3`
- Second block: partial with padding, correct labels (-100 for pad), and
  `num_loss_counted_tokens = 2`
- `get_lengths()` returns actual token counts per block
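To make the arithmetic concrete, here is the same 7-token, block-size-4 case worked by hand. The snippet is a stand-alone illustration of the blocking rule described above, not the dataset class itself.

```python
# Stand-alone illustration of the 7-token, block_size=4 case from the test above.
import math

tokens = [11, 12, 13, 14, 15, 16, 17]   # 7 tokens after concatenation
block_size, pad_id = 4, 0

num_blocks = math.ceil(len(tokens) / block_size)          # 2 blocks
blocks = [tokens[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

for block in blocks:
    actual_len = len(block)                               # 4, then 3
    padded = block + [pad_id] * (block_size - actual_len)
    labels = block + [-100] * (block_size - actual_len)   # padding never counted in the loss
    loss_tokens = max(actual_len - 1, 0)                  # causal LM: N tokens -> N-1 predictions
    print(actual_len, padded, labels, loss_tokens)

# prints: 4 [11, 12, 13, 14] [11, 12, 13, 14] 3
#         3 [15, 16, 17, 0] [15, 16, 17, -100] 2
```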
63-104: LGTM: Good integration test for pretraining data loader.The test validates:
PretrainingBlockDatasetis correctly instantiated whenpretraining_configis provided- Batch structure includes expected keys and shapes
- Loss token counting is propagated correctly
106-157: LGTM: End-to-end test for document processing pipeline.The test properly:
- Mocks
AutoTokenizer.from_pretrainedto isolate the tokenization logic- Validates output JSONL structure (input_ids, len fields)
- Confirms EOS token is appended correctly
src/instructlab/training/main_ds.py (6)
49-52: LGTM: PretrainingConfig import added correctly.
366-378: LGTM: Data loader integration with pretraining config.The
get_data_loadercall correctly passespretraining_configusinggetattrwith aNonefallback, ensuring backward compatibility when the attribute is not present.
470-490: LGTM: Correct branching for pretraining data processing.When
pretraining_configis provided, the code correctly callsprocess_documents_for_pretrainingwith the appropriate parameters includingdocument_column_namefrom the config. Otherwise, it falls back to the standardprocess_datapath.
543-547: LGTM: Command construction propagates pretraining flags.The CLI arguments
--block-sizeand--document-column-nameare correctly appended whenpretraining_configis provided.
790-801: LGTM: CLI arguments for pretraining mode.New
--block-sizeand--document-column-namearguments are correctly defined with appropriate help text.
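For readers skimming the review, the shape of those arguments is roughly the following. This is a paraphrase of what the PR adds as summarized in this review; the help strings, defaults, and surrounding flow are illustrative rather than copied from main_ds.py.

```python
# Illustrative paraphrase of the new CLI plumbing summarized above; help strings,
# defaults, and flow are not copied from main_ds.py.
import argparse

from instructlab.training import PretrainingConfig

parser = argparse.ArgumentParser()
parser.add_argument(
    "--block-size",
    type=int,
    default=None,
    help="Enable pretraining mode; pack tokenized documents into blocks of this many tokens.",
)
parser.add_argument(
    "--document-column-name",
    type=str,
    default=None,
    help="Dataset column holding raw documents (only meaningful with --block-size).",
)
args = parser.parse_args()

# mirrors the validation/construction described in the 850-861 comment below
if args.document_column_name and args.block_size is None:
    parser.error("--document-column-name requires --block-size")

pretraining_config = None
if args.block_size is not None:
    kwargs = {"document_column_name": args.document_column_name} if args.document_column_name else {}
    pretraining_config = PretrainingConfig(block_size=args.block_size, **kwargs)
```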
850-861: LGTM: Validation and PretrainingConfig construction from CLI args.Good validation that
--document-column-namerequires--block-size. ThePretrainingConfigis correctly constructed only whenblock_sizeis provided, anddocument_column_nameis included via kwargs when specified.src/instructlab/training/sampler.py (5)
5-5: LGTM: Clean imports for pretraining support.The added imports and logger initialization are appropriate for the new pretraining functionality.
Also applies to: 8-8, 17-17, 23-24
385-389: LGTM: Correct length calculation.The
get_lengthsmethod correctly handles both uniform blocks and the partial last block.
447-447: LGTM: Clean integration of pretraining support.The conditional logic cleanly separates pretraining and fine-tuning dataset construction. The logging is helpful, and backward compatibility is maintained with the optional parameter.
Also applies to: 462-462, 467-478
362-362: I'm unable to complete the verification task as the repository failed to clone. Without direct access to the codebase, I cannot examine the referenced code sections to verify the loss token counting logic.However, based on the review comment itself, which is well-reasoned and identifies a specific concern about consistency in loss token counting between
PretrainingBlockDataset(lines 362, 366) andTokenDataset(lines 421-423), this requires manual verification by someone with access to the repository.Loss token counting consistency requires verification against model loss computation.
The comment correctly identifies that for causal language modeling, N tokens should produce N-1 predictions. However, the consistency between
`PretrainingBlockDataset` (which uses N-1 via `max(actual_len - 1, 0)` and `self.block_size - 1`) and `TokenDataset` (which counts `labels != -100`) needs to be verified by reviewing how the model actually computes loss and whether the -1 is applied consistently across both pretraining and fine-tuning modes; a generic sanity check is sketched below.
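The N-tokens-to-N-1-predictions point can be checked with a few lines of PyTorch. This is a generic causal-LM illustration of the counting rule, not the project's loss code.

```python
# Generic illustration: with HF-style causal LM loss, N labeled tokens yield N-1 predictions.
import torch

labels = torch.tensor([11, 12, 13, -100])        # one padded position, as in PretrainingBlockDataset
shift_labels = labels[1:]                        # models shift internally: predict token t+1 from t
num_loss_counted = (shift_labels != -100).sum()  # -> 2, i.e. (actual_len=3) - 1
print(int(num_loss_counted))
```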
313-318: I encountered technical difficulties accessing the repository to complete the verification. However, I can provide an assessment based on the code snippet provided:Memory loading pattern confirmed, but verification incomplete.
The code snippet does exhibit the memory concern described—it loads the entire dataset's
input_idsinto a single list during initialization. While this pattern is technically valid for smaller datasets, I cannot verify:
- Target dataset sizes for this project
- Whether this is a documented limitation or known issue
- If alternative lazy-loading patterns exist elsewhere in the codebase
- Real-world usage constraints and typical dataset sizes
Given that I cannot access the repository to validate the severity and context of this concern, manual verification from the development team is needed to confirm whether this is a blocker for your intended use cases.
This PR exposes a new API which allows the ingestion of datasets containing a `document` field, which then gets tokenized as `input_ids`.
Summary by CodeRabbit
New Features
Tests
Documentation