Commit fede3e7

add docs

1 parent 75aa3bc commit fede3e7
File tree

1 file changed: +42 -1 lines changed

README.md

Lines changed: 42 additions & 1 deletion
@@ -25,6 +25,7 @@ The library now supports reasoning traces through the `reasoning_content` field
- [Using the library](#using-the-library)
- [Data format](#data-format)
- [Reasoning content support](#reasoning-content-support-1)
- [Continual pretraining mode](#continual-pretraining-mode)
- [Documentation](#documentation)
- [Learning about the training arguments](#learning-about-training-arguments)
- [`TrainingArgs`](#trainingargs)
@@ -122,6 +123,46 @@ The library now supports an optional `reasoning_content` field in addition to th
}
```

## Continual pretraining mode

In addition to instruction tuning, the library can run document-style continual pretraining on raw text corpora. Enable this by supplying a block size when invoking `main_ds.py`:

```bash
torchrun main_ds.py \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --data_path /data/documents.jsonl \
  --ckpt_output_dir ./checkpoints \
  --effective_batch_size 128 \
  --max_batch_len 60000 \
  --block-size 8192 \
  --document-column-name text  # optional, defaults to "document"
```

- `--block-size` (required) toggles continual pretraining and controls how many tokens are packed into each block.
- `--document-column-name` (optional) specifies which JSONL field contains the raw document text (see the sample record below).
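For reference, here is a sketch of what a single line of the input JSONL could look like under the default column name; the record contents are invented for illustration:

```json
{"document": "Mistral 7B is a decoder-only transformer language model released by Mistral AI ..."}
```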

The same options are available programmatically via `TrainingArgs.pretraining_config`:

```python
from instructlab.training import TrainingArgs, PretrainingConfig

train_args = TrainingArgs(
    model_name_or_path="mistralai/Mistral-7B-v0.1",
    data_path="documents.jsonl",
    ckpt_output_dir="./checkpoints",
    max_seq_len=4096,
    max_batch_len=40000,
    effective_batch_size=128,
    pretraining_config=PretrainingConfig(
        block_size=2048,
        document_column_name="text",  # optional
    ),
)
```
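As a usage sketch, the config above would then be handed to the library's `run_training` entry point together with torchrun launch settings; the single-node values below are illustrative assumptions:

```python
from instructlab.training import TorchrunArgs, run_training

# Single-node, 8-GPU launch settings (illustrative values).
torch_args = TorchrunArgs(
    nnodes=1,
    node_rank=0,
    nproc_per_node=8,
    rdzv_id=123,
    rdzv_endpoint="127.0.0.1:12345",
)

# Kicks off continual pretraining with the block-packing config above.
run_training(torch_args=torch_args, train_args=train_args)
```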

When a pretraining config is provided, `process_documents_for_pretraining()` is invoked under the hood to tokenize raw documents before training.
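Conceptually, that preprocessing step tokenizes each document, concatenates the results into one token stream, and slices the stream into fixed-size blocks. Below is a minimal sketch of the idea, not the library's actual implementation; the helper name and demo values are hypothetical:

```python
from transformers import AutoTokenizer

def pack_documents_into_blocks(documents, tokenizer, block_size):
    """Tokenize raw documents, join them into one stream, and cut fixed-size blocks."""
    stream = []
    for doc in documents:
        # An EOS token between documents keeps their boundaries visible in the stream.
        stream.extend(tokenizer(doc)["input_ids"] + [tokenizer.eos_token_id])
    # Drop the trailing remainder so every block is exactly block_size tokens long.
    return [
        stream[i : i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
docs = ["First raw document ...", "Second raw document ..."]
blocks = pack_documents_into_blocks(docs, tokenizer, block_size=8)  # tiny blocks for demo
```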
**Standard message structure:**

```json
@@ -139,7 +180,7 @@ The library now supports an optional `reasoning_content` field in addition to th
}
```

-#### Important Notes
+### Important Notes

1. **Automatic reasoning content processing**: If `reasoning_content` exists in a message, it will always be processed and unmasked as long as the message role is targeted for unmasking. This ensures that reasoning traces are properly included in the training data (see the sketch below).
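For illustration, a message carrying the optional field might look like this; the field values are invented for the example:

```json
{
  "role": "assistant",
  "content": "The answer is 4.",
  "reasoning_content": "2 + 2 = 4, so the final answer is 4."
}
```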
