- [Learning about the training arguments](#learning-about-training-arguments)
  - [`TrainingArgs`](#trainingargs)
## Continual pretraining mode

In addition to instruction tuning, the library can run document-style continual pretraining on raw text corpora. Enable this by supplying a block size when invoking `main_ds.py`:

```bash
torchrun main_ds.py \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --data_path /data/documents.jsonl \
    --ckpt_output_dir ./checkpoints \
    --effective_batch_size 128 \
    --max_batch_len 60000 \
    --block-size 8192 \
    --document-column-name text  # optional, defaults to "document"
```

- `--block-size` (required) enables continual pretraining and controls how many tokens are packed into each block.
- `--document-column-name` (optional) specifies which JSONL field contains the raw document text (see the sketch below).
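
For reference, the corpus at `--data_path` is plain JSONL with one document per line, holding the raw text under the configured field. Here is a minimal sketch of producing such a file, assuming two placeholder documents and the `text` field name used above:

```python
import json

# Two toy documents. The only field the pretraining path reads is the one
# named by --document-column-name ("text" here; the default is "document").
documents = [
    {"text": "First raw document for continual pretraining."},
    {"text": "Second raw document, to be tokenized and packed into blocks."},
]

with open("documents.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")
```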

The same options are available programmatically via `TrainingArgs.pretraining_config`:

```python
from instructlab.training import TrainingArgs, PretrainingConfig

train_args = TrainingArgs(
    model_name_or_path="mistralai/Mistral-7B-v0.1",
    data_path="documents.jsonl",
    ckpt_output_dir="./checkpoints",
    max_seq_len=4096,
    max_batch_len=40000,
    effective_batch_size=128,
    pretraining_config=PretrainingConfig(
        block_size=2048,
        document_column_name="text",  # optional
    ),
)
```
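
From there, training launches the same way as for instruction tuning: hand the arguments to the library's `run_training` entry point along with `TorchrunArgs` (the single-node values below are illustrative):

```python
from instructlab.training import TorchrunArgs, run_training

torch_args = TorchrunArgs(
    nnodes=1,           # number of machines
    nproc_per_node=8,   # GPUs per machine
    node_rank=0,        # rank of this machine
    rdzv_id=123,
    rdzv_endpoint="127.0.0.1:12345",
)

run_training(torch_args=torch_args, train_args=train_args)
```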
When a pretraining config is provided, `process_documents_for_pretraining()` is invoked under the hood to tokenize raw documents before training.
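
Conceptually, that step concatenates the tokenized documents into one token stream and slices it into `block_size`-length training examples. The sketch below illustrates the idea only; it is not the library's implementation, and `tokenize` is a stand-in for whatever tokenizer the model uses:

```python
from typing import Callable

def pack_into_blocks(
    documents: list[str],
    tokenize: Callable[[str], list[int]],
    block_size: int,
) -> list[list[int]]:
    """Illustrative block packing: join token streams, cut fixed-size blocks."""
    stream: list[int] = []
    for doc in documents:
        stream.extend(tokenize(doc))
    # Drop any trailing partial block so every example is exactly block_size long.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size : (i + 1) * block_size] for i in range(n_blocks)]
```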

The library now supports an optional `reasoning_content` field in addition to the standard `content` field.

**Standard message structure:**

```json
{
    "role": "assistant",
    "content": "..."
}
```

### Important Notes

1. **Automatic reasoning content processing**: If `reasoning_content` exists in a message, it will always be processed and unmasked as long as the message role is targeted for unmasking. This ensures that reasoning traces are properly included in the training data.