Open
🐞 Describe the Bug
An error occurs in cache management during the loading of the generated blended dataset config file.
🔄 Steps to Reproduce
Steps to reproduce the behavior:
- The dataset was tokenized using the prebuilt Fast-LLM image: ghcr.io/servicenow/fast-llm:sha-338bb62
- Tokenizer: Llama-3.1-8B
- Dataset: fineweb-edu
- Create a basic training config and specify the generated dataset config for training
- The error occurs during dataset loading
RuntimeError: Invalid dataset cache for dataset __mnt__datasets__tokenized__Llama-3.1-8B__fineweb-edu__shard_0_1. If this is due to an intended configuration change, please delete the cache before continuing.
Current config:
config:
  seed: 785266
dataset:
  documents_per_epoch: 1005784
  name: __mnt__datasets__tokenized__Llama-3.1-8B__fineweb-edu__shard_0_1
  tokens_per_epoch: 1041306959
num_samples: 1861
sequence_length: 4096
truncate_documents: true
unshuffled_epochs: 0
unshuffled_tokens: 0
Cached config:
config:
  seed: 784569
dataset:
  documents_per_epoch: 1005784
  name: __mnt__datasets__tokenized__Llama-3.1-8B__fineweb-edu__shard_0_0
  tokens_per_epoch: 1043542365
num_samples: 1865
sequence_length: 4096
truncate_documents: true
unshuffled_epochs: 0
unshuffled_tokens: 0
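
To make the mismatch easier to spot, the two dumps above can be compared with a small recursive dictionary diff. This is only an illustrative sketch: `diff_configs` is a hypothetical helper, not part of Fast-LLM, and the nesting of the keys (`seed` under `config`; `documents_per_epoch`, `name`, and `tokens_per_epoch` under `dataset`) is an assumption inferred from the flattened output.

```python
# Hypothetical helper (not a Fast-LLM API): list the leaves that differ
# between the current and the cached sampling configs from the error above.
# The key nesting is assumed from the flattened dump in this report.

def diff_configs(current, cached, prefix=""):
    """Return {dotted_key: (current_value, cached_value)} for differing leaves."""
    diffs = {}
    for key in sorted(set(current) | set(cached)):
        path = f"{prefix}{key}"
        a, b = current.get(key), cached.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            diffs.update(diff_configs(a, b, prefix=path + "."))
        elif a != b:
            diffs[path] = (a, b)
    return diffs


current = {
    "config": {"seed": 785266},
    "dataset": {
        "documents_per_epoch": 1005784,
        "name": "__mnt__datasets__tokenized__Llama-3.1-8B__fineweb-edu__shard_0_1",
        "tokens_per_epoch": 1041306959,
    },
    "num_samples": 1861,
    "sequence_length": 4096,
    "truncate_documents": True,
    "unshuffled_epochs": 0,
    "unshuffled_tokens": 0,
}

cached = {
    "config": {"seed": 784569},
    "dataset": {
        "documents_per_epoch": 1005784,
        "name": "__mnt__datasets__tokenized__Llama-3.1-8B__fineweb-edu__shard_0_0",
        "tokens_per_epoch": 1043542365,
    },
    "num_samples": 1865,
    "sequence_length": 4096,
    "truncate_documents": True,
    "unshuffled_epochs": 0,
    "unshuffled_tokens": 0,
}

differences = diff_configs(current, cached)
for key, (cur, old) in differences.items():
    print(f"{key}: current={cur!r} cached={old!r}")
```

Under that assumed nesting, the differing fields are the seed, the dataset name (`shard_0_1` versus `shard_0_0`), `tokens_per_epoch`, and `num_samples`. In other words, the cache being validated for `shard_0_1` holds values recorded for `shard_0_0`.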