Skip to content

[Bug] Error loading blended dataset generated with the prepare tool from the dataset config file #297

@bigximik

Description

@bigximik

🐞 Describe the Bug

An error occurs in cache management during the loading of the generated blended dataset config file.

🔄 Steps to Reproduce

Steps to reproduce the behavior:

  • The dataset was tokenized using the prebuilt Fast-LLM image: ghcr.io/servicenow/fast-llm:sha-338bb62
  • Tokenizer: Llama-3.1-8B
  • Dataset: fineweb-edu
  • Create a basic training config and specify the generated dataset config for training
  • The error occurs during dataset loading
RuntimeError: Invalid dataset cache for dataset __mnt__datasets__tokenized__Llama-3.1-8B__fineweb-edu__shard_0_1. If this is due to an intended configuration change, please delete the cache before continuing.
Current config:
config:
  seed: 785266
dataset:
  documents_per_epoch: 1005784
  name: __mnt__datasets__tokenized__Llama-3.1-8B__fineweb-edu__shard_0_1
  tokens_per_epoch: 1041306959
num_samples: 1861
sequence_length: 4096
truncate_documents: true
unshuffled_epochs: 0
unshuffled_tokens: 0
Cached config:
config:
  seed: 784569
dataset:
  documents_per_epoch: 1005784
  name: __mnt__datasets__tokenized__Llama-3.1-8B__fineweb-edu__shard_0_0
  tokens_per_epoch: 1043542365
num_samples: 1865
sequence_length: 4096
truncate_documents: true
unshuffled_epochs: 0
unshuffled_tokens: 0

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions