jlamypoirier commented on Oct 18, 2025

✨ Description

New format for memmap datasets, which allows for more varied input types:

  • MemmapDataset is now agnostic to the sample type it stores and delegates most of the work to dynamic readers and writers.
  • Associate each sample type with a dynamic reader config holding metadata for the stored dataset (e.g. dynamic type, buffer range, document and token counts), and a reader/writer pair that handles the actual data (see the sketch after this list).
  • Keep the existing memmap dataset for backward compatibility ("legacy memmap"), but remove dataset writing capability (except in test_match_megatron).
  • Simplify the GPT data preparator. Merge the multiple tokenization methods into a single _prepare_sample, and merge tokenization with saving in a single "prepare" loop. The preparator now iterates through the dataset only once and no longer needs to keep the whole thing in memory.
  • Minor changes to the preparator config: replace tokens_per_shard with documents_per_shard (computing tokens required an extra pass through the dataset and gave no major benefit). Merge the multiple worker count configs into a single num_workers. Adjust entries in the source schema and remove an unnecessary dynamic class.
  • Add end-to-end tests for the data preparator. Replace the random testing datasets with improved versions based on the dataset preparator. Adjust and improve related tests.
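
A minimal sketch of the reader-config pairing described above; the class and field names here are illustrative assumptions, not the actual Fast-LLM classes:

```python
# Illustrative only: hypothetical names, not the actual Fast-LLM implementation.
import abc
import dataclasses


@dataclasses.dataclass
class ReaderConfig(abc.ABC):
    # Metadata about the stored data: where it lives in the file and its size.
    begin: int            # buffer range start (bytes)
    end: int              # buffer range end (bytes)
    num_documents: int
    num_tokens: int

    @abc.abstractmethod
    def get_reader(self, buffer: memoryview) -> "Reader":
        """Build the reader that knows how to interpret this sample type."""


class Reader(abc.ABC):
    @abc.abstractmethod
    def get_document(self, index: int):
        """Return document `index` decoded from the underlying buffer."""
```

With a split along these lines, MemmapDataset only needs to locate the config in the file and call get_reader; everything sample-type-specific lives in the config/reader pair.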

File structure:

  • Hard-coded header ("fast_llm_prepared_dataset")
  • Pointer to the reader config (int64). The actual config is not written here because it isn't available until the whole dataset has been written.
  • Reader-specific content.
  • Reader config (json-serialized).
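
A rough sketch of writing this layout, assuming little-endian encoding and an absolute byte offset for the pointer (neither is stated above, so treat both as guesses):

```python
import json
import struct

HEADER = b"fast_llm_prepared_dataset"


def write_prepared_dataset(path, write_reader_content, reader_config: dict):
    """Hypothetical writer for the layout above: header, int64 pointer,
    reader-specific content, then the json-serialized reader config."""
    with open(path, "wb") as stream:
        stream.write(HEADER)
        pointer_offset = stream.tell()
        stream.write(struct.pack("<q", 0))       # placeholder, filled in at the end
        write_reader_content(stream)             # reader-specific content
        config_offset = stream.tell()
        stream.write(json.dumps(reader_config).encode("utf-8"))
        stream.seek(pointer_offset)
        stream.write(struct.pack("<q", config_offset))  # config location now known
```

Writing the pointer last is what allows the config (document and token counts, buffer ranges, etc.) to be produced only after the whole dataset has been streamed through.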

Three reader types are currently implemented.

  1. Tokens:
    • Hard-coded header
    • Tokens
    • Cumulative sums of document token counts (for locating begin/end of documents)
    • Hard-coded footer.
  2. Range:
    • Hard-coded header
    • Ranges
    • Cumulative sums of number of ranges in each document (for locating begin/end of documents)
    • Hard-coded footer.
  3. Language model:
    • Hard-coded header
    • Token reader content
    • (Optional) Loss masking span (range) reader content.
    • (Optional) Chosen span (range) reader content.
    • (Optional) Rejected span (range) reader content.
    • Hard-coded footer.
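
For the token reader, the cumulative sums make locating a document a simple slice of the token buffer; a hedged illustration (names and dtypes are assumptions):

```python
import numpy as np


def get_document_tokens(tokens: np.ndarray, size_cumsums: np.ndarray, index: int) -> np.ndarray:
    """Return the tokens of document `index` using cumulative document token counts."""
    begin = 0 if index == 0 else int(size_cumsums[index - 1])
    end = int(size_cumsums[index])
    return tokens[begin:end]


# Example: three documents with 3, 2 and 4 tokens.
tokens = np.arange(9)
size_cumsums = np.array([3, 5, 9])
assert get_document_tokens(tokens, size_cumsums, 1).tolist() == [3, 4]
```

The range readers work the same way, with cumulative range counts locating each document's spans instead of its tokens.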

@jlamypoirier jlamypoirier marked this pull request as ready for review October 29, 2025 23:45