jlamypoirier commented on Oct 18, 2025

✨ Description

New format for memmap datasets, which allows for more varied input types:

  • MemmapDataset is now agnostic to the sample type it stores and delegates most of the work to dynamic readers and writers.
  • Associate each sample type with a dynamic reader config holding metadata for the stored dataset (e.g. dynamic type, buffer range, document and token counts), and a reader/writer pair that handles the actual data (see the sketch after this list).
  • Keep the existing memmap dataset for backward compatibility ("legacy memmap"), but remove dataset writing capability (except in test_match_megatron).
  • Simplify the GPT data preparator. Merge the multiple tokenization methods into a single _prepare_sample, and merge tokenization with saving in a single "prepare" loop. The preparator now iterates through the dataset only once and no longer needs to keep the whole thing in memory.
  • Minor changes to the preparator config: replace tokens_per_shard with documents_per_shard (computing tokens required an extra pass through the dataset and gave no major benefit). Merge the multiple worker count configs into a single num_workers. Adjust entries in the source schema and remove an unnecessary dynamic class.
  • Add end-to-end tests for the data preparator. Replace the random testing datasets with improved versions based on the dataset preparator. Adjust and improve related tests.
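
A minimal sketch of the reader-config pairing described above; the class and field names here are illustrative assumptions, not the actual Fast-LLM classes:

```python
# Illustrative only: hypothetical names, not the actual Fast-LLM implementation.
import abc
import dataclasses


@dataclasses.dataclass
class ReaderConfig(abc.ABC):
    # Metadata about the stored data: where it lives in the file and its size.
    begin: int            # buffer range start (bytes)
    end: int              # buffer range end (bytes)
    num_documents: int
    num_tokens: int

    @abc.abstractmethod
    def get_reader(self, buffer: memoryview) -> "Reader":
        """Build the reader that knows how to interpret this sample type."""


class Reader(abc.ABC):
    @abc.abstractmethod
    def get_document(self, index: int):
        """Return document `index` decoded from the underlying buffer."""
```

With a split along these lines, MemmapDataset only needs to locate the config in the file and call get_reader; everything sample-type-specific lives in the config/reader pair.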

File structure:

  • Hard-coded header ("fast_llm_prepared_dataset")
  • Pointer to the reader config (int64). The actual config is not written here because it isn't available until the whole dataset has been written.
  • Reader-specific content.
  • Reader config (json-serialized).
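
A rough sketch of writing this layout, assuming little-endian encoding and an absolute byte offset for the pointer (neither is stated above, so treat both as guesses):

```python
import json
import struct

HEADER = b"fast_llm_prepared_dataset"


def write_prepared_dataset(path, write_reader_content, reader_config: dict):
    """Hypothetical writer for the layout above: header, int64 pointer,
    reader-specific content, then the json-serialized reader config."""
    with open(path, "wb") as stream:
        stream.write(HEADER)
        pointer_offset = stream.tell()
        stream.write(struct.pack("<q", 0))       # placeholder, filled in at the end
        write_reader_content(stream)             # reader-specific content
        config_offset = stream.tell()
        stream.write(json.dumps(reader_config).encode("utf-8"))
        stream.seek(pointer_offset)
        stream.write(struct.pack("<q", config_offset))  # config location now known
```

Writing the pointer last is what allows the config (document and token counts, buffer ranges, etc.) to be produced only after the whole dataset has been streamed through.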

Three reader types are currently implemented.

  1. Tokens:
    • Hard-coded header
    • Tokens
    • Cumulative sums of document token counts (for locating begin/end of documents)
    • Hard-coded footer.
  2. Range:
    • Hard-coded header
    • Ranges
    • Cumulative sums of number of ranges in each document (for locating begin/end of documents)
    • Hard-coded footer.
  3. Language model:
    • Hard-coded header
    • Token reader content
    • (Optional) Loss masking span (range) reader content.
    • (Optional) Chosen span (range) reader content.
    • (Optional) Rejected span (range) reader content.
    • Hard-coded footer.
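
For the token reader, the cumulative sums make locating a document a simple slice of the token buffer; a hedged illustration (names and dtypes are assumptions):

```python
import numpy as np


def get_document_tokens(tokens: np.ndarray, size_cumsums: np.ndarray, index: int) -> np.ndarray:
    """Return the tokens of document `index` using cumulative document token counts."""
    begin = 0 if index == 0 else int(size_cumsums[index - 1])
    end = int(size_cumsums[index])
    return tokens[begin:end]


# Example: three documents with 3, 2 and 4 tokens.
tokens = np.arange(9)
size_cumsums = np.array([3, 5, 9])
assert get_document_tokens(tokens, size_cumsums, 1).tolist() == [3, 4]
```

The range readers work the same way, with cumulative range counts locating each document's spans instead of its tokens.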

@jlamypoirier jlamypoirier marked this pull request as ready for review October 29, 2025 23:45