LM workload - first steps #856

Niccolo-Ajroldi · 2025-03-19T08:54:43Z

Language Modeling Workload - First Steps

Changes:

Implemented the LM workload base class definition, along with JAX and PyTorch subclasses.
Renamed datasets to dataset to avoid a conflict with the Hugging Face datasets library. This is a provisional name; the docs would need to be updated accordingly.
Added download_finewebedu to dataset/dataset_setup.py: this function downloads, tokenizes, and chunks the data into blocks of seq_len+1. It uses batched mapping and multiprocessing to speed up tokenization and concatenation/chunking.
Implemented dataloaders in JAX and PyTorch.
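
For readers unfamiliar with the chunking step, here is a minimal sketch of concatenating tokenized documents and cutting them into blocks of seq_len+1. It is not the code from this PR: the names `chunk_token_batch` and `split_inputs_targets` and the `input_ids` key are assumptions made for illustration, and dropping the trailing remainder is one possible choice. A block of seq_len+1 tokens yields seq_len inputs and seq_len next-token targets after a one-position shift.

```python
from itertools import chain


def chunk_token_batch(batch, seq_len):
    """Concatenate tokenized documents and cut them into blocks of seq_len + 1.

    `batch` is a dict with an "input_ids" list of token-id lists (assumed
    layout). Tokens that do not fill a whole block are dropped.
    """
    flat = list(chain.from_iterable(batch["input_ids"]))
    block = seq_len + 1
    usable = (len(flat) // block) * block  # largest multiple of block size
    return {"input_ids": [flat[i:i + block] for i in range(0, usable, block)]}


def split_inputs_targets(block):
    """Shift by one token: inputs are block[:-1], targets are block[1:]."""
    return block[:-1], block[1:]


# Three toy "documents" that are already tokenized.
example = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]]}
chunks = chunk_token_batch(example, seq_len=3)
inputs, targets = split_inputs_targets(chunks["input_ids"][0])
```

A function with this batch-in, batch-out shape is the kind that can be passed to the Hugging Face `Dataset.map` method with `batched=True` and `num_proc` set, which is one way to get the batched, multi-process behaviour described above.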


github-actions · 2025-03-19T08:54:57Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

priyakasimbeg · 2025-03-19T21:28:03Z

Thanks Niccolo!
Could you just run the yapf and isort tools to fix the failing tests?

Niccolo-Ajroldi · 2025-03-20T10:59:38Z

Fixed!

priyakasimbeg and others added 20 commits February 4, 2025 17:08

Merge pull request mlcommons#825 from mlcommons/dev

bf61255

[do not merge] Dev -> Main

Merge pull request mlcommons#843 from mlcommons/dev

9653f18

Dev -> main

Merge pull request mlcommons#847 from mlcommons/dev

1d81455

Dev -> main

first LM commit

da5f85a

lm data pipeline

a12a364

testing

ca83ab8

LM workload tested torch pipeline

e3e78dc

LM workload - fix torch tests

e619495

add LM tests, remove dev files

d8e9c56

add LM tests, remove dev files

6b4ff12

Stop tracking .gitignore

3c5c847

Remove dev/ from repo, keep locally

20d841b

fix comments

f3ba059

add class specifications

381451f

add workload LM info

f111d2e

restore data_utils.py tree map

808d398

fixed NFS bug

35f8f89

train/val split before concat

cbb6ee6

renamed datasets to avoid conflict with HF

868987c

Merge remote-tracking branch 'upstream/lm_workload' into lm_workload

8191f6d

Niccolo-Ajroldi requested a review from a team as a code owner March 19, 2025 08:54

renamed datasets to dataset

dd59ded

Niccolo-Ajroldi added 6 commits March 20, 2025 10:52

fix style

496b9c3

fix formatting

50989eb

fix style

5af0fdc

fix style

2683099

fix yapf

6b7ee29

fix style

46b645b

rka97 merged commit 46b645b into mlcommons:lm_workload Mar 26, 2025
16 checks passed

github-actions bot locked and limited conversation to collaborators Mar 26, 2025
