@Niccolo-Ajroldi
Member

@Niccolo-Ajroldi Niccolo-Ajroldi commented Mar 19, 2025

Language Modeling Workload - First Steps

Changes:

  • Implemented the LM workload base class and its JAX and PyTorch subclasses
  • Renamed datasets to dataset to avoid a conflict with the Hugging Face datasets library. This is a provisional name; the docs will need to be updated accordingly.
  • Added download_finewebedu to dataset/dataset_setup.py: this function downloads, tokenizes, and chunks the data into blocks of seq_len+1. It uses batched mapping and multiprocessing to speed up tokenization and concatenation/chunking.
  • Implemented dataloaders in JAX and PyTorch
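
For readers unfamiliar with the chunking step, here is a minimal sketch of the concatenate-and-chunk idea described above. It is not the code from this PR: the names chunk_batch and split_inputs_targets and the "input_ids" column are hypothetical, and the assumption that blocks have length seq_len+1 so that inputs and next-token targets can be taken as shifted views of the same block is mine, not something stated in the description.

```python
from itertools import chain


def chunk_batch(batch, seq_len):
    """Concatenate a batch of token lists and split it into blocks of seq_len + 1.

    Hypothetical helper, written in the dict-in/dict-out shape that a batched
    map function typically takes. The trailing partial block is dropped.
    """
    block = seq_len + 1
    # Flatten all tokenized documents in the batch into one token stream.
    tokens = list(chain.from_iterable(batch["input_ids"]))
    # Keep only as many tokens as fill whole blocks.
    usable = (len(tokens) // block) * block
    return {"input_ids": [tokens[i:i + block] for i in range(0, usable, block)]}


def split_inputs_targets(block):
    """Assumed use of the extra token: targets are the inputs shifted by one."""
    return block[:-1], block[1:]


# Toy batch of three already-tokenized documents (made-up token ids).
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
chunks = chunk_batch(batch, seq_len=3)
inputs, targets = split_inputs_targets(chunks["input_ids"][0])
```

With the Hugging Face datasets library, a function of this shape could be passed to Dataset.map with batched=True and num_proc set to the number of worker processes, which is one way to get the batched, multiprocess behavior mentioned in the list above. Note that chunking per batch discards up to seq_len tokens at the end of every batch.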

@Niccolo-Ajroldi Niccolo-Ajroldi requested a review from a team as a code owner March 19, 2025 08:54
@github-actions

github-actions bot commented Mar 19, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@priyakasimbeg
Contributor

Thanks Niccolo!
Could you just run the yapf and isort tools to fix the failing tests?

@Niccolo-Ajroldi
Member Author

Fixed!

@rka97 rka97 merged commit 46b645b into mlcommons:lm_workload Mar 26, 2025
16 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2025

3 participants