An end-to-end implementation of a language model from scratch, including tokenization and training.
Thanks to CS 336 for the setup and scaffolding.
Download the TinyStories dataset used to train the tokenizer and language model.
mkdir -p data
cd data
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt -O train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt -O valid.txt
cd ..

To train the BPE tokenizer and encode the training/validation data:
uv run src/scripts/train_tokenizer.py --dataset=data/train.txt --vocab_size=10000 --out=data/tokenizer
# note: encoding may take a few minutes (~2GB of text)
uv run src/scripts/encode.py --tokenizer=data/tokenizer --file=data/train.txt --out=data/train.npy --processes=16
uv run src/scripts/encode.py --tokenizer=data/tokenizer --file=data/valid.txt --out=data/valid.npy --processes=16

The trained tokenizer will be saved to the output directory in two files: merges.txt and vocab.json.
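As a quick sanity check, the encoded outputs can be inspected with NumPy. This is a minimal sketch that assumes encode.py writes a flat 1-D array of token IDs; the uint16 dtype shown in the comment is a guess based on the 10k vocabulary, not something confirmed by the script.

```python
import numpy as np

# Memory-map the encoded training tokens instead of loading ~GBs into RAM.
# Assumption: encode.py saves a flat 1-D array of token IDs (likely uint16 for a 10k vocab).
tokens = np.load("data/train.npy", mmap_mode="r")
print(tokens.shape, tokens.dtype)
print(tokens[:32])  # first few token IDs of the training set
```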
To train the language model:
uv run src/scripts/train_model.py \
--train-dataset=data/train.npy \
--val-dataset=data/valid.npy \
--lr=0.001 \
--batch-size=32 \
--checkpoint-dir=data/checkpoints

The model architecture uses pre-norm with RMSNorm, rotary positional embeddings (RoPE), SwiGLU activations, and multi-head attention (MHA); a sketch of a single block follows the parameter list below.
The model has the following default parameters (total ~23M):
- d_model: 512
- num_heads: 16
- d_ff: 1344
- num_layers: 4
- context_length: 256
- RoPE with theta=10,000
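For orientation, the pieces fit together roughly as below. This is a sketch under stated assumptions, not the repo's actual code: the class names are illustrative, RMSNorm's eps is a typical default, and RoPE is omitted from the attention call for brevity (stock `nn.MultiheadAttention` does not apply it).

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square of the features (no mean subtraction)."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: W2(SiLU(W1 x) * W3 x)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))


class PreNormBlock(nn.Module):
    """Pre-norm block: normalize *before* each sublayer, then add the residual.

    RoPE (rotation of queries/keys) is omitted here for brevity.
    """

    def __init__(self, d_model: int = 512, num_heads: int = 16, d_ff: int = 1344):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, bias=False, batch_first=True)
        self.ffn_norm = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.ffn_norm(x))


# Usage: a causal mask blocks attention to future positions.
# T = 256
# mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
# y = PreNormBlock()(torch.randn(8, T, 512), mask)
```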
Training uses AdamW with linear warm-up followed by cosine annealing of the learning rate. Batches are randomly sampled and loaded from the encoded training tokens; a sketch of the schedule and batch sampling is below.
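The schedule and sampling can be sketched roughly as follows. Treat this as an illustration rather than the script's actual code: the function and argument names are made up here and may not match train_model.py.

```python
import math

import numpy as np
import torch


def lr_schedule(step: int, max_lr: float, warmup_steps: int, total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warm-up to max_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def get_batch(tokens: np.ndarray, batch_size: int, context_length: int, device: str = "cpu"):
    """Sample random windows from the token stream; targets are the inputs shifted by one."""
    starts = np.random.randint(0, len(tokens) - context_length, size=batch_size)
    x = torch.stack([torch.from_numpy(tokens[s : s + context_length].astype(np.int64)) for s in starts])
    y = torch.stack([torch.from_numpy(tokens[s + 1 : s + 1 + context_length].astype(np.int64)) for s in starts])
    return x.to(device), y.to(device)


# e.g. x, y = get_batch(np.load("data/train.npy", mmap_mode="r"), batch_size=32, context_length=256)
```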
To see all available options, run uv run src/scripts/train_model.py --help.
To run inference interactively:
uv run src/scripts/inference.py --checkpoint=path/to/ckpt.pt

Or, on a single prompt:
uv run src/scripts/inference.py --checkpoint=path/to/ckpt.pt --prompt="There was once a"

All experiments were done on an 11-core M3 Pro with 18 GB RAM.
The training loss starts to diverge at around lr=1e-2. The optimal lr seems to be around 1e-3: lr=3e-3 struggles to converge, and lr=3e-4 converges slightly more slowly.
Without RMSNorm, the loss explodes after a few hundred steps.

TODO: compare pre-norm vs post-norm.
We compared the performance of a model with and without RoPE. Without any positional embeddings, the model cannot identify the exact position of a token, nor the distance between two tokens.
Surprisingly, the difference in performance is quite small. This is likely because TinyStories is extremely simple: predicting the next word from an unordered bag of the preceding tokens is often feasible.
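For context, RoPE rotates each (even, odd) pair of query/key features by an angle that grows with the token's position, so the dot products used in attention depend on the relative offset between tokens. A minimal sketch (the function name and pairing layout are illustrative; the repo's implementation may differ):

```python
import torch


def apply_rope(x: torch.Tensor, theta: float = 10_000.0) -> torch.Tensor:
    """Rotate feature pairs of x (shape: ..., seq_len, d_head) by position-dependent angles.

    Pairs dimension 2i with 2i+1 and rotates them by pos / theta**(2i / d_head), so
    dot products between rotated queries and keys depend on relative position only.
    """
    *_, seq_len, d_head = x.shape
    half = d_head // 2
    inv_freq = theta ** (-torch.arange(0, half, dtype=torch.float32) / half)           # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # even / odd feature pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```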
We also compared feed-forward activations at matched parameter counts: SwiGLU vs. SiLU vs. GeGLU vs. GeLU.
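For reference, a gated feed-forward layer (SwiGLU/GeGLU) adds a third projection, so matching parameter counts means shrinking its hidden size to roughly 2/3 of the ungated one. A sketch of the four variants under that assumption (the class names are illustrative, not the repo's):

```python
import torch.nn as nn
import torch.nn.functional as F


class GatedFFN(nn.Module):
    """Gated feed-forward: W2(act(W1 x) * W3 x). SwiGLU when act=silu, GeGLU when act=gelu."""

    def __init__(self, d_model: int, d_ff: int, act=F.silu):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.act = act

    def forward(self, x):
        return self.w2(self.act(self.w1(x)) * self.w3(x))


class PlainFFN(nn.Module):
    """Ungated feed-forward: W2(act(W1 x)). SiLU or GeLU."""

    def __init__(self, d_model: int, d_ff: int, act=F.silu):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.act = act

    def forward(self, x):
        return self.w2(self.act(self.w1(x)))


d_model = 512
# A gated FFN has three d_model x d_ff matrices instead of two, so using 2/3 of the
# hidden size keeps parameters comparable:
#   gated: 3 * 512 * 1344 = 2,064,384   plain: 2 * 512 * 2016 = 2,064,384
swiglu   = GatedFFN(d_model, d_ff=1344, act=F.silu)
geglu    = GatedFFN(d_model, d_ff=1344, act=F.gelu)
silu_ffn = PlainFFN(d_model, d_ff=2016, act=F.silu)
gelu_ffn = PlainFFN(d_model, d_ff=2016, act=F.gelu)
```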

