An end-to-end implementation of a language model from scratch, including tokenization and training.
Thanks to CS 336 for the setup and scaffolding.
Download the TinyStories dataset used to train the tokenizer and language model.
mkdir -p data
cd data
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt -O train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt -O valid.txt
cd ..

To train the BPE tokenizer and encode the training/validation data:
uv run src/scripts/train_tokenizer.py --dataset=data/train.txt --vocab_size=10000 --out=data/tokenizer
# note: encoding may take a few minutes (~2GB of text)
uv run src/scripts/encode.py --tokenizer=data/tokenizer --file=data/train.txt --out=data/train.npy --processes=16
uv run src/scripts/encode.py --tokenizer=data/tokenizer --file=data/valid.txt --out=data/valid.npy --processes=16

The trained tokenizer will be saved to the output directory in two files: merges.txt and vocab.json.
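As a quick sanity check, the encoded outputs can be inspected with NumPy. This is a minimal sketch that assumes encode.py writes a flat 1-D array of token IDs; the uint16 dtype shown in the comment is a guess based on the 10k vocabulary, not something confirmed by the script.

```python
import numpy as np

# Memory-map the encoded training tokens instead of loading ~GBs into RAM.
# Assumption: encode.py saves a flat 1-D array of token IDs (likely uint16 for a 10k vocab).
tokens = np.load("data/train.npy", mmap_mode="r")
print(tokens.shape, tokens.dtype)
print(tokens[:32])  # first few token IDs of the training set
```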
To train the language model:
uv run src/scripts/train_model.py \
--train-dataset=data/train.npy \
--val-dataset=data/valid.npy \
--lr=0.001 \
--batch-size=32 \
--checkpoint-dir=data/checkpoints

The model architecture uses pre-norm with RMSNorm, rotary positional embeddings (RoPE), SwiGLU activations, and multi-head attention (MHA); a sketch of a single block follows the parameter list below.
The model has the following default parameters (total ~23M):
- d_model: 512
- num_heads: 16
- d_ff: 1344
- num_layers: 4
- context_length: 256
- RoPE with theta=10,000
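For orientation, the pieces fit together roughly as below. This is a sketch under stated assumptions, not the repo's actual code: the class names are illustrative, RMSNorm's eps is a typical default, and RoPE is omitted from the attention call for brevity (stock `nn.MultiheadAttention` does not apply it).

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square of the features (no mean subtraction)."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: W2(SiLU(W1 x) * W3 x)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))


class PreNormBlock(nn.Module):
    """Pre-norm block: normalize *before* each sublayer, then add the residual.

    RoPE (rotation of queries/keys) is omitted here for brevity.
    """

    def __init__(self, d_model: int = 512, num_heads: int = 16, d_ff: int = 1344):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, bias=False, batch_first=True)
        self.ffn_norm = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.ffn_norm(x))


# Usage: a causal mask blocks attention to future positions.
# T = 256
# mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
# y = PreNormBlock()(torch.randn(8, T, 512), mask)
```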
Training uses AdamW with linear warm-up followed by cosine annealing of the learning rate. Batches are randomly sampled and loaded from the encoded training tokens; a sketch of the schedule and batch sampling is below.
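The schedule and sampling can be sketched roughly as follows. Treat this as an illustration rather than the script's actual code: the function and argument names are made up here and may not match train_model.py.

```python
import math

import numpy as np
import torch


def lr_schedule(step: int, max_lr: float, warmup_steps: int, total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warm-up to max_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def get_batch(tokens: np.ndarray, batch_size: int, context_length: int, device: str = "cpu"):
    """Sample random windows from the token stream; targets are the inputs shifted by one."""
    starts = np.random.randint(0, len(tokens) - context_length, size=batch_size)
    x = torch.stack([torch.from_numpy(tokens[s : s + context_length].astype(np.int64)) for s in starts])
    y = torch.stack([torch.from_numpy(tokens[s + 1 : s + 1 + context_length].astype(np.int64)) for s in starts])
    return x.to(device), y.to(device)


# e.g. x, y = get_batch(np.load("data/train.npy", mmap_mode="r"), batch_size=32, context_length=256)
```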
To see all available options, run uv run src/scripts/train_model.py --help.
To run inference interactively:
uv run src/scripts/inference.py --checkpoint=path/to/ckpt.pt

Or, on a single prompt:
uv run src/scripts/inference.py --checkpoint=path/to/ckpt.pt --prompt="There was once a"

All experiments were done on an 11-core M3 Pro with 18 GB RAM.
The training loss starts to diverge at around lr=1e-2. The optimal lr seems to be around 1e-3: lr=3e-3 struggles to converge, and lr=3e-4 converges slightly more slowly.
Without RMSNorm, the loss explodes after a few hundred steps.

TODO: compare pre-norm vs post-norm.
We compared the performance of a model with and without RoPE. Without any positional embeddings, the model cannot identify the exact position of a token, nor the distance between two tokens.
Surprisingly, the difference in performance is quite small. This is likely because TinyStories is extremely simple: predicting the next word from an unordered bag of the preceding tokens is often feasible.
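For context, RoPE rotates each (even, odd) pair of query/key features by an angle that grows with the token's position, so the dot products used in attention depend on the relative offset between tokens. A minimal sketch (the function name and pairing layout are illustrative; the repo's implementation may differ):

```python
import torch


def apply_rope(x: torch.Tensor, theta: float = 10_000.0) -> torch.Tensor:
    """Rotate feature pairs of x (shape: ..., seq_len, d_head) by position-dependent angles.

    Pairs dimension 2i with 2i+1 and rotates them by pos / theta**(2i / d_head), so
    dot products between rotated queries and keys depend on relative position only.
    """
    *_, seq_len, d_head = x.shape
    half = d_head // 2
    inv_freq = theta ** (-torch.arange(0, half, dtype=torch.float32) / half)           # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # even / odd feature pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```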
We also compared feed-forward activations at matched parameter counts: SwiGLU vs. SiLU vs. GeGLU vs. GeLU.
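For reference, a gated feed-forward layer (SwiGLU/GeGLU) adds a third projection, so matching parameter counts means shrinking its hidden size to roughly 2/3 of the ungated one. A sketch of the four variants under that assumption (the class names are illustrative, not the repo's):

```python
import torch.nn as nn
import torch.nn.functional as F


class GatedFFN(nn.Module):
    """Gated feed-forward: W2(act(W1 x) * W3 x). SwiGLU when act=silu, GeGLU when act=gelu."""

    def __init__(self, d_model: int, d_ff: int, act=F.silu):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.act = act

    def forward(self, x):
        return self.w2(self.act(self.w1(x)) * self.w3(x))


class PlainFFN(nn.Module):
    """Ungated feed-forward: W2(act(W1 x)). SiLU or GeLU."""

    def __init__(self, d_model: int, d_ff: int, act=F.silu):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.act = act

    def forward(self, x):
        return self.w2(self.act(self.w1(x)))


d_model = 512
# A gated FFN has three d_model x d_ff matrices instead of two, so using 2/3 of the
# hidden size keeps parameters comparable:
#   gated: 3 * 512 * 1344 = 2,064,384   plain: 2 * 512 * 2016 = 2,064,384
swiglu   = GatedFFN(d_model, d_ff=1344, act=F.silu)
geglu    = GatedFFN(d_model, d_ff=1344, act=F.gelu)
silu_ffn = PlainFFN(d_model, d_ff=2016, act=F.silu)
gelu_ffn = PlainFFN(d_model, d_ff=2016, act=F.gelu)
```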

