Language Model

[asciicast demo]

An end-to-end implementation of a language model from scratch, including tokenization and training.

Thanks to CS 336 for the setup and scaffolding.

Setup

Download the TinyStories dataset used to train the tokenizer and language model.

mkdir -p data
cd data

wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt -O train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt -O valid.txt

cd ..

Training

Tokenizer

To train the BPE tokenizer and encode the training/validation data:

uv run src/scripts/train_tokenizer.py --dataset=data/train.txt --vocab_size=10000 --out=data/tokenizer
# note: encoding may take a few minutes (~2GB of text)
uv run src/scripts/encode.py --tokenizer=data/tokenizer --file=data/train.txt --out=data/train.npy --processes=16
uv run src/scripts/encode.py --tokenizer=data/tokenizer --file=data/valid.txt --out=data/valid.npy --processes=16

The trained tokenizer will be saved to the output directory in two files: merges.txt and vocab.json.
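
For reference, here is a minimal sketch of how the two saved files can be applied at encode time. This is not the repo's Tokenizer API, and the file formats (vocab.json mapping token string to id, merges.txt listing one space-separated merge per line in merge order) are assumptions:

# Hypothetical usage sketch, not this repo's API.
import json

def load_bpe(tokenizer_dir):
    with open(f"{tokenizer_dir}/vocab.json") as f:
        vocab = json.load(f)                                   # token string -> id (assumed)
    with open(f"{tokenizer_dir}/merges.txt") as f:
        merges = [tuple(line.split()) for line in f if line.strip()]
    ranks = {pair: i for i, pair in enumerate(merges)}         # earlier merge = higher priority
    return vocab, ranks

def bpe_encode_word(symbols, vocab, ranks):
    # Greedily apply the highest-priority known merge until no mergeable pair remains.
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break
        i = pairs.index(best)
        symbols[i:i + 2] = [best[0] + best[1]]
    return [vocab[s] for s in symbols]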

Model

To train the language model:

uv run src/scripts/train_model.py \
    --train-dataset=data/train.npy \
    --val-dataset=data/valid.npy \
    --lr=0.001 \
    --batch-size=32 \
    --checkpoint-dir=data/checkpoints

The model architecture uses pre-norm with RMSNorm, rotary positional embeddings (RoPE), SwiGLU activations, and multi-head attention (MHA).
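
A minimal sketch of one such block follows. This is not the repo's exact code: it borrows torch.nn.MultiheadAttention for brevity, and RoPE (which the real model applies to the queries and keys inside attention) is omitted here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the features (no mean subtraction).
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class PreNormBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=16, d_ff=1344):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, bias=False, batch_first=True)
        self.ffn_norm = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x, causal_mask=None):
        # Pre-norm residual layout: normalize, transform, then add back to the stream.
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        return x + self.ffn(self.ffn_norm(x))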

The model uses the following default configuration (~23M parameters in total; a rough count is sketched after the list):

  • d_model: 512
  • num_heads: 16
  • d_ff: 1344
  • num_layers: 4
  • context_length: 256
  • RoPE with theta=10,000
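
As a sanity check, a back-of-the-envelope parameter count for these defaults, assuming a vocab size of 10,000 from the tokenizer step, an untied embedding and output head, and no bias terms:

vocab_size, d_model, d_ff, num_layers = 10_000, 512, 1344, 4

embedding = vocab_size * d_model              # token embedding table
lm_head = vocab_size * d_model                # output projection (assumed untied)
attn = 4 * d_model * d_model                  # Q, K, V, O projections, no biases
ffn = 3 * d_model * d_ff                      # SwiGLU: gate, up, down
norms = 2 * d_model                           # two RMSNorm gains per layer

per_layer = attn + ffn + norms
total = embedding + lm_head + num_layers * per_layer + d_model   # + final RMSNorm
print(f"{total / 1e6:.1f}M")                  # -> 22.7M, consistent with ~23M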

Training uses AdamW with linear warm-up and cosine learning-rate annealing. Batches are sampled at random from the encoded training tokens.
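
A sketch of those two pieces: the warm-up-then-cosine schedule and random batch sampling from the flat array of encoded token ids. The step counts and min/max learning rates here are illustrative assumptions, not the script's defaults.

import math
import numpy as np
import torch

def lr_at_step(step, max_lr=1e-3, min_lr=1e-4, warmup_steps=1_000, total_steps=20_000):
    # Linear warm-up to max_lr, then cosine annealing down to min_lr.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def get_batch(tokens, batch_size=32, context_length=256, device="cpu"):
    # tokens: 1-D array of token ids, e.g. np.load("data/train.npy").
    starts = np.random.randint(0, len(tokens) - context_length - 1, size=batch_size)
    x = np.stack([tokens[s:s + context_length] for s in starts]).astype(np.int64)
    y = np.stack([tokens[s + 1:s + context_length + 1] for s in starts]).astype(np.int64)
    return torch.from_numpy(x).to(device), torch.from_numpy(y).to(device)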

To see all available options, run uv run src/scripts/train_model.py --help.

Inference

To run inference interactively:

uv run src/scripts/inference.py --checkpoint=path/to/ckpt.pt

Or, on a single prompt:

uv run src/scripts/inference.py --checkpoint=path/to/ckpt.pt --prompt="There was once a"
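
Under the hood this is a standard autoregressive sampling loop. A hedged sketch (the model/tokenizer interfaces and sampling options shown here are assumptions, not the script's actual flags):

import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=200, temperature=1.0, context_length=256):
    # `model` maps (1, T) token ids to (1, T, vocab_size) logits; `tokenizer`
    # provides encode/decode. Both interfaces are assumptions for illustration.
    ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_length:])               # crop to the context window
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)      # sample rather than argmax
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())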

Experiments

All experiments were done on an 11-core M3 Pro with 18 GB RAM.

Learning Rate

[Figure: learning rate sweep]

The training loss starts to diverge at around lr=1e-2. The optimal learning rate seems to be around 1e-3: lr=3e-3 struggles to converge, and lr=3e-4 converges slightly more slowly.

RMSNorm

Without RMSNorm, the loss explodes after a few hundred steps:

[Figure: loss without RMSNorm]

TODO: compare pre-norm vs post-norm.

Positional Embeddings

[Figure: training loss with and without RoPE]

We compared the performance of a model with and without RoPE. Without any positional embeddings, the model cannot identify the exact position of a token, nor the distance between two tokens.

Surprisingly, the difference in performance is quite small. This is likely because the TinyStories dataset is extremely simple, so the next word can often be predicted from an unordered bag of the preceding tokens.
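
For context, RoPE rotates each (even, odd) pair of query/key dimensions by an angle that grows with the token's position, so attention scores depend on relative offsets between tokens. A minimal sketch (not the repo's implementation):

import torch

def apply_rope(x, theta=10_000.0):
    # x: (batch, seq_len, num_heads, head_dim) with even head_dim.
    b, t, h, d = x.shape
    freqs = theta ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # (d/2,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None]  # (t, d/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)                                            # back to (b, t, h, d)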

Activation Functions

[Figure: SwiGLU vs SiLU vs GeGLU vs GeLU at the same parameter count]
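
The "same parameter count" comparison presumably follows the usual convention of shrinking d_ff for the gated variants, since a gated FFN has three weight matrices instead of two. A sketch of both shapes (an illustration of that convention, not the experiment's exact code):

import torch.nn as nn
import torch.nn.functional as F

class PlainFFN(nn.Module):
    # SiLU / GeLU: 2 * d_model * d_ff parameters.
    def __init__(self, d_model, d_ff, act=F.silu):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.act = act

    def forward(self, x):
        return self.down(self.act(self.up(x)))

class GatedFFN(nn.Module):
    # SwiGLU / GeGLU: 3 * d_model * d_ff parameters, so d_ff is scaled by ~2/3
    # (e.g. 2048 -> 1344 after rounding to a multiple of 64) to match PlainFFN.
    def __init__(self, d_model, d_ff, act=F.silu):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.act = act

    def forward(self, x):
        return self.down(self.act(self.gate(x)) * self.up(x))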
