Add non-CI model accuracy tests covering all supported architectures #1222
Open
brendanlong wants to merge 1 commit into TransformerLensOrg:dev from
Conversation
Tests all models in OFFICIAL_MODEL_NAMES (filtered by --max-model-gb) with three checks per model:

- Weights loaded correctly (no all-zero weight matrices)
- Forward pass logits match HuggingFace (softmax atol=5e-3)
- Weight processing (fold_ln, etc.) preserves model behavior (atol=1e-3)

Not run in CI; must be explicitly selected with -m model_accuracy. Intended for manual use when adding or modifying model support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
I was trying to add DeepSeek R1 models (#897) but wanted a way to be confident that they were configured correctly. The current tests cover only a subset of models, to avoid excessive runtime and exceeding disk and memory limits in CI, and they require per-model effort to add.
I added a new test that doesn't run in CI at all. It compares every supported model against the original HuggingFace model to ensure that our version, with pieces replaced (like hooked RMSNorm), produces the same logit output. It also checks that weight matrices aren't all-zero.
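The two checks can be sketched as follows. This is a minimal NumPy illustration of the comparison logic, not the actual test code; the function names are illustrative, and model loading (TransformerLens and HuggingFace) is omitted:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocab dimension
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def logits_match(hooked_logits, hf_logits, atol=5e-3):
    # Compare probabilities rather than raw logits: softmax cancels
    # constant per-position offsets, so only meaningful differences
    # between the two implementations show up
    return np.allclose(softmax(hooked_logits), softmax(hf_logits), atol=atol)

def no_all_zero_weights(weight_matrices):
    # Sanity check: a weight matrix that loaded as all zeros almost
    # certainly means the conversion mapped the wrong checkpoint key
    return all(np.abs(w).max() > 0 for w in weight_matrices)
```

Comparing after softmax (rather than raw logits) is why a uniform logit shift between implementations doesn't cause a spurious failure.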
I used this to test a WIP PR for the DeepSeek distills (brendanlong#7) and discovered the issues reported in #1221 while testing.
To run the tests you can run:
With optional arguments for the max memory size (default = 1 GB) and a model name filter, e.g., to test all Gemma models that will fit in 8 GB of memory:
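A sketch of the invocation, assuming standard pytest usage (`-m model_accuracy` and `--max-model-gb` come from the commit message; using pytest's built-in `-k` expression as the model-name filter is an assumption):

```shell
# Run all accuracy tests that fit in the default 1 GB limit
pytest -m model_accuracy

# Test all Gemma models that fit in 8 GB of memory
pytest -m model_accuracy --max-model-gb 8 -k gemma
```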
Type of change
Screenshots
If I intentionally break Gemma by removing the +1 in RMS norm (https://github.com/TransformerLensOrg/TransformerLens/blob/main/transformer_lens/pretrained/weight_conversions/gemma.py#L129):
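For context on what that +1 does: Gemma checkpoints store the RMSNorm scale as an offset from 1, so the weight conversion adds 1 back when loading. A numeric sketch of why dropping it is catastrophic (illustrative values, assuming a zero-initialized stored scale):

```python
import numpy as np

def rms_norm(x, w, eps=1e-6):
    # RMSNorm: divide by the root-mean-square of x, then scale by w
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * w

x = np.array([1.0, 2.0, 3.0])
stored_w = np.zeros(3)                 # Gemma checkpoints store (scale - 1)
correct = rms_norm(x, stored_w + 1.0)  # conversion restores the +1 offset
broken = rms_norm(x, stored_w)         # without it, the layer output collapses
```

With a stored scale of zero, the broken version zeroes the entire residual stream, which is exactly the kind of error the logit-comparison test catches.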
Checklist: