Add non-CI model accuracy tests covering all supported architectures #1222

Open
brendanlong wants to merge 1 commit into TransformerLensOrg:dev from brendanlong:brendanlong/all-model-accuracy-tests

Conversation


@brendanlong brendanlong commented Mar 29, 2026

Description

I was trying to add Deepseek R1 models (#897) but wanted a way to be confident that they were configured correctly. The current tests cover only a subset of models, to avoid excessive runtime and exceeding disk and memory limits in CI, and they require per-model effort to add.

I added a new test that doesn't run in CI at all. It compares every supported model against the original model to ensure that our version, with pieces replaced (like hooked RMSNorm), produces the same logit output. It also ensures that weight matrices aren't all-zero.

I used this to test a WIP PR for Deepseek distills (brendanlong#7) and discovered the issues in #1221 when testing.

To run the tests you can run:

poetry run pytest tests/acceptance/test_model_accuracy.py -m model_accuracy

There are optional arguments for the maximum model size (default = 1 GB) and a model-name filter. For example, to test all Gemma models that will fit in 8 GB of memory:

poetry run pytest tests/acceptance/test_model_accuracy.py -m model_accuracy -k "gemma" --max-model-gb 8
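For readers unfamiliar with custom pytest flags: an option like `--max-model-gb` is typically registered via the `pytest_addoption` hook in a `conftest.py`. This is an illustrative sketch of that mechanism, not necessarily the PR's actual implementation:

```python
# Hedged sketch: how a --max-model-gb flag could be registered in conftest.py.
# The option name matches the PR's CLI usage; the default mirrors the stated
# 1 GB default. The help text here is illustrative.
def pytest_addoption(parser):
    parser.addoption(
        "--max-model-gb",
        type=float,
        default=1.0,
        help="Skip models whose estimated memory footprint exceeds this size (GB)",
    )
```

Tests can then read the value with `request.config.getoption("--max-model-gb")` and skip oversized models accordingly.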

Type of change

  • New feature (non-breaking change which adds functionality)

Screenshots

If I intentionally break Gemma by removing the +1 in RMS norm (https://github.com/TransformerLensOrg/TransformerLens/blob/main/transformer_lens/pretrained/weight_conversions/gemma.py#L129):

Screenshot From 2026-03-28 19-10-43

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

Tests all models in OFFICIAL_MODEL_NAMES (filtered by --max-model-gb)
with three checks per model:
- Weights loaded correctly (no all-zero weight matrices)
- Forward pass logits match HuggingFace (softmax atol=5e-3)
- Weight processing (fold_ln, etc.) preserves model behavior (atol=1e-3)

Not run in CI; must be explicitly selected with -m model_accuracy.
Intended for manual use when adding or modifying model support.
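The three checks above amount to simple tensor comparisons. Here is a minimal sketch of the core logic, with hypothetical helper names (the actual harness lives in tests/acceptance/test_model_accuracy.py and may differ):

```python
# Hedged sketch of the per-model checks; function names are illustrative,
# not the PR's actual API.
import torch

def has_no_zero_weight_matrices(state_dict: dict) -> bool:
    """Check 1: no weight matrix in the state dict is all zeros."""
    return all(
        bool((w != 0).any())
        for w in state_dict.values()
        if w.ndim >= 2  # only matrices; skip biases and scalar params
    )

def probs_match(logits_a: torch.Tensor, logits_b: torch.Tensor, atol: float) -> bool:
    """Checks 2 and 3: post-softmax probabilities agree within atol."""
    return torch.allclose(
        torch.softmax(logits_a, dim=-1),
        torch.softmax(logits_b, dim=-1),
        atol=atol,
    )
```

Comparing softmax'd probabilities rather than raw logits makes the check invariant to constant logit offsets, which is why a looser atol like 5e-3 suffices for the forward-pass comparison.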

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@brendanlong brendanlong changed the title Add model accuracy tests covering all supported architectures Add non-CI model accuracy tests covering all supported architectures Mar 29, 2026
