Add non-CI model accuracy tests covering all supported architectures #1222
Open
brendanlong wants to merge 1 commit into TransformerLensOrg:dev from
Conversation
Tests all models in OFFICIAL_MODEL_NAMES (filtered by --max-model-gb) with three checks per model:

- Weights loaded correctly (no all-zero weight matrices)
- Forward pass logits match HuggingFace (softmax atol=5e-3)
- Weight processing (fold_ln, etc.) preserves model behavior (atol=1e-3)

Not run in CI; must be explicitly selected with -m model_accuracy. Intended for manual use when adding or modifying model support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
I was trying to add DeepSeek R1 models (#897) but wanted a way to be confident that they were configured correctly. The current tests cover only a subset of models, to avoid excessive runtime and exceeding disk and memory limits in CI, and they require per-model effort to add.
I added a new test that doesn't run in CI at all. It compares every supported model against the original HuggingFace model to ensure that our version, with pieces replaced (like hooked RMSNorm), produces the same logit output. It also checks that weight matrices aren't all-zero.
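The two checks can be sketched as follows. This is a minimal NumPy illustration of the comparison logic, not the actual test code; the function names are illustrative, and model loading (TransformerLens and HuggingFace) is omitted:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocab dimension
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def logits_match(hooked_logits, hf_logits, atol=5e-3):
    # Compare probabilities rather than raw logits: softmax cancels
    # constant per-position offsets, so only meaningful differences
    # between the two implementations show up
    return np.allclose(softmax(hooked_logits), softmax(hf_logits), atol=atol)

def no_all_zero_weights(weight_matrices):
    # Sanity check: a weight matrix that loaded as all zeros almost
    # certainly means the conversion mapped the wrong checkpoint key
    return all(np.abs(w).max() > 0 for w in weight_matrices)
```

Comparing after softmax (rather than raw logits) is why a uniform logit shift between implementations doesn't cause a spurious failure.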
I used this to test a WIP PR for the DeepSeek distills (brendanlong#7) and discovered the issues reported in #1221 while testing.
To run the tests you can run:
With optional arguments for the max memory size (default = 1 GB) and a model name filter, e.g., to test all Gemma models that will fit in 8 GB of memory:
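A sketch of the invocation, assuming standard pytest usage (`-m model_accuracy` and `--max-model-gb` come from the commit message; using pytest's built-in `-k` expression as the model-name filter is an assumption):

```shell
# Run all accuracy tests that fit in the default 1 GB limit
pytest -m model_accuracy

# Test all Gemma models that fit in 8 GB of memory
pytest -m model_accuracy --max-model-gb 8 -k gemma
```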
Type of change
Screenshots
If I intentionally break Gemma by removing the +1 in RMS norm (https://github.com/TransformerLensOrg/TransformerLens/blob/main/transformer_lens/pretrained/weight_conversions/gemma.py#L129):
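For context on what that +1 does: Gemma checkpoints store the RMSNorm scale as an offset from 1, so the weight conversion adds 1 back when loading. A numeric sketch of why dropping it is catastrophic (illustrative values, assuming a zero-initialized stored scale):

```python
import numpy as np

def rms_norm(x, w, eps=1e-6):
    # RMSNorm: divide by the root-mean-square of x, then scale by w
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * w

x = np.array([1.0, 2.0, 3.0])
stored_w = np.zeros(3)                 # Gemma checkpoints store (scale - 1)
correct = rms_norm(x, stored_w + 1.0)  # conversion restores the +1 offset
broken = rms_norm(x, stored_w)         # without it, the layer output collapses
```

With a stored scale of zero, the broken version zeroes the entire residual stream, which is exactly the kind of error the logit-comparison test catches.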
Checklist: