[GGUF] Reduce peak RAM usage by casting dequantized tensors early during load#45386
Open
UsamaKenway wants to merge 3 commits into huggingface:main
Conversation
Signed-off-by: UsamaKenway <usamakenway@gmail.com>
Member
cc @SunMarc
SunMarc reviewed Apr 13, 2026
src/transformers/modeling_utils.py (Outdated), comment on lines 4092 to 4096
Member
can we move that above so that we don't have to replicate the dtype logic?
```diff
- parsed_parameters["tensors"][name] = torch.from_numpy(np.copy(weights))
+ tensor = torch.from_numpy(np.copy(weights))
+ if torch_dtype is not None and torch_dtype != torch.float32:
```
Member
do we really need the fp32 check?
Signed-off-by: Usama Kenway <usamakenway@gmail.com>
Optimizes memory usage when loading GGUF models by performing dtype casting immediately after dequantization.

While adding support for Gemma4 in PR #45296, I noticed that GGUF tensors are dequantized to `float32` by default during loading, even if the user intends to load the model in `float16` or `bfloat16`. For large models, this creates a significant RAM spike that can lead to out-of-memory errors. By passing the target `torch_dtype` directly into the loading utility, we can cast each tensor immediately after dequantization, effectively halving the peak RAM required for the state dict.

Benchmark Results (Gemma 4 26B IT q4_k_m)
I tested the peak RAM (Global Peak RSS) with and without this change using a separate branch for tracking:
| Tests | With the changes | Without the changes |
| --- | --- | --- |
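The idea behind the change can be sketched as follows. This is a minimal illustration, not the actual transformers implementation; `dequantize_and_cast` is a hypothetical helper name, and the real loader threads `torch_dtype` through its GGUF loading utility instead.

```python
import numpy as np
import torch

def dequantize_and_cast(weights: np.ndarray, torch_dtype=None) -> torch.Tensor:
    """Turn a dequantized GGUF weight array into a torch tensor.

    Casting to the target dtype right here, per tensor, means at most one
    float32 copy is alive at a time, instead of the whole float32 state
    dict being materialized before a single cast at the end.
    """
    tensor = torch.from_numpy(np.copy(weights))
    # Skip the cast when the target is already float32 to avoid a
    # redundant copy (the fp32 check discussed in the review above).
    if torch_dtype is not None and torch_dtype != torch.float32:
        tensor = tensor.to(torch_dtype)
    return tensor

# Example: a dequantized fp32 block cast to float16 at load time.
w = np.random.rand(4, 4).astype(np.float32)
t = dequantize_and_cast(w, torch_dtype=torch.float16)
```

With per-tensor casting, peak RAM for the state dict is roughly the half-precision total plus one in-flight float32 tensor, rather than the full float32 total.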
Code Agent Policy
Before submitting
Pull Request section?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.