## Background
Calibration is the process of computing quantization scales and zero points from observed values, either weights or activations. Weights and activations are observed using an Observer, which tracks the min and max of the observed values.
As of now there are five observers: MemorylessMinMaxObserver, StaticMinMaxObserver, MinMaxObserver (i.e., MovingAverageMinMaxObserver), MemorylessMSEObserver, and MovingAverageMSEObserver. This naming is partially an artifact of backwards compatibility. The two observers we are interested in for this ticket are MemorylessMinMaxObserver and MemorylessMSEObserver.
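For context, a MinMax-style observer derives asymmetric qparams directly from the observed range. A minimal standalone sketch of the general idea (the helper name is hypothetical; this is not LLM Compressor's actual implementation):

```python
import torch

def minmax_qparams(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric scale/zero-point computed straight from the observed min/max."""
    qmin, qmax = 0, 2**num_bits - 1
    min_val = x.min().clamp(max=0.0)  # the quantization range must contain zero
    max_val = x.max().clamp(min=0.0)
    scale = ((max_val - min_val) / (qmax - qmin)).clamp(min=1e-8)
    zero_point = torch.clamp(qmin - torch.round(min_val / scale), qmin, qmax)
    return scale, zero_point
```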
Traditionally and in theory, the MSE observer should provide better accuracy recovery than the MinMax observer, because it performs a grid search over candidate qparams to find the values that minimize quantization error (a sketch of this technique appears below the table). However, this is not what is observed in practice: the MSE observer currently performs worse than the MinMax observer.
| Llama-3.1-8B-Instruct, gsm8k_platinum (strict-match) | MinMax | MSE |
|---|---|---|
| Before Refactor (October 14th) | 0.7767 | 0.7171 |
| After Refactor | 0.7767 | 0.7171 |
At some point in the past, the MSE observer consistently performed better than MinMax. However, something has changed in the past few months which means that is no longer the case.
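For reference, here is a minimal, hedged sketch of the grid-search technique described above. All names are hypothetical and standalone; this is not the actual MemorylessMSEObserver implementation. Each candidate shrinks the observed range by a fraction and keeps whichever qparams minimize reconstruction error:

```python
import torch

def fake_quant(x, scale, zero_point, qmin, qmax):
    """Round-trip quantize/dequantize so we can measure reconstruction error."""
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def mse_qparams(x: torch.Tensor, num_bits: int = 8, steps: int = 100):
    """Grid-search shrink factors of the observed range, keeping the
    candidate qparams with the lowest quantization MSE."""
    qmin, qmax = 0, 2**num_bits - 1
    min_val = x.min().clamp(max=0.0)
    max_val = x.max().clamp(min=0.0)
    best_err, best = float("inf"), None
    for i in range(1, steps + 1):
        frac = i / steps  # candidate fraction of the full observed range
        lo, hi = min_val * frac, max_val * frac
        scale = ((hi - lo) / (qmax - qmin)).clamp(min=1e-8)
        zero_point = torch.clamp(qmin - torch.round(lo / scale), qmin, qmax)
        err = (fake_quant(x, scale, zero_point, qmin, qmax) - x).pow(2).mean().item()
        if err < best_err:
            best_err, best = err, (scale, zero_point)
    return best
```

Note that the full range (frac = 1.0) is itself one of the candidates, so on the calibration data this construction can never do worse than plain min/max, which is part of what makes the observed regression surprising.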
## Purpose
- Improve the accuracy recovery of the MSE observer such that its recovery is equal to or better than the MinMax observer's
## Suggested Steps
- Replicate the eval regression using Llama 3B Instruct
- Write a test checking that the MSE observer quantizes an arbitrary random tensor with error equal to or better than MinMax (see the sketch after this list)
- Make improvements to the MSE observer, thereby improving the achievable accuracy recovery of LLM Compressor as a whole
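For the second step, a hedged sketch of what such a test could look like, built on the hypothetical minmax_qparams / mse_qparams / fake_quant helpers sketched above (a real test would instead import the observers from LLM Compressor):

```python
import pytest
import torch

@pytest.mark.parametrize("seed", range(5))
def test_mse_at_least_as_good_as_minmax(seed):
    torch.manual_seed(seed)
    x = torch.randn(512, 512)
    qmin, qmax = 0, 255

    s_mm, zp_mm = minmax_qparams(x)
    s_mse, zp_mse = mse_qparams(x)

    err_mm = (fake_quant(x, s_mm, zp_mm, qmin, qmax) - x).pow(2).mean()
    err_mse = (fake_quant(x, s_mse, zp_mse, qmin, qmax) - x).pow(2).mean()

    # The grid search includes the full min/max range as a candidate,
    # so it should never lose to plain min/max on the observed tensor.
    assert err_mse <= err_mm + 1e-8
```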
## Helper Scripts
### eval.py
```python
# Usage: python eval.py <model_id_or_path>
import sys

import lm_eval

model_id = sys.argv[1]
print(model_id)

results = lm_eval.simple_evaluate(
    # Evaluate an HF-serialized model checkpoint
    model="hf",
    model_args={
        "pretrained": model_id,
        "add_bos_token": False,
        "dtype": "auto",
        "device_map": "cuda",
        # "max_length": 128000,
    },
    device="cuda",
    # Other tasks of interest: "mmlu_llama", "longbench2_single"
    tasks=["gsm8k_platinum"],
    batch_size=64,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)

print(model_id)
print(lm_eval.utils.make_table(results))
```

### llama3_example.py

(I suggest testing with a smaller model)