[Bug/Feature]: Improve the MSE observer, investigate regression #2094

@kylesayrs

Description

Background

Calibration is the process of computing quantization scales and zero points from observed values, either weights or activations. Weights and activations are observed using an Observer, which tracks the min and max of the observed values.
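
For illustration, an asymmetric integer scheme derives its scale and zero point from the observed range roughly as follows (a minimal sketch; the exact clamping and rounding details in compressed-tensors may differ):

import torch

def calculate_qparams(min_val: torch.Tensor, max_val: torch.Tensor, num_bits: int = 8):
    quant_min, quant_max = 0, 2**num_bits - 1
    # Extend the range to include zero so that zero is exactly representable
    min_val = torch.clamp(min_val, max=0.0)
    max_val = torch.clamp(max_val, min=0.0)
    # Map [min_val, max_val] onto the integer range [quant_min, quant_max]
    scale = (max_val - min_val) / (quant_max - quant_min)
    zero_point = torch.round(quant_min - min_val / scale).clamp(quant_min, quant_max)
    return scale, zero_point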

As of now there are five observers: MemorylessMinMaxObserver, StaticMinMaxObserver, MinMaxObserver (i.e., MovingAverageMinMaxObserver), MemorylessMSEObserver, and MovingAverageMSEObserver. This naming is partially an artifact of backwards compatibility. The two observers of interest for this ticket are MemorylessMinMaxObserver and MemorylessMSEObserver.

In theory, the MSE observer should provide better accuracy recovery than the MinMax observer, because it grid-searches over candidate quantization ranges for the parameter values that minimize quantization error rather than simply taking the observed extremes. However, this is not what is observed in practice: the MSE observer currently performs worse than the MinMax observer.
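
Conceptually, the MSE observer shrinks the observed range by a series of candidate factors, fake-quantizes the tensor with each candidate range, and keeps the range with the lowest mean squared reconstruction error. A minimal sketch of that search, assuming a simple asymmetric scheme (function names here are illustrative, not the actual LLM Compressor API):

import torch

def mse_search(x: torch.Tensor, num_bits: int = 8, steps: int = 100):
    quant_min, quant_max = 0, 2**num_bits - 1
    obs_min, obs_max = x.min(), x.max()
    best_err, best_range = float("inf"), (obs_min, obs_max)
    for i in range(1, steps + 1):
        # Candidate range: the observed range shrunk by a factor in (0, 1]
        shrink = i / steps
        min_val, max_val = obs_min * shrink, obs_max * shrink
        scale = (max_val - min_val) / (quant_max - quant_min)
        zero_point = torch.round(quant_min - min_val / scale)
        # Fake quantize: quantize then dequantize with the candidate qparams
        q = torch.clamp(torch.round(x / scale) + zero_point, quant_min, quant_max)
        x_hat = (q - zero_point) * scale
        err = torch.mean((x - x_hat) ** 2).item()
        if err < best_err:
            best_err, best_range = err, (min_val, max_val)
    return best_range  # the (min, max) with the lowest reconstruction MSE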

Llama-3.1-8B-Instruct, gsm8k_platinum (strict-match):

                                 MinMax    MSE
Before Refactor (October 14th)   0.7767    0.7171
After Refactor                   0.7767    0.7171

At some point in the past, the MSE observer consistently performed better than MinMax. However, something has changed in the past few months, and that is no longer the case.

Purpose

  • Improve accuracy recovery of the MSE observer such that recovery is equal to or better than the MinMax observer

Suggested Steps

  • Replicate the eval regression using Llama 3b-instruct
  • Write a test that checks whether the MSE observer quantizes an arbitrary random tensor at least as well as MinMax (see the sketch after this list)
  • Make improvements to the MSE observer, thereby improving the achievable accuracy recovery of LLM Compressor as a whole
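
For the test in the second step, something along these lines could work (a sketch: quant_error is a stand-in helper, and mse_observer_range is a hypothetical hook that returns the range found by MemorylessMSEObserver):

import torch

def quant_error(x: torch.Tensor, min_val, max_val, num_bits: int = 8) -> float:
    # MSE between x and its fake-quantized reconstruction over [min_val, max_val]
    quant_min, quant_max = 0, 2**num_bits - 1
    scale = (max_val - min_val) / (quant_max - quant_min)
    zero_point = torch.round(quant_min - min_val / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, quant_min, quant_max)
    return torch.mean((x - (q - zero_point) * scale) ** 2).item()

def test_mse_observer_beats_minmax():
    torch.manual_seed(0)
    for _ in range(10):
        x = torch.randn(512, 512)
        # mse_observer_range is a hypothetical hook; swap in the real
        # MemorylessMSEObserver when wiring this into the test suite
        mse_min, mse_max = mse_observer_range(x)
        assert quant_error(x, mse_min, mse_max) <= quant_error(x, x.min(), x.max())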

Helper Scripts

eval.py
import sys
import lm_eval

model_id = sys.argv[1]

print(model_id)
results = lm_eval.simple_evaluate(
    model="hf",
    model_args={
        "pretrained": model_id,
        "add_bos_token": False,
        "dtype": "auto",
        "device_map": "cuda",
        # "max_length": 128000,
    },
    device="cuda",

    #tasks=["gsm8k_platinum", "mmlu_llama", "longbench2_single"],
    tasks=["gsm8k_platinum"],
    batch_size=64,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(model_id)
print(lm_eval.utils.make_table(results))
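
The script takes the model path or Hugging Face model id as its only argument, e.g. python eval.py <model_id> (any locally saved quantized checkpoint works).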

llama3_example.py (I suggest testing with a smaller model)

Labels

  • bug: Something isn't working
  • enhancement: New feature or request
  • good first issue: A good first issue for users wanting to contribute
  • nvfp4: For any PR / issue related to NVFP4 support
  • wNa16: Anything related to weight-only int-quantized support
