## Background
Calibration is the process of computing quantization scales and zero points from observed values, either weights or activations. Weights and activations are observed using an Observer, which tracks the min and max of the observed values.
As of now there are five observers: MemorylessMinMaxObserver, StaticMinMaxObserver, MinMaxObserver (i.e., MovingAverageMinMaxObserver), MemorylessMSEObserver, and MovingAverageMSEObserver. This naming is partially an artifact of backwards compatibility. The two observers we are interested in for this ticket are MemorylessMinMaxObserver and MemorylessMSEObserver.
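For context, a MinMax-style observer derives asymmetric qparams directly from the observed range. A minimal standalone sketch of the general idea (the helper name is hypothetical; this is not LLM Compressor's actual implementation):

```python
import torch

def minmax_qparams(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric scale/zero-point computed straight from the observed min/max."""
    qmin, qmax = 0, 2**num_bits - 1
    min_val = x.min().clamp(max=0.0)  # the quantization range must contain zero
    max_val = x.max().clamp(min=0.0)
    scale = ((max_val - min_val) / (qmax - qmin)).clamp(min=1e-8)
    zero_point = torch.clamp(qmin - torch.round(min_val / scale), qmin, qmax)
    return scale, zero_point
```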
Traditionally and in theory, the MSE observer should provide better accuracy recovery than the MinMax observer, because it performs a grid search over candidate qparams to find the values that minimize quantization error (a sketch of this technique appears below the table). However, this is not what is observed in practice: the MSE observer currently performs worse than the MinMax observer.
| Llama-3.1-8B-Instruct, gsm8k_platinum (strict-match) | MinMax | MSE |
|---|---|---|
| Before Refactor (October 14th) | 0.7767 | 0.7171 |
| After Refactor | 0.7767 | 0.7171 |
At some point in the past, the MSE observer consistently performed better than MinMax. However, something has changed in the past few months which means that is no longer the case.
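For reference, here is a minimal, hedged sketch of the grid-search technique described above. All names are hypothetical and standalone; this is not the actual MemorylessMSEObserver implementation. Each candidate shrinks the observed range by a fraction and keeps whichever qparams minimize reconstruction error:

```python
import torch

def fake_quant(x, scale, zero_point, qmin, qmax):
    """Round-trip quantize/dequantize so we can measure reconstruction error."""
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def mse_qparams(x: torch.Tensor, num_bits: int = 8, steps: int = 100):
    """Grid-search shrink factors of the observed range, keeping the
    candidate qparams with the lowest quantization MSE."""
    qmin, qmax = 0, 2**num_bits - 1
    min_val = x.min().clamp(max=0.0)
    max_val = x.max().clamp(min=0.0)
    best_err, best = float("inf"), None
    for i in range(1, steps + 1):
        frac = i / steps  # candidate fraction of the full observed range
        lo, hi = min_val * frac, max_val * frac
        scale = ((hi - lo) / (qmax - qmin)).clamp(min=1e-8)
        zero_point = torch.clamp(qmin - torch.round(lo / scale), qmin, qmax)
        err = (fake_quant(x, scale, zero_point, qmin, qmax) - x).pow(2).mean().item()
        if err < best_err:
            best_err, best = err, (scale, zero_point)
    return best
```

Note that the full range (frac = 1.0) is itself one of the candidates, so on the calibration data this construction can never do worse than plain min/max, which is part of what makes the observed regression surprising.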
## Purpose
- Improve the accuracy recovery of the MSE observer such that its recovery is equal to or better than the MinMax observer's
## Suggested Steps
- Replicate the eval regression using Llama 3B Instruct
- Write a test checking that the MSE observer quantizes an arbitrary random tensor with error equal to or better than MinMax (see the sketch after this list)
- Make improvements to the MSE observer, thereby improving the achievable accuracy recovery of LLM Compressor as a whole
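For the second step, a hedged sketch of what such a test could look like, built on the hypothetical minmax_qparams / mse_qparams / fake_quant helpers sketched above (a real test would instead import the observers from LLM Compressor):

```python
import pytest
import torch

@pytest.mark.parametrize("seed", range(5))
def test_mse_at_least_as_good_as_minmax(seed):
    torch.manual_seed(seed)
    x = torch.randn(512, 512)
    qmin, qmax = 0, 255

    s_mm, zp_mm = minmax_qparams(x)
    s_mse, zp_mse = mse_qparams(x)

    err_mm = (fake_quant(x, s_mm, zp_mm, qmin, qmax) - x).pow(2).mean()
    err_mse = (fake_quant(x, s_mse, zp_mse, qmin, qmax) - x).pow(2).mean()

    # The grid search includes the full min/max range as a candidate,
    # so it should never lose to plain min/max on the observed tensor.
    assert err_mse <= err_mm + 1e-8
```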
## Helper Scripts
### eval.py
```python
# Usage: python eval.py <model_id_or_path>
import sys

import lm_eval

model_id = sys.argv[1]
print(model_id)

results = lm_eval.simple_evaluate(
    # Evaluate an HF-serialized model checkpoint
    model="hf",
    model_args={
        "pretrained": model_id,
        "add_bos_token": False,
        "dtype": "auto",
        "device_map": "cuda",
        # "max_length": 128000,
    },
    device="cuda",
    # Other tasks of interest: "mmlu_llama", "longbench2_single"
    tasks=["gsm8k_platinum"],
    batch_size=64,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)

print(model_id)
print(lm_eval.utils.make_table(results))
```

### llama3_example.py

(I suggest testing with a smaller model)