Commit 3ffb650

update readme and fix CI
Signed-off-by: He, Xin3 <[email protected]>
1 parent: 99b8fff

File tree

- examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md
- neural_compressor/common/base_config.py
- neural_compressor/torch/algorithms/weight_only/autoround.py

3 files changed: +6 -9 lines changed

examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md

Lines changed: 4 additions & 5 deletions
@@ -8,7 +8,7 @@ In this example, you can verify the accuracy on HPU/CUDA device with emulation o
 # neural-compressor-pt
 pip install neural-compressor-pt==3.7
 # auto-round
-pip install auto-round==0.9.1
+pip install auto-round==0.9.2
 # other requirements
 pip install -r requirements.txt
 ```
@@ -19,7 +19,7 @@ pip install -r requirements.txt
 # neural-compressor-pt
 INC_PT_ONLY=1 pip install git+https://github.com/intel/neural-compressor.git@master
 # auto-round
-pip install git+https://github.com/intel/auto-round.git@main
+pip install git+https://github.com/intel/auto-round.git@more-ar-ext
 # other requirements
 pip install -r requirements.txt
 ```
@@ -44,7 +44,7 @@ CUDA_VISIBLE_DEVICES=0 python quantize.py \
 ```

 Notes:
-- Use `--export_format auto_round` for `MXFP4`, `MXFP8` data type and do inference as [below](#mxfp4--mxfp8)
+- Use `--export_format auto_round` for `MXFP4`, `MXFP8` data type and do inference as below.
 - Use `--export_format llm_compressor` for `NVFP4` data type since public vLLM supports it.
 - Use `--export_format fake` for `uNVFP4` data type since it's not fully supported.
 - Setting `--quant_lm_head` applies `--dtype` for the lm_head layer.
@@ -87,7 +87,6 @@ AutoRound helps improve the accuracy, `iters` and `nsamples` is higher than defa
 CUDA_VISIBLE_DEVICES=0 bash run_quant.sh --topology=Llama-3.1-8B --dtype=mxfp8 --input_model=/models/Meta-Llama-3.1-8B-Instruct --output_model=Llama-3.1-8B-MXFP8
 ```

-
 #### Llama 3.1 8B MXFP4 (Mixed with MXFP8, Target_bits=7.8)

 ```bash
@@ -119,7 +118,7 @@ Note: If you got OOM issue, either increasing `CUDA_VISIBLE_DEVICES` or reducing

 ## Inference

-### MXFP4 / MXFP8
+### MXFP4 & MXFP8

 - Both pure MXFP4/MXFP8 and mix-precision model generated by target bits are supported.

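The README's Notes point at running inference on a checkpoint exported with `--export_format auto_round`. As a rough, non-authoritative sketch of what that step typically looks like with the standard transformers API (the directory name `Llama-3.1-8B-MXFP8` is just the hypothetical `--output_model` path from the command above, and an `auto_round` export may need additional loading arguments):

```python
# Hedged sketch only: assumes the quantized checkpoint in "Llama-3.1-8B-MXFP8"
# (hypothetical path) can be loaded through the standard transformers AutoModel
# API; auto_round exports may require extra kwargs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "Llama-3.1-8B-MXFP8"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto")

prompt = "The key benefit of MXFP8 quantization is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```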

neural_compressor/common/base_config.py

Lines changed: 1 addition & 0 deletions
@@ -190,6 +190,7 @@ class BaseConfig(ABC):
     name = BASE_CONFIG
     params_list = []
     _is_initialized = False
+    non_tunable_params = ["white_list"]

     def __init__(self, white_list: Optional[List[OP_NAME_OR_MODULE_TYPE]] = DEFAULT_WHITE_LIST) -> None:
         """Initialize the BaseConfig.

neural_compressor/torch/algorithms/weight_only/autoround.py

Lines changed: 1 addition & 4 deletions
@@ -370,21 +370,18 @@ def convert(self, model: torch.nn.Module, *args, **kwargs):
         model = rounder.model
         model.autoround_config = rounder.layer_config

+        self.accelerator.empty_cache()
         dump_model_op_stats(rounder.layer_config)

         if self.export_format in ["auto_round", "llm_compressor"]:
             # the directly returned model is QuantLinear, which is used for packing.
             try:
-                del model
-                self.accelerator.empty_cache()
                 logger.info(f"Quantization is done, reloading model from saved directory({self.output_dir})...")
                 import transformers  # pylint: disable=E0401

                 model = transformers.AutoModelForCausalLM.from_pretrained(self.output_dir)
             except:
                 pass
-        else:
-            self.accelerator.empty_cache()

         return model

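This diff consolidates the cleanup: the accelerator cache is emptied once, right after quantization, instead of separately in the reload and fallback branches. A rough sketch of the underlying free-then-reload pattern, using the public `torch.cuda.empty_cache()` in place of the internal `accelerator.empty_cache()` helper and a hypothetical output directory:

```python
# Sketch under stated assumptions: torch.cuda.empty_cache() stands in for the
# internal accelerator wrapper, and "quantized_out" is a hypothetical directory
# that quantization has already written a loadable checkpoint to.
import gc

import torch
from transformers import AutoModelForCausalLM


def reload_packed_model(output_dir: str = "quantized_out"):
    # Release Python references and cached device blocks before materializing a
    # second model from disk, which keeps peak memory lower during the reload.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return AutoModelForCausalLM.from_pretrained(output_dir)
```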
