
Commit 30e803d

xin3he and xinhe3 authored

fix bug and update readme (#2051)

* fix bug and update readme

---------

Signed-off-by: xinhe3 <[email protected]>
Co-authored-by: xinhe3 <[email protected]>

1 parent 7062eeb commit 30e803d

File tree

2 files changed: +16 −11 lines

  • examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only
  • neural_compressor/evaluation/lm_eval/models


examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/README.md

Lines changed: 14 additions & 10 deletions

````diff
@@ -37,7 +37,7 @@ Below is the current support status on Intel® Xeon® Scalable Processor with Py
 
 `run_clm_no_trainer.py` quantizes the large language models using the dataset [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) calibration and validates datasets accuracy provided by lm_eval, an example command is as follows.
 
-### Quantization
+### Quantization (CPU & HPU)
 
 ```bash
 python run_clm_no_trainer.py \
@@ -53,9 +53,10 @@ python run_clm_no_trainer.py \
 --gptq_use_max_length \
 --output_dir saved_results
 ```
-### Evaluation
 
-> Note: The SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false is an experimental flag which yields better performance for uint4, and it will be removed in a future release.
+> Note: `--gptq_actorder` is not supported by HPU.
+
+### Evaluation (CPU)
 
 ```bash
 # original model
@@ -65,30 +66,33 @@ python run_clm_no_trainer.py \
 --batch_size 8 \
 --tasks "lambada_openai"
 
-# quantized model
-SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=1 python run_clm_no_trainer.py \
+python run_clm_no_trainer.py \
 --model meta-llama/Llama-2-7b-hf \
 --accuracy \
 --batch_size 8 \
 --tasks "lambada_openai" \
 --load \
 --output_dir saved_results
-```
+```
 
-### Benchmark
+### Evaluation (HPU)
+
+> Note: The SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false is an experimental flag which yields better performance for uint4, and it will be removed in a future release.
 
 ```bash
 # original model
 python run_clm_no_trainer.py \
 --model meta-llama/Llama-2-7b-hf \
---performance \
---batch_size 8
+--accuracy \
+--batch_size 8 \
+--tasks "lambada_openai"
 
 # quantized model
 SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=1 python run_clm_no_trainer.py \
 --model meta-llama/Llama-2-7b-hf \
---performance \
+--accuracy \
 --batch_size 8 \
+--tasks "lambada_openai" \
 --load \
 --output_dir saved_results
 ```
````
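The HPU evaluation command in the README passes the experimental flags inline as environment variables. As a minimal sketch (not part of this commit), the same invocation can be driven from Python by populating the child process environment before launch; the flag names come from the README above, and whether they are honored depends on the installed Habana software stack:

```python
import os
import subprocess
import sys

# Sketch: set the experimental HPU flags in a copied environment instead of
# prefixing them on the shell command line (assumption: the Habana runtime
# reads them at process startup, as the README's inline usage implies).
env = os.environ.copy()
env["SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED"] = "false"
env["ENABLE_EXPERIMENTAL_FLAGS"] = "1"

# Same quantized-model accuracy run as the README's HPU example.
cmd = [
    sys.executable, "run_clm_no_trainer.py",
    "--model", "meta-llama/Llama-2-7b-hf",
    "--accuracy",
    "--batch_size", "8",
    "--tasks", "lambada_openai",
    "--load",
    "--output_dir", "saved_results",
]
# subprocess.run(cmd, env=env, check=True)  # uncomment to actually launch
```

This keeps the flag settings scoped to the child process rather than mutating the caller's own environment.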

neural_compressor/evaluation/lm_eval/models/huggingface.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -885,7 +885,8 @@ def find_bucket(self, length):
             exit(0)
         else:
             if self.last_bucket != suitable_buckets[0]:
-                self.model.clear_cache()  # clear graph cache to avoid OOM
+                if hasattr(self.model, "clear_cache"):
+                    self.model.clear_cache()  # clear HPU graph cache to avoid OOM
                 self.last_bucket = suitable_buckets[0]
         return self.last_bucket
```
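The fix above guards `clear_cache()` with `hasattr`, so models that do not expose the method (e.g. plain CPU models rather than HPU graph-compiled ones) no longer raise `AttributeError` when the bucket changes. A minimal sketch of the pattern, using hypothetical stand-in classes rather than the real model wrappers:

```python
# Sketch of the hasattr guard added in this commit. HpuModel and CpuModel are
# hypothetical stand-ins; only the guard pattern mirrors the real change.
class HpuModel:
    def __init__(self):
        self.cleared = 0

    def clear_cache(self):
        # Stand-in for HPU graph-cache eviction.
        self.cleared += 1


class CpuModel:
    pass  # no clear_cache attribute at all


def maybe_clear_cache(model):
    # Before the fix: model.clear_cache() was called unconditionally and
    # raised AttributeError for models without the method. After the fix:
    if hasattr(model, "clear_cache"):
        model.clear_cache()  # clear HPU graph cache to avoid OOM


hpu, cpu = HpuModel(), CpuModel()
maybe_clear_cache(hpu)  # cache cleared once
maybe_clear_cache(cpu)  # silently skipped, no AttributeError
```

Duck-typing with `hasattr` keeps the evaluation path device-agnostic without importing HPU-specific types just to do an `isinstance` check.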
