autotune target_bits example for llama recipe #2344
Conversation
Signed-off-by: He, Xin3 <[email protected]>
PR Reviewer Guide 🔍
PR Code Suggestions ✨
Pull request overview
This PR enhances AutoRound quantization configuration by improving parameter management and enabling flexible mixed-precision quantization through the target_bits parameter.
Key Changes:
- Modified `target_bits` from `int` to `float` in AutoRoundConfig to support fractional bit-width targets
- Refactored config classes to remove redundant `params_list` attributes and use dynamic generation instead
- Added a `non_tunable_params` mechanism in `TorchBaseConfig` to exclude specific parameters from tuning
- Added an `output_dir` parameter to AutoRoundConfig for managing temporary file storage
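To make the fractional target concrete, here is a minimal, hedged sketch of constructing the updated config; the `target_bits` and `output_dir` keywords come from this PR's description, while the values chosen are illustrative assumptions rather than the example shipped here.

```python
# Hedged sketch, not the shipped example: constructing an AutoRoundConfig with
# the new options described above (target_bits as float, output_dir).
from neural_compressor.torch.quantization import AutoRoundConfig

config = AutoRoundConfig(
    # A fractional target lets AutoRound mix per-layer bit-widths (e.g. some
    # layers at 4 bits, others at 8) so the average lands near 5.5 bits.
    target_bits=5.5,
    # Where temporary export files are written during quantization.
    output_dir="./tmp_autoround",
)

# Per this PR, internal/non-tunable fields are filtered out of the dict view.
print(config.to_dict())
```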
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.
Summary per file:

| File | Description |
|---|---|
| neural_compressor/torch/quantization/config.py | Refactored config classes: removed hardcoded `params_list`, added `non_tunable_params` initialization in `TorchBaseConfig.__init__`, changed `target_bits` type to float, added `output_dir` parameter |
| neural_compressor/common/base_config.py | Updated tuning-parameter filtering logic to check against `non_tunable_params`, added internal parameter filtering in `to_dict()` |
| neural_compressor/torch/algorithms/weight_only/autoround.py | Refactored device handling to store the accelerator object, added model reloading logic for specific export formats with memory cleanup |
| neural_compressor/torch/utils/auto_accelerator.py | Enhanced the CPU accelerator's `empty_cache()` to call `gc.collect()` instead of being a no-op |
| examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/quantize.py | New example script demonstrating AutoRound quantization with the `target_bits` parameter |
| examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md | New comprehensive documentation for Llama3 quantization recipes and inference |
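For readers unfamiliar with the `non_tunable_params` idea referenced in the table above, the following is an illustrative, simplified sketch of the mechanism (tunable parameters derived dynamically, with an exclusion list). The class and method names are hypothetical and not the actual neural-compressor code.

```python
# Illustrative sketch only: a base config that derives its tunable parameters
# dynamically instead of keeping a hardcoded params_list, and skips entries
# registered as non-tunable (the approach this PR describes, not its exact code).
class BaseConfigSketch:
    def __init__(self):
        # Parameters named here are excluded when building the tuning space,
        # e.g. output_dir is bookkeeping rather than a quantization knob.
        self.non_tunable_params = ["output_dir"]

    def tunable_params(self):
        excluded = set(self.non_tunable_params) | {"non_tunable_params"}
        # Anything that is not private and not explicitly excluded is tunable.
        return {
            name: value
            for name, value in vars(self).items()
            if not name.startswith("_") and name not in excluded
        }


class AutoRoundConfigSketch(BaseConfigSketch):
    def __init__(self, target_bits=8.0, output_dir="./tmp"):
        super().__init__()
        self.target_bits = target_bits  # tunable
        self.output_dir = output_dir    # excluded via non_tunable_params


cfg = AutoRoundConfigSketch(target_bits=5.5)
assert "output_dir" not in cfg.tunable_params()
```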
yiliu30 left a comment:
Overall, LGTM.
It would be better not to mix example changes and new features in one PR.
Signed-off-by: He, Xin3 <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Tang Kaihui <[email protected]>
Right, most of the issues were found while enabling the example, and they are mixed in here to speed up development.
Please update this readme with the new examples.
https://github.com/intel/neural-compressor/tree/xinhe/vllm/examples#quantization
The CI issue looks unrelated to this PR; I will fix it in another PR.
PR Type
Enhancement
Description
- Add `target_bits` as float in AutoRoundConfig
- Remove redundant `params_list` in multiple classes
- Introduce `non_tunable_params` in `TorchBaseConfig`
- Add `output_dir` to AutoRoundConfig
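Given the PR title (autotuning `target_bits` for a Llama recipe), a hedged sketch of what such a tuning loop could look like follows. `autotune` and `TuningConfig` are the generic neural-compressor 3.x tuning entry points; the model choice, candidate bit targets, and the evaluation stub are assumptions rather than the quantize.py added by this PR.

```python
# Hedged sketch, not the quantize.py shipped in this PR: try a few fractional
# target_bits candidates with autotune and keep the best-scoring one.
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import AutoRoundConfig, TuningConfig, autotune

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model, not necessarily the recipe's
)


def eval_fn(q_model) -> float:
    # Placeholder evaluation hook: swap in a real benchmark (e.g. lm-eval-harness)
    # that returns a single accuracy-like score for the candidate model.
    return 0.0


# Candidate configs differ only in the (now float) target_bits value.
config_set = [
    AutoRoundConfig(target_bits=bits, output_dir="./tmp_autoround")
    for bits in (4.0, 5.5, 6.0)
]

best_model = autotune(
    model=model,
    tune_config=TuningConfig(config_set=config_set),
    eval_fn=eval_fn,
)
```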
Diagram Walkthrough

File Walkthrough
- 1 file: Modify AutoRoundConfig and clean up config classes
- 11 files