Description
I am using ONNX Neural Compressor to apply weight-only quantization (WOQ) to a large language model. However, after quantization, the saved ONNX model takes up the same amount of disk space as the original model. The code I am running is below.
import os

import onnx

from onnx_neural_compressor.quantization import matmul_nbits_quantizer

# RTN weight-only quantization, applied layer by layer
algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig(layer_wise_quant=True)

quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model_file,  # path to the original FP32 model
    n_bits=8,
    block_size=128,
    is_symmetric=True,
    algo_config=algo_config,
)
quant.process()
best_model = quant.model

onnx.save_model(
    best_model,
    os.path.join(args.model_output, model_name),
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model.onnx.data",
    size_threshold=1024,
    convert_attribute=False,
)
Issue:
After running the code above, the output model is the same size as the original (both approximately 4.2 GB). I expected the quantized model to be smaller, since the weights are reduced from FP32 to INT8.
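Note that with save_as_external_data=True most of the bytes end up in the model.onnx.data file rather than in the .onnx file itself, so a size comparison has to count everything in the output directory. A quick sketch of such a check (the directory name below is a placeholder for args.model_output):

import os

output_dir = "quantized_model_dir"  # placeholder for args.model_output
total_bytes = sum(
    os.path.getsize(os.path.join(output_dir, name))
    for name in os.listdir(output_dir)
)
print(f"Total size on disk: {total_bytes / 1024 ** 3:.2f} GiB")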
Additional Information:
The quantization process completes without errors.
The model is saved using onnx.save_model with external data parameters.
I have verified that the quantization process modifies the model as intended.
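For what it's worth, the check I have in mind is roughly the following sketch (the model path is a placeholder): if the MatMul nodes were rewritten, they show up as MatMulNBits, and the packed weights appear as UINT8 initializers, while scales and any unquantized weights stay FLOAT.

import collections

import onnx

# Placeholder path to the saved quantized model
model = onnx.load("quantized/model.onnx", load_external_data=False)

# Quantized layers should show up as MatMulNBits nodes.
op_counts = collections.Counter(node.op_type for node in model.graph.node)
print(op_counts.most_common(10))

# Packed low-bit weights are stored as UINT8 initializers; scales and any
# weights left in full precision remain FLOAT.
dtype_counts = collections.Counter(
    onnx.TensorProto.DataType.Name(init.data_type)
    for init in model.graph.initializer
)
print(dtype_counts)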
Could you please provide guidance on why the model size is not reduced after weight-only quantization? Is there an additional step required to shrink the model on disk?
Thank you for your assistance.