Weight-Only Quantization Does Not Reduce ONNX Model File Size #42

@morteza89

Description

I am using the ONNX Neural Compressor to apply weight-only quantization (WOQ) to a large language model. However, after quantization, the size of the saved ONNX model on disk remains the same as the original model.

from onnx_neural_compressor.quantization import matmul_nbits_quantizer
import onnx
import os

# RTN weight-only quantization, applied layer by layer to limit peak memory.
algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig(layer_wise_quant=True)
quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model_file,  # path to the original FP32 ONNX model
    n_bits=8,
    block_size=128,
    is_symmetric=True,
    algo_config=algo_config,
)
quant.process()
best_model = quant.model

# Save the quantized model with weights stored in a single external data file.
onnx.save_model(
    best_model,
    os.path.join(args.model_output, model_name),
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model.onnx.data",
    size_threshold=1024,
    convert_attribute=False,
)

Issue:

After running the above code, the output model is the same size as the original (both approximately 4.2 GB). I expected the quantized model to be smaller, since the weights are reduced from FP32 to int8.
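As a rough back-of-the-envelope check (assuming the 4.2 GB is dominated by FP32 weights), 8-bit weights plus per-block scales should come out to roughly a quarter of the original size:

# Back-of-the-envelope estimate (assumption: ~4.2 GB is almost all FP32 weights).
fp32_bytes = 4.2e9
n_bits = 8
block_size = 128

# Each 4-byte FP32 weight becomes 1 byte, plus roughly one 4-byte scale
# per block of 128 weights.
n_weights = fp32_bytes / 4
quantized_bytes = n_weights * (n_bits / 8) + (n_weights / block_size) * 4
print(f"expected quantized weight payload: {quantized_bytes / 1e9:.2f} GB")  # ~1.08 GB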

Additional Information:

The quantization process completes without errors.

The model is saved using onnx.save_model with external data parameters.

I have verified that the quantization process modifies the model as intended (see the sketch below for one way to confirm this).
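For example, one way to confirm the graph was rewritten (a minimal sketch; the paths reuse the variables from the snippet above) is to count MatMulNBits nodes and list the on-disk sizes of both the graph file and its external data file:

import os
import onnx

quantized_path = os.path.join(args.model_output, model_name)

# Load only the graph structure, not the external weight data.
model = onnx.load(quantized_path, load_external_data=False)

# If WOQ ran, MatMuls over quantized weights should now be MatMulNBits nodes.
nbits = sum(1 for node in model.graph.node if node.op_type == "MatMulNBits")
print(f"MatMulNBits nodes: {nbits}")

# The weights live in the external data file, so both files need to be measured.
for path in (quantized_path, os.path.join(args.model_output, "model.onnx.data")):
    if os.path.exists(path):
        print(f"{path}: {os.path.getsize(path) / 1e9:.2f} GB")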

Could you please provide guidance on why the model size is not reduced after weight-only quantization? Is there an additional step required to compress the model size on disk?
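One thing I am unsure about is whether a stale external data file from an earlier run could be involved; if the external-data writer appends to an existing file rather than overwriting it (an assumption on my part), something like the following, reusing the variables from the snippet above, would rule that out:

# Assumption: a leftover "model.onnx.data" from a previous save might be
# appended to instead of overwritten, keeping old FP32 bytes around.
# Deleting any stale file before saving rules that out.
data_path = os.path.join(args.model_output, "model.onnx.data")
if os.path.exists(data_path):
    os.remove(data_path)

onnx.save_model(
    best_model,
    os.path.join(args.model_output, model_name),
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model.onnx.data",
    size_threshold=1024,
    convert_attribute=False,
)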

Thank you for your assistance.
