Description
I am using ONNX Neural Compressor to apply weight-only quantization (WOQ) to a large language model. However, after quantization, the saved ONNX model takes up the same amount of disk space as the original model. The code I am running is below.
import os

import onnx

from onnx_neural_compressor.quantization import matmul_nbits_quantizer

# RTN weight-only quantization, applied layer by layer
algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig(layer_wise_quant=True)

quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model_file,  # path to the original FP32 model
    n_bits=8,
    block_size=128,
    is_symmetric=True,
    algo_config=algo_config,
)
quant.process()
best_model = quant.model

onnx.save_model(
    best_model,
    os.path.join(args.model_output, model_name),
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model.onnx.data",
    size_threshold=1024,
    convert_attribute=False,
)
Issue:
After running the code above, the output model is the same size as the original (both approximately 4.2 GB). I expected the quantized model to be smaller, since the weights are reduced from FP32 to INT8.
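Note that with save_as_external_data=True most of the bytes end up in the model.onnx.data file rather than in the .onnx file itself, so a size comparison has to count everything in the output directory. A quick sketch of such a check (the directory name below is a placeholder for args.model_output):

import os

output_dir = "quantized_model_dir"  # placeholder for args.model_output
total_bytes = sum(
    os.path.getsize(os.path.join(output_dir, name))
    for name in os.listdir(output_dir)
)
print(f"Total size on disk: {total_bytes / 1024 ** 3:.2f} GiB")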
Additional Information:
The quantization process completes without errors.
The model is saved using onnx.save_model with external data parameters.
I have verified that the quantization process modifies the model as intended.
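For what it's worth, the check I have in mind is roughly the following sketch (the model path is a placeholder): if the MatMul nodes were rewritten, they show up as MatMulNBits, and the packed weights appear as UINT8 initializers, while scales and any unquantized weights stay FLOAT.

import collections

import onnx

# Placeholder path to the saved quantized model
model = onnx.load("quantized/model.onnx", load_external_data=False)

# Quantized layers should show up as MatMulNBits nodes.
op_counts = collections.Counter(node.op_type for node in model.graph.node)
print(op_counts.most_common(10))

# Packed low-bit weights are stored as UINT8 initializers; scales and any
# weights left in full precision remain FLOAT.
dtype_counts = collections.Counter(
    onnx.TensorProto.DataType.Name(init.data_type)
    for init in model.graph.initializer
)
print(dtype_counts)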
Could you please provide guidance on why the model size is not reduced after weight-only quantization? Is there an additional step required to shrink the model on disk?
Thank you for your assistance.