Description
For my application (which relies on OpenVINO Model Server), I would like to use OpenVINO-quantized models from the Hugging Face Hub and avoid doing the quantization step myself.
For chat LLM models, e.g. OpenVINO/Qwen2.5-7B-Instruct-int4-ov, I'll need an additional graph.pbtxt for OVMS to work. The same graph.pbtxt seems to work for all chat models, so I can ship a single pre-generated graph.pbtxt.
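For reference, the pre-generated graph.pbtxt I have in mind looks roughly like the one from the OVMS continuous-batching demo. This is an illustrative sketch only, not a canonical config; field names are copied from that demo and may differ between OVMS versions:

```
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: "LOOPBACK:0",
    back_edge: true
  }
  node_options: {
    [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
      models_path: "./"
    }
  }
}
```

Because the calculator only needs `models_path` pointed at the model directory, this file is model-independent for chat models, which is why pre-generating it once seems viable.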
However, for embedder (and reranker) models, e.g. OpenVINO/bge-base-en-v1.5-int8-ov, I'll need to include graph.pbtxt, openvino_detokenizer.bin, and openvino_detokenizer.xml. The detokenizer files appear to be model-dependent, so shipping pre-generated copies is not reliable.
Is there a solution for using OpenVINO-quantized embedder/reranker models straight from the Hugging Face Hub? Or do I have to quantize the base models (e.g. BAAI/bge-base-en-v1.5) myself with export_model.py?