⚙️ Your current environment
vllm=0.10.2
transformers=4.56.1
torch=2.8.0+cu128
autoawq=0.2.9 (manually installed)
🐛 Describe the bug
Hello vLLM developers,
I am using your vllm/vllm-openai:v0.10.2-x86_64 Docker image, deployed on a Linux server with 6 H800 GPUs. The model I am trying to serve is GLM-4.6-AWQ.
After entering the Docker container, I ran the following command:
vllm serve \
/data \
--served-model-name glm46 \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 4 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
The server starts successfully:
(Apologies for the photo format, as our computers are offline.)
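For reference, completions were requested through the server's OpenAI-compatible API roughly as follows; this snippet is only an illustrative sketch and the prompt is a placeholder, not the exact client code I used.
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder prompt; the real requests used different content.
resp = client.chat.completions.create(
    model="glm46",
    messages=[{"role": "user", "content": "Hello, please introduce yourself."}],
)
print(resp.choices[0].message.content)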
However, the output text is garbled:
I also tried loading the model in code using:
from vllm import LLM
model = LLM('/data', tensor_parallel_size=4)
but the output is still garbled:
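A minimal sketch of how text was generated from that LLM instance (the prompt and sampling settings here are placeholders, not the exact values I used):
from vllm import SamplingParams

# `model` is the LLM instance created above.
outputs = model.generate(
    ["Hello, please introduce yourself."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)  # this prints garbled text in my setup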
Next, I tried loading the model without vLLM, using Transformers and AutoAWQ:
from awq import AutoAWQForCausalLM

model_path = "/data"  # same local model directory as above
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    trust_remote_code=True,
    safetensors=True,
    device_map="auto",
)
but it fails with the error: "glm4_moe awq quantization isn't supported yet."
I also tried using AutoModelForCausalLM.from_pretrained, which outputs:
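For completeness, that attempt was roughly the following load call (not its output); the exact arguments may have differed slightly.
from transformers import AutoModelForCausalLM

# Rough reconstruction of the plain-Transformers load attempt; arguments are approximate.
model = AutoModelForCausalLM.from_pretrained(
    "/data",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)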
For reference, the versions listed above (transformers 4.56.1, vllm 0.10.2, torch 2.8.0+cu128) all come from the Docker image; only autoawq 0.2.9 was installed manually.
By the way, I have also tried passing --chat-template, adding --quantization parameters, and so on, but nothing worked. I have confirmed that the model files are not corrupted.
Could you please advise on how to correctly serve this AWQ model with vLLM or Transformers?
Thank you very much!
🛠️ Steps to reproduce
No response