
trl vllm server generating stuck #3467

@AdaChambers

Description

Reproduction

I'm training Qwen3-1.7B with GRPO. The vLLM server runs on GPU 0, and train_grpo.py pins itself to GPU 2:

CUDA_VISIBLE_DEVICES=0 trl vllm-serve --host 127.0.0.1 --port 8014 --max_model_len 512 --model "Qwen/Qwen3-1.7B"
python train_grpo.py
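The server itself comes up normally (the weight-sync logs below confirm it is reachable). A quick standalone check looks like this (a minimal sketch, assuming the /health/ endpoint that trl's vllm-serve exposes; host and port as above):

import requests

# Confirm the vLLM server process answers before training starts.
r = requests.get("http://127.0.0.1:8014/health/", timeout=5)
print(r.status_code)  # expect 200 if the server is up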
train_grpo.py:

import os

from trl import GRPOConfig, GRPOTrainer

os.environ["CUDA_VISIBLE_DEVICES"] = "2"

training_args = GRPOConfig(
    output_dir="Qwen3-1.7B-GRPO",
    logging_steps=20,
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    num_train_epochs=2000,
    max_completion_length=512,
    report_to=("wandb" if use_wandb else None),
    run_name=("sucai_1:1_all" if use_wandb else None),
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_generations=4,
    bf16=True,
    gradient_accumulation_steps=4,
    beta=0.004,
    use_vllm=True,
    vllm_mode="server",
    # vllm_mode="colocate",
    vllm_server_host="127.0.0.1",
    vllm_server_port=8014,
    # vllm_gpu_memory_utilization=0.3,
    # vllm_guided_decoding_regex=r"<think>(.*?)</think><answer>(.*?)</answer>",
)

training_args.model_name = model_name
training_args.train_data_num = len(ds_train)
training_args.test_num = len(ds_test)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_test,
)
trainer.train()

(model, model_name, use_wandb, reward_len, ds_train and ds_test are defined elsewhere in the script.)

After initializing the vLLM client:

self.vllm_client = VLLMClient(
    args.vllm_server_host, args.vllm_server_port, connection_timeout=args.vllm_server_timeout
)
self.vllm_client.init_communicator()
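For isolation, the same client can be driven standalone, outside GRPOTrainer (a minimal sketch; it assumes trl's VLLMClient import path and the generate() signature matching the payload shown below):

# Standalone driver for the running server, bypassing GRPOTrainer entirely.
# Host, port and parameters mirror the config above; the timeout is arbitrary.
from trl.extras.vllm_client import VLLMClient

client = VLLMClient("127.0.0.1", 8014, connection_timeout=240)
client.init_communicator()

# If this call also hangs, the problem is in the server/communicator path,
# not in the trainer loop.
completion_ids = client.generate(["Hello"], n=1, max_tokens=16)
print(completion_ids)
client.close_communicator()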

It gets stuck as soon as the generate POST request is sent, and no error is reported on either the client or the server:

url = f"http://{self.host}:{self.server_port}/generate/"
prompts = ["xx", "xx", "xx", "xx"]
n = 4
temperature = 0.9
repetition_penalty = 1.0
top_p = 1.0
top_k = 50
min_p = 0.0
max_tokens = 512
guided_decoding_regex = None
response = self.session.post(
    url,
    json={
        "prompts": prompts,
        "n": n,
        "repetition_penalty": repetition_penalty,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "min_p": min_p,
        "max_tokens": max_tokens,
        "guided_decoding_regex": guided_decoding_regex,
    },
)
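Sending the same payload by hand with a hard timeout separates "the server never answers" from "the client never sends" (a sketch; the endpoint and JSON body are exactly the ones above, the 60-second timeout is arbitrary):

import requests

# Same request the trainer sends, but with a client-side timeout so a hang
# surfaces as an exception instead of blocking forever.
payload = {
    "prompts": ["xx"],
    "n": 1,
    "repetition_penalty": 1.0,
    "temperature": 0.9,
    "top_p": 1.0,
    "top_k": 50,
    "min_p": 0.0,
    "max_tokens": 512,
    "guided_decoding_regex": None,
}
try:
    r = requests.post("http://127.0.0.1:8014/generate/", json=payload, timeout=60)
    print(r.status_code, r.json())
except requests.exceptions.Timeout:
    print("server accepted the connection but never answered /generate/")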

No error is reported.

Server log:

INFO:     127.0.0.1:35524 - "POST /update_named_param/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:35524 - "POST /update_named_param/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:35524 - "POST /update_named_param/ HTTP/1.1" 200 OK
... (the same line repeats for every /update_named_param/ call) ...
INFO 05-19 16:17:14 [block_pool.py:264] Successfully reset prefix cache
INFO:     127.0.0.1:35524 - "POST /reset_prefix_cache/ HTTP/1.1" 200 OK
<<stuck here>>

Client log:

 0%|                | 0/6152000 [00:00<?, ?it/s]

I've tried other ports, but nothing changed.
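A stack dump on demand would show where the client blocks (a standard-library sketch; add it at the top of train_grpo.py and send SIGUSR1 to the training process):

import faulthandler
import signal

# After this, `kill -USR1 <training pid>` prints every thread's Python
# traceback to stderr, showing exactly which call is blocking.
faulthandler.register(signal.SIGUSR1)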

System Info

  • Platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.31
  • Python version: 3.10.16
  • TRL version: 0.18.0.dev0+4da4dc9
  • PyTorch version: 2.6.0
  • CUDA device(s): NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20
  • Transformers version: 4.51.3
  • Accelerate version: 1.6.0
  • Accelerate config: not found
  • Datasets version: 3.2.0
  • HF Hub version: 0.31.1
  • bitsandbytes version: 0.45.5
  • DeepSpeed version: 0.16.7
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.78.1
  • PEFT version: 0.15.2
  • vLLM version: 0.8.5.post1

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete

Labels

🏋 GRPO (Related to GRPO) · 🐛 bug (Something isn't working)
