Reproduction
I'm training Qwen3-1.7B with GRPO. First I start the vLLM server, then I launch the training script:
```shell
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --host 127.0.0.1 --port 8014 --max_model_len 512 --model "Qwen/Qwen3-1.7B"
python train_grpo.py
```
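(Not part of the original report: before launching training, it can help to confirm the server answers HTTP at all. A minimal sketch, assuming the server exposes the same `/health/` route that TRL's `VLLMClient` pings when it checks for the server; adjust host/port to the `trl vllm-serve` flags above.)

```python
# Sanity check. Assumption: the TRL vLLM server exposes a /health/ route,
# which is what VLLMClient uses to detect a running server.
import requests

resp = requests.get("http://127.0.0.1:8014/health/", timeout=10)
print(resp.status_code)  # 200 means the server is reachable
```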
In `train_grpo.py`:

```python
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "2"

from trl import GRPOConfig, GRPOTrainer

# use_wandb, model, model_name, reward_len, ds_train and ds_test are defined
# elsewhere in the script.
training_args = GRPOConfig(
    output_dir="Qwen3-1.7B-GRPO",
    logging_steps=20,
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    num_train_epochs=2000,
    max_completion_length=512,
    report_to=("wandb" if use_wandb else None),
    run_name=("sucai_1:1_all" if use_wandb else None),
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_generations=4,
    bf16=True,
    gradient_accumulation_steps=4,
    beta=0.004,
    use_vllm=True,
    vllm_mode="server",
    # vllm_mode="colocate",
    vllm_server_host="127.0.0.1",
    vllm_server_port=8014,
    # vllm_gpu_memory_utilization=0.3,
    # vllm_guided_decoding_regex=r"<think>(.*?)</think><answer>(.*?)</answer>",
)
# Custom attributes attached to the config
training_args.model_name = model_name
training_args.train_data_num = len(ds_train)
training_args.test_num = len(ds_test)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_test,
)
trainer.train()
```

After the trainer initializes the vLLM client:
```python
self.vllm_client = VLLMClient(
    args.vllm_server_host, args.vllm_server_port, connection_timeout=args.vllm_server_timeout
)
self.vllm_client.init_communicator()
```
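(Aside, not in the original report: to isolate the hang from the trainer, a minimal standalone sketch along these lines can exercise the client directly. The `VLLMClient` import path and the `generate()` keyword arguments are assumptions based on the TRL source quoted above.)

```python
# Hypothetical standalone repro: talk to the running server without GRPOTrainer.
from trl.extras.vllm_client import VLLMClient

client = VLLMClient("127.0.0.1", 8014, connection_timeout=240.0)
client.init_communicator()  # same step the trainer performs
out = client.generate(["Hello"], n=1, max_tokens=16)  # hangs here if the bug reproduces
print(out)
```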
training gets stuck as soon as generation starts: the client sends the following POST request and then hangs, with no error reported on either the client or the server.

```python
url = f"http://{self.host}:{self.server_port}/generate/"
prompts = ["xx", "xx", "xx", "xx"]
n = 4
temperature = 0.9
repetition_penalty = 1.0
top_p = 1.0
top_k = 50
min_p = 0
max_tokens = 512
guided_decoding_regex = None
response = self.session.post(
    url,
    json={
        "prompts": prompts,
        "n": n,
        "repetition_penalty": repetition_penalty,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "min_p": min_p,
        "max_tokens": max_tokens,
        "guided_decoding_regex": guided_decoding_regex,
    },
)
```
Server log:

```
INFO: 127.0.0.1:35524 - "POST /update_named_param/ HTTP/1.1" 200 OK
(the line above repeats 24 times in total)
INFO 05-19 16:17:14 [block_pool.py:264] Successfully reset prefix cache
INFO: 127.0.0.1:35524 - "POST /reset_prefix_cache/ HTTP/1.1" 200 OK
<<stuck here>>
```
Client log:

```
0%| | 0/6152000 [00:00<?, ?it/s]
```
I've also tried other ports, but nothing changed.
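(Also not in the original report: since neither side prints an error, one way to see where the client is stuck is to dump its thread stacks. A minimal sketch using only the standard library; add it near the top of `train_grpo.py` and send `SIGUSR1` to the process once it hangs.)

```python
# Dump all Python thread stacks to stderr on SIGUSR1 (stdlib only, Unix).
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)
# Once the run hangs:  kill -USR1 <pid of train_grpo.py>
```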
System Info
- Platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.31
- Python version: 3.10.16
- TRL version: 0.18.0.dev0+4da4dc9
- PyTorch version: 2.6.0
- CUDA device(s): NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20
- Transformers version: 4.51.3
- Accelerate version: 1.6.0
- Accelerate config: not found
- Datasets version: 3.2.0
- HF Hub version: 0.31.1
- bitsandbytes version: 0.45.5
- DeepSpeed version: 0.16.7
- Diffusers version: not installed
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.78.1
- PEFT version: 0.15.2
- vLLM version: 0.8.5.post1
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
- Any traceback provided is complete