When reproducing the LLaMA-3-8B-Instruct Bradley-Terry (BT) reward model from "RLHF Workflow: From Reward Modeling to Online RLHF" (the comprehensive practical alignment recipe of iterative preference learning), the trained model outputs the same score for the chosen and the rejected response, whereas models such as gemma-2b-it and Qwen2.5-0.5B-Instruct trained with the same setup produce correct, distinct scores.
Could you please provide some clues about the environment or the training config?
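For context, this is roughly how I compare the scores of a chosen vs. rejected pair with a trained checkpoint (the checkpoint path and the messages are placeholders, and I assume the single-logit sequence-classification head used by the BT recipe):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path to the trained BT reward-model checkpoint
model_path = "bt_models/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, num_labels=1, torch_dtype=torch.bfloat16
).cuda().eval()

def score(messages):
    # Render the conversation with the chat template and return the scalar reward
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

prompt = {"role": "user", "content": "..."}         # placeholder prompt
chosen = {"role": "assistant", "content": "..."}    # placeholder chosen response
rejected = {"role": "assistant", "content": "..."}  # placeholder rejected response
print("chosen:", score([prompt, chosen]), "rejected:", score([prompt, rejected]))
```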
Command
cd RLHF-Reward-Modeling/
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29502 ./bradley-terry-rm/gemma_2B_rm_orig.py --deepspeed deepspeed_configs/deepspeed_2.json --model_name Meta-Llama-3-8B-Instruct --max_length 4096 --train_set_path hendrydong/preference_700K --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --learning_rate 2e-05 --weight_decay 0.0005 --output_path bt_models/Meta-Llama-3-8B-Instruct/
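For clarity, the effective global batch size with this command is per_device_train_batch_size (2) × 4 GPUs × gradient_accumulation_steps (4) = 32 preference pairs per optimizer step.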
Environment
CUDA 12.1 (could this be the problem?)
The environment is set up exactly as in RLHF-Reward-Modeling/bradley-terry-rm/README.md:
python=3.10.9
pip3 install torch==2.1.2 torchvision torchaudio
pip install flash-attn==2.6.3
pip install accelerate==0.33.0 # for gemma2 and llama3.1
pip install deepspeed==0.12.2
pip install transformers==4.43.4
pip install numpy==1.26.4
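To double-check that the installed versions actually match the README, here is a quick sanity-check snippet (it only prints the installed library versions and the CUDA version torch was built against):

```python
import torch, transformers, accelerate, deepspeed, flash_attn

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("deepspeed:", deepspeed.__version__)
print("flash_attn:", flash_attn.__version__)
```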
How I found that the rewards are all the same
I added a debug print inside compute_loss:
def compute_loss(...):
    ...
    print(loss, {"rewards_j": rewards_j, "rewards_k": rewards_k})
and the output looks like this:
tensor(0.6914, device='cuda:2', dtype=torch.bfloat16, grad_fn=<NegBackward0>) {'rewards_j': tensor([[6.2188],
[6.2188]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>), 'rewards_k': tensor([[6.2188],
[6.2188]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>)}
tensor(0.6914, device='cuda:2', dtype=torch.bfloat16, grad_fn=<NegBackward0>) {'rewards_j': tensor([[6.2500],
[6.2500]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>), 'rewards_k': tensor([[6.2500],
[6.2500]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>)}
tensor(0.6914, device='cuda:0', dtype=torch.bfloat16, grad_fn=<NegBackward0>) {'rewards_j': tensor([[6.2500],
[6.2500]], device='cuda:0', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>), 'rewards_k': tensor([[6.2500],
[6.2500]], device='cuda:0', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>)}
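For reference, here is a minimal sketch of the Bradley-Terry compute_loss I am printing from, written in the TRL-style reward-modeling convention (the input_ids_j / input_ids_k key names are my assumption and may differ slightly from the actual script):

```python
import torch.nn.functional as F

def compute_loss(model, inputs):
    # Scalar reward (the single classification logit) for the chosen (j)
    # and rejected (k) responses.
    rewards_j = model(input_ids=inputs["input_ids_j"],
                      attention_mask=inputs["attention_mask_j"])[0]
    rewards_k = model(input_ids=inputs["input_ids_k"],
                      attention_mask=inputs["attention_mask_k"])[0]
    # Bradley-Terry negative log-likelihood: -log sigmoid(r_j - r_k)
    loss = -F.logsigmoid(rewards_j - rewards_k).mean()
    print(loss, {"rewards_j": rewards_j, "rewards_k": rewards_k})
    return loss
```

Note that the printed loss of 0.6914 is essentially log 2 ≈ 0.6931, which is exactly the value of -log sigmoid(r_j - r_k) when r_j equals r_k, so the model really is assigning identical rewards to every chosen/rejected pair.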
I'm new to this area, so thank you for the great paper and code! I'd appreciate any clues!