When reproducing the LLaMA-3-8B-Instruct Bradley-Terry (BT) reward model from "RLHF Workflow: From Reward Modeling to Online RLHF" (the comprehensive practical alignment recipe of iterative preference learning), the trained model outputs the same score for the chosen and the rejected response, whereas models such as gemma-2b-it and Qwen2.5-0.5B-Instruct trained with the same setup produce correct, distinct scores.
Could you please provide some clues about the environment or the training config?
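For context, this is roughly how I compare the scores of a chosen vs. rejected pair with a trained checkpoint (the checkpoint path and the messages are placeholders, and I assume the single-logit sequence-classification head used by the BT recipe):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path to the trained BT reward-model checkpoint
model_path = "bt_models/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, num_labels=1, torch_dtype=torch.bfloat16
).cuda().eval()

def score(messages):
    # Render the conversation with the chat template and return the scalar reward
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

prompt = {"role": "user", "content": "..."}         # placeholder prompt
chosen = {"role": "assistant", "content": "..."}    # placeholder chosen response
rejected = {"role": "assistant", "content": "..."}  # placeholder rejected response
print("chosen:", score([prompt, chosen]), "rejected:", score([prompt, rejected]))
```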
Command
cd RLHF-Reward-Modeling/
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29502 ./bradley-terry-rm/gemma_2B_rm_orig.py --deepspeed deepspeed_configs/deepspeed_2.json --model_name Meta-Llama-3-8B-Instruct --max_length 4096 --train_set_path hendrydong/preference_700K --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --learning_rate 2e-05 --weight_decay 0.0005 --output_path bt_models/Meta-Llama-3-8B-Instruct/
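For clarity, the effective global batch size with this command is per_device_train_batch_size (2) × 4 GPUs × gradient_accumulation_steps (4) = 32 preference pairs per optimizer step.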
Environment
CUDA 12.1 (could this be the problem?)
The environment is set up exactly as in RLHF-Reward-Modeling/bradley-terry-rm/README.md:
python=3.10.9
pip3 install torch==2.1.2 torchvision torchaudio
pip install flash-attn==2.6.3
pip install accelerate==0.33.0 # for gemma2 and llama3.1
pip install deepspeed==0.12.2
pip install transformers==4.43.4
pip install numpy==1.26.4
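To double-check that the installed versions actually match the README, here is a quick sanity-check snippet (it only prints the installed library versions and the CUDA version torch was built against):

```python
import torch, transformers, accelerate, deepspeed, flash_attn

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("deepspeed:", deepspeed.__version__)
print("flash_attn:", flash_attn.__version__)
```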
How I found that the rewards are all the same
I added a debug print inside compute_loss:
def compute_loss(...):
    ...
    print(loss, {"rewards_j": rewards_j, "rewards_k": rewards_k})
and the output looks like this:
tensor(0.6914, device='cuda:2', dtype=torch.bfloat16, grad_fn=<NegBackward0>) {'rewards_j': tensor([[6.2188],
[6.2188]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>), 'rewards_k': tensor([[6.2188],
[6.2188]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>)}
tensor(0.6914, device='cuda:2', dtype=torch.bfloat16, grad_fn=<NegBackward0>) {'rewards_j': tensor([[6.2500],
[6.2500]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>), 'rewards_k': tensor([[6.2500],
[6.2500]], device='cuda:2', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>)}
tensor(0.6914, device='cuda:0', dtype=torch.bfloat16, grad_fn=<NegBackward0>) {'rewards_j': tensor([[6.2500],
[6.2500]], device='cuda:0', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>), 'rewards_k': tensor([[6.2500],
[6.2500]], device='cuda:0', dtype=torch.bfloat16,
grad_fn=<IndexBackward0>)}
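For reference, here is a minimal sketch of the Bradley-Terry compute_loss I am printing from, written in the TRL-style reward-modeling convention (the input_ids_j / input_ids_k key names are my assumption and may differ slightly from the actual script):

```python
import torch.nn.functional as F

def compute_loss(model, inputs):
    # Scalar reward (the single classification logit) for the chosen (j)
    # and rejected (k) responses.
    rewards_j = model(input_ids=inputs["input_ids_j"],
                      attention_mask=inputs["attention_mask_j"])[0]
    rewards_k = model(input_ids=inputs["input_ids_k"],
                      attention_mask=inputs["attention_mask_k"])[0]
    # Bradley-Terry negative log-likelihood: -log sigmoid(r_j - r_k)
    loss = -F.logsigmoid(rewards_j - rewards_k).mean()
    print(loss, {"rewards_j": rewards_j, "rewards_k": rewards_k})
    return loss
```

Note that the printed loss of 0.6914 is essentially log 2 ≈ 0.6931, which is exactly the value of -log sigmoid(r_j - r_k) when r_j equals r_k, so the model really is assigning identical rewards to every chosen/rejected pair.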
I'm new to this area, so thank you for the great paper and code! I'd appreciate any clues!