
Have trouble reproducing LLaMA-3-8B-it BT #49

@jijivski

Description


When reproducing the LLaMA-3-8B-it Bradley-Terry (BT) reward model from "RLHF Workflow: From Reward Modeling to Online RLHF, A Comprehensive Practical Alignment Recipe of Iterative Preference Learning", the trained model outputs the same score for chosen and rejected responses. Smaller models such as gemma-2b-it and Qwen2.5-0.5B-Instruct train correctly with the same setup.
Could you please provide some clues about the environment or the training config?

Related: huggingface/trl#2265

command

cd RLHF-Reward-Modeling/ 
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29502 ./bradley-terry-rm/gemma_2B_rm_orig.py --deepspeed deepspeed_configs/deepspeed_2.json --model_name Meta-Llama-3-8B-Instruct --max_length 4096 --train_set_path hendrydong/preference_700K --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --learning_rate 2e-05 --weight_decay 0.0005  --output_path bt_models/Meta-Llama-3-8B-Instruct/
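
To rule out a data problem, here is a minimal sketch to confirm that the chosen and rejected responses actually differ in the training set (the "chosen"/"rejected" column names are an assumption about hendrydong/preference_700K, in line with how preference datasets in this repo are usually laid out):

    from datasets import load_dataset

    # Sanity check: the chosen and rejected responses should differ.
    # Column names "chosen"/"rejected" are an assumption about this dataset.
    ds = load_dataset("hendrydong/preference_700K", split="train")
    sample = ds[0]
    print(list(sample.keys()))
    print(sample["chosen"] == sample["rejected"])  # expect False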

environment

CUDA 12.1 (could this be the problem?)
The environment otherwise follows RLHF-Reward-Modeling/bradley-terry-rm/README.md:

python=3.10.9
pip3 install torch==2.1.2 torchvision torchaudio
pip install flash-attn==2.6.3
pip install accelerate==0.33.0 # for gemma2 and llama3.1
pip install deepspeed==0.12.2
pip install transformers==4.43.4
pip install numpy==1.26.4
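
To rule out silently mismatched packages, a quick sanity-check sketch (my own addition, not from the README) that prints the versions which actually resolved:

    import torch, transformers, accelerate, deepspeed, flash_attn, numpy

    # Print the versions that actually resolved, to compare against the pins above.
    for mod in (torch, transformers, accelerate, deepspeed, flash_attn, numpy):
        print(mod.__name__, mod.__version__)
    print("CUDA available:", torch.cuda.is_available(), "CUDA:", torch.version.cuda)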

how I found the rewards are all the same

I added a print inside compute_loss:

def compute_loss(...):
    print(loss, {"rewards_j": rewards_j, "rewards_k": rewards_k})

The printed output (rewards_j and rewards_k are identical in every batch):
tensor(0.6914, device='cuda:2', dtype=torch.bfloat16, grad_fn=<NegBackward0>) {'rewards_j': tensor([[6.2188],
        [6.2188]], device='cuda:2', dtype=torch.bfloat16,
       grad_fn=<IndexBackward0>), 'rewards_k': tensor([[6.2188],
        [6.2188]], device='cuda:2', dtype=torch.bfloat16,
       grad_fn=<IndexBackward0>)}
tensor(0.6914, device='cuda:2', dtype=torch.bfloat16, grad_fn=<NegBackward0>) {'rewards_j': tensor([[6.2500],
        [6.2500]], device='cuda:2', dtype=torch.bfloat16,
       grad_fn=<IndexBackward0>), 'rewards_k': tensor([[6.2500],
        [6.2500]], device='cuda:2', dtype=torch.bfloat16,
       grad_fn=<IndexBackward0>)}
tensor(0.6914, device='cuda:0', dtype=torch.bfloat16, grad_fn=<NegBackward0>) {'rewards_j': tensor([[6.2500],
        [6.2500]], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<IndexBackward0>), 'rewards_k': tensor([[6.2500],
        [6.2500]], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<IndexBackward0>)}
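
One observation: the printed loss 0.6914 is (up to bf16 rounding) ln 2 ≈ 0.6931, which is exactly the Bradley-Terry loss -log(sigmoid(r_j - r_k)) when the two rewards are equal, so the model is stuck at chance level. A minimal sketch in plain PyTorch (not the repo's exact compute_loss):

    import torch
    import torch.nn.functional as F

    # Identical rewards for chosen (j) and rejected (k), as in the logs above.
    rewards_j = torch.tensor([[6.2188], [6.2188]])
    rewards_k = torch.tensor([[6.2188], [6.2188]])

    # Bradley-Terry pairwise loss: -log(sigmoid(r_j - r_k)).
    loss = -F.logsigmoid(rewards_j - rewards_k).mean()
    print(loss)  # tensor(0.6931) = ln 2, i.e. chance level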

I'm new to this area; thank you for the great paper and code!
Appreciate any clues!
