S1.1 shows great improvement in agents serving LLMs!
I tried to fine-tune Qwen3-1.7B on the simplescaling/s1K-1.1_tokenized dataset using this repository's code, but I cannot figure out why I get worse inference results.
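For context, this is roughly how I inspect the dataset before training (a minimal sketch, assuming the Hugging Face `datasets` API and a hypothetical local copy at the same path as `train_file_path` in the script below):

```python
# Inspect the tokenized dataset before training.
# The path mirrors --train_file_path below (hypothetical local copy of the hub dataset).
from datasets import load_dataset

ds = load_dataset("./simplescaling/s1K-1.1_tokenized")
print(ds)                        # splits and row counts
print(ds["train"].column_names)  # the fields the SFT script consumes
```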
For reference, I ran `bash train/sft.sh` with the following script:

```bash
uid="$(date +%Y%m%d_%H%M%S)"
base_model="../models/Qwen-Qwen3-1.7B/"
lr=1e-5
min_lr=0
epochs=3
weight_decay=1e-4 # same training pipeline as slurm_training
micro_batch_size=1 # batch size becomes 16 with 16 GPUs
gradient_accumulation_steps=16 # requires more GPU memory
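# Effective global batch size = micro_batch_size * gradient_accumulation_steps * gpu_count;
# on this single-GPU run that is 1 * 16 * 1 = 16, matching the 16-GPU slurm setup.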
max_steps=-1
gpu_count=$(nvidia-smi -L | wc -l)
push_to_hub=false
torchrun --nproc-per-node ${gpu_count} --master_port 12345 \
    train/sft-8B.py \
    --block_size=1024 \
    --per_device_train_batch_size=${micro_batch_size} \
    --per_device_eval_batch_size=${micro_batch_size} \
    --gradient_accumulation_steps=${gradient_accumulation_steps} \
    --num_train_epochs=${epochs} \
    --train_file_path="./simplescaling/s1K-1.1_tokenized" \
    --model_name=${base_model} \
    --warmup_ratio=0.05 \
    --fsdp="full_shard auto_wrap" \
    --fsdp_config="train/fsdp_config_qwen.json" \
    --bf16=True \
    --eval_strategy="no" \
    --logging_steps=1 \
    --save_strategy="no" \
    --lr_scheduler_type="cosine" \
    --learning_rate=${lr} \
    --weight_decay=${weight_decay} \
    --adam_beta1=0.9 \
    --adam_beta2=0.95 \
    --output_dir="ckpts/s1-${uid}" \
    --push_to_hub=${push_to_hub} \
    --save_only_model=True \
    --gradient_checkpointing=True
```
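After training finishes, I sanity-check generation from the saved checkpoint like this (a minimal sketch using the standard `transformers` API; the checkpoint path and question are placeholders):

```python
# Quick generation check on the fine-tuned checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "ckpts/s1-<uid>"  # placeholder: the output_dir written by the run above
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How many positive divisors does 360 have?"}]
# Apply the model's chat template so the prompt format matches what SFT saw.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Pointing the same snippet at the base model is how I compare the two.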
Training ran on a single A100 80 GB GPU and finished with the following metrics:

```
{'train_runtime': 5268.8407, 'train_samples_per_second': 0.949, 'train_steps_per_second': 0.119, 'train_loss': 0.1172730620391667, 'epoch': 5.0}
```
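As a quick cross-check of these numbers (using only the log above plus the fact that s1K-1.1 has 1,000 examples):

```python
# Cross-check the reported throughput against the epoch count (values from the log above).
train_runtime = 5268.8407   # seconds
samples_per_second = 0.949
steps_per_second = 0.119

print(samples_per_second * train_runtime)  # ~5000 samples seen = 5 epochs * 1000 examples
print(steps_per_second * train_runtime)    # ~627 optimizer steps in total
```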
Could you please help me? Thank you!