
Why does every epoch during fine-tuning produce a large number of "The grad norm is nan. Skipping updating the model." warnings? #264


Description

@qingmuhe

What is your question?

Why do I see, within every epoch during fine-tuning, a large number of "The grad norm is nan. Skipping updating the model." warnings?

Logs:

[2025-10-29 23:59:01,773][root][WARNING] - The grad norm is nan. Skipping updating the model.
[2025-10-29 23:59:01,779][root][INFO] - train, rank: 0, epoch: 0/100, data_slice: 0/1, step_in_slice: 3/44, step_in_epoch: 3, total step: 3, (loss_avg_rank: 24.221), (loss_avg_slice: 27.633), (ppl_avg_slice: 1.002e+12), (acc_avg_slice: 0.000), (lr: 3.000e-07), [('loss_ctc', 24.181), ('loss_rich', 0.041), ('loss', 24.221), ('acc_rich', 1.0)], {'data_load': '1.313', 'forward_time': '0.201', 'backward_time': '0.158', 'optim_time': '0.111', 'total_time': '1.792'}, GPU, memory: usage: 0.918 GB, peak: 5.583 GB, cache: 6.035 GB, cache_peak: 6.035 GB
[2025-10-29 23:59:03,465][root][WARNING] - The grad norm is nan. Skipping updating the model.
[2025-10-29 23:59:03,471][root][INFO] - train, rank: 0, epoch: 0/100, data_slice: 0/1, step_in_slice: 4/44, step_in_epoch: 4, total step: 4, (loss_avg_rank: 24.631), (loss_avg_slice: 26.882), (ppl_avg_slice: 4.730e+11), (acc_avg_slice: 0.000), (lr: 3.000e-07), [('loss_ctc', 24.595), ('loss_rich', 0.035), ('loss', 24.631), ('acc_rich', 1.0)], {'data_load': '1.220', 'forward_time': '0.197', 'backward_time': '0.155', 'optim_time': '0.112', 'total_time': '1.692'}, GPU, memory: usage: 0.918 GB, peak: 5.583 GB, cache: 6.037 GB, cache_peak: 6.037 GB
[2025-10-29 23:59:05,070][root][WARNING] - The grad norm is nan. Skipping updating the model.
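
For context, this warning comes from a standard guard in the training loop: after backward, the trainer computes the global gradient norm, and if it is non-finite it skips optimizer.step() rather than applying corrupted gradients. Below is a minimal sketch of that pattern (illustrative only; the model/batch interface and function names are assumptions, not FunASR's actual train_ds.py code):

# Minimal sketch of the skip-on-NaN guard (illustrative; not FunASR's
# actual train_ds.py implementation -- interfaces here are assumptions).
import logging
import torch

def train_step(model, batch, optimizer, grad_clip=0.5):
    loss = model(**batch)["loss"]
    loss.backward()
    # clip_grad_norm_ returns the total norm over all gradients; if any
    # gradient contains nan/inf, the returned norm is non-finite.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    if not torch.isfinite(grad_norm):
        logging.warning("The grad norm is nan. Skipping updating the model.")
        optimizer.zero_grad()  # discard the corrupted gradients
        return None
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()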

What have you tried?

My training script:

# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
#  MIT License  (https://opensource.org/licenses/MIT)

workspace=`pwd`

# which GPU(s) to train or finetune on
# export CUDA_VISIBLE_DEVICES="0,1"
export CUDA_VISIBLE_DEVICES="0"
# gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
gpu_num=1

# model_name from model_hub, or model_dir in local path

## option 1, download model automatically
model_name_or_model_dir="iic/SenseVoiceSmall"

## option 2, download model by git
#local_path_root=${workspace}/modelscope_models
#mkdir -p ${local_path_root}/${model_name_or_model_dir}
#git clone https://www.modelscope.cn/${model_name_or_model_dir}.git ${local_path_root}/${model_name_or_model_dir}
#model_name_or_model_dir=${local_path_root}/${model_name_or_model_dir}


# data files: train/val jsonl manifests
train_data=${workspace}/data/scpllm_train.jsonl
val_data=${workspace}/data/scpllm_val.jsonl

# exp output dir
output_dir="./outputs_hotword"
log_file="${output_dir}/log.txt"

deepspeed_config=${workspace}/deepspeed_conf/ds_stage1.json

mkdir -p ${output_dir}
echo "log_file: ${log_file}"

DISTRIBUTED_ARGS="
    --nnodes ${WORLD_SIZE:-1} \
    --nproc_per_node $gpu_num \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-26669}
"

echo $DISTRIBUTED_ARGS

# funasr trainer path
if [ -f `dirname $(which funasr)`/train_ds.py ]; then
    train_tool=`dirname $(which funasr)`/train_ds.py
elif [ -f `dirname $(which funasr)`/../lib/python*/site-packages/funasr/bin/train_ds.py ]; then
    train_tool=`dirname $(which funasr)`/../lib/python*/site-packages/funasr/bin/train_ds.py
else
    echo "Error: train_ds.py not found in funasr bin directory."
    train_tool=/home/ma-user/work/FunASR/funasr/bin/train_ds.py
    # exit 1
fi
ABSOLUTE_PATH=$(cd $(dirname $train_tool); pwd)
train_tool=${ABSOLUTE_PATH}/train_ds.py
echo "Using funasr trainer: ${train_tool}"

torchrun $DISTRIBUTED_ARGS \
${train_tool} \
++model="${model_name_or_model_dir}" \
++trust_remote_code=true \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.data_split_num=1 \
++dataset_conf.batch_sampler="BatchSampler" \
++dataset_conf.batch_size=9000  \
++dataset_conf.sort_size=1024 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=0 \
++train_conf.max_epoch=100 \
++train_conf.log_interval=1 \
++train_conf.resume=true \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=3 \
++train_conf.avg_nbest_model=3 \
++train_conf.use_deepspeed=false \
++train_conf.use_fp16=false \
++optim=adamw \
++optim_conf.betas=[0.9,0.98] \
++optim_conf.weight_decay=0 \
++optim_conf.eps=1e-9 \
++train_conf.grad_clip=0.5 \
++scheduler_conf.warmup_steps=1000 \
++train_conf.deepspeed_config=${deepspeed_config} \
++optim_conf.lr=0.0003 \
++output_dir="${output_dir}" &> ${log_file}



# ++optim_conf.lr=0.0002 \
# ++dataset_conf.max_token_length=100 \
# ++train_conf.lr_scheduler="cosine" \
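
Given that loss_ctc starts around 24 and the slice perplexity is ~1e12, one frequent cause of nan gradients in CTC fine-tuning is bad training samples: empty transcripts, or audio that ends up shorter (after feature subsampling) than its transcript, which drives the CTC loss to inf. A rough sanity check over the jsonl manifests is sketched below; the field names "source", "target", "source_len", "target_len" are assumptions about the manifest layout, so adjust them to match your files.

# Rough sanity check for the training manifests (field names below are
# assumptions about the jsonl layout -- adjust to match your files).
import json

def scan_jsonl(path):
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            item = json.loads(line)
            text = str(item.get("target", ""))
            src_len = item.get("source_len")
            tgt_len = item.get("target_len", len(text))
            if not text.strip():
                print(f"{path}:{i}: empty transcript (can push CTC loss to inf/nan)")
            # CTC needs input frames >= target length after subsampling;
            # a long transcript on very short audio is a classic nan source.
            if src_len is not None and src_len < tgt_len:
                print(f"{path}:{i}: source_len {src_len} < target_len {tgt_len}")

scan_jsonl("data/scpllm_train.jsonl")
scan_jsonl("data/scpllm_val.jsonl")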

What's your environment?

  • OS (e.g., Linux): Ascend: 1*ascend-snt9b1 | ARM: 24 cores, 192 GB RAM, Linux
  • FunASR Version (e.g., 1.0.0): 1.2.7
  • ModelScope Version (e.g., 1.11.0): 1.25.0
  • PyTorch Version (e.g., 2.0.0): 2.3.1
  • How you installed funasr (pip, source): source
  • Python version: Python 3.10.0
  • GPU (e.g., V100M32): ascend-snt9b1 NPU
  • CUDA/cuDNN version (e.g., cuda11.7):
  • Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1):
  • Any other relevant information:
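
Since this runs on an Ascend NPU rather than CUDA, the numeric behavior of some kernels can differ; PyTorch's standard anomaly detection can help locate the first operation whose backward produces a non-finite value:

# Standard PyTorch anomaly detection: raises an error with a traceback at
# the first backward op that yields nan/inf. It is slow, so enable it only
# while debugging a few steps.
import torch

torch.autograd.set_detect_anomaly(True)
# ... run a few training steps; the traceback of the raised error points
# at the forward operation whose backward produced the non-finite value.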
