What is your question?
Why does my fine-tuning produce a large number of "The grad norm is nan. Skipping updating the model." warnings within every epoch?

Log excerpt:

```
[2025-10-29 23:59:01,773][root][WARNING] - The grad norm is nan. Skipping updating the model.
[2025-10-29 23:59:01,779][root][INFO] - train, rank: 0, epoch: 0/100, data_slice: 0/1, step_in_slice: 3/44, step_in_epoch: 3, total step: 3, (loss_avg_rank: 24.221), (loss_avg_slice: 27.633), (ppl_avg_slice: 1.002e+12), (acc_avg_slice: 0.000), (lr: 3.000e-07), [('loss_ctc', 24.181), ('loss_rich', 0.041), ('loss', 24.221), ('acc_rich', 1.0)], {'data_load': '1.313', 'forward_time': '0.201', 'backward_time': '0.158', 'optim_time': '0.111', 'total_time': '1.792'}, GPU, memory: usage: 0.918 GB, peak: 5.583 GB, cache: 6.035 GB, cache_peak: 6.035 GB
[2025-10-29 23:59:03,465][root][WARNING] - The grad norm is nan. Skipping updating the model.
[2025-10-29 23:59:03,471][root][INFO] - train, rank: 0, epoch: 0/100, data_slice: 0/1, step_in_slice: 4/44, step_in_epoch: 4, total step: 4, (loss_avg_rank: 24.631), (loss_avg_slice: 26.882), (ppl_avg_slice: 4.730e+11), (acc_avg_slice: 0.000), (lr: 3.000e-07), [('loss_ctc', 24.595), ('loss_rich', 0.035), ('loss', 24.631), ('acc_rich', 1.0)], {'data_load': '1.220', 'forward_time': '0.197', 'backward_time': '0.155', 'optim_time': '0.112', 'total_time': '1.692'}, GPU, memory: usage: 0.918 GB, peak: 5.583 GB, cache: 6.037 GB, cache_peak: 6.037 GB
[2025-10-29 23:59:05,070][root][WARNING] - The grad norm is nan. Skipping updating the model.
```
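A first debugging step for this warning is to find which parameters' gradients go non-finite right after `loss.backward()`. Below is a minimal, generic PyTorch sketch (not a FunASR API; `model` and the loop placement are assumptions) that could be hooked into `train_ds.py` or a stripped-down loop; `torch.autograd.set_detect_anomaly(True)` is another option, though much slower:

```python
import torch

def find_nonfinite_grads(model: torch.nn.Module):
    """Return names of parameters whose gradients contain NaN or Inf.

    Call right after loss.backward() and before optimizer.step(), so you
    can see which layer produces the gradient that triggers the
    "grad norm is nan" warning.
    """
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]

# Hypothetical placement inside the training loop:
#   loss.backward()
#   bad = find_nonfinite_grads(model)
#   if bad:
#       print("non-finite grads in:", bad)
```

If the loss itself (e.g. `loss_ctc`) is already `inf`/`nan` on some batches, the gradients will be too, which points at the data rather than the optimizer settings.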
What have you tried?
My training script:

```bash
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
# MIT License (https://opensource.org/licenses/MIT)
workspace=`pwd`
# which gpu to train or finetune
# export CUDA_VISIBLE_DEVICES="0,1"
# export CUDA_VISIBLE_DEVICES="0"
export CUDA_VISIBLE_DEVICES="0"
# gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
gpu_num=1
# model_name from model_hub, or model_dir in local path
## option 1, download model automatically
model_name_or_model_dir="iic/SenseVoiceSmall"
## option 2, download model by git
#local_path_root=${workspace}/modelscope_models
#mkdir -p ${local_path_root}/${model_name_or_model_dir}
#git clone https://www.modelscope.cn/${model_name_or_model_dir}.git ${local_path_root}/${model_name_or_model_dir}
#model_name_or_model_dir=${local_path_root}/${model_name_or_model_dir}
# data dir, which contains: train.json, val.json
train_data=${workspace}/data/scpllm_train.jsonl
val_data=${workspace}/data/scpllm_val.jsonl
# exp output dir
output_dir="./outputs_hotword"
log_file="${output_dir}/log.txt"
deepspeed_config=${workspace}/deepspeed_conf/ds_stage1.json
mkdir -p ${output_dir}
echo "log_file: ${log_file}"
DISTRIBUTED_ARGS="
--nnodes ${WORLD_SIZE:-1} \
--nproc_per_node $gpu_num \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-26669}
"
echo $DISTRIBUTED_ARGS
# funasr trainer path
if [ -f `dirname $(which funasr)`/train_ds.py ]; then
train_tool=`dirname $(which funasr)`/train_ds.py
elif [ -f `dirname $(which funasr)`/../lib/python*/site-packages/funasr/bin/train_ds.py ]; then
train_tool=`dirname $(which funasr)`/../lib/python*/site-packages/funasr/bin/train_ds.py
else
echo "Error: train_ds.py not found in funasr bin directory."
train_tool=/home/ma-user/work/FunASR/funasr/bin/train_ds.py
# exit 1
fi
ABSOLUTE_PATH=$(cd $(dirname $train_tool); pwd)
train_tool=${ABSOLUTE_PATH}/train_ds.py
echo "Using funasr trainer: ${train_tool}"
torchrun $DISTRIBUTED_ARGS \
${train_tool} \
++model="${model_name_or_model_dir}" \
++trust_remote_code=true \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.data_split_num=1 \
++dataset_conf.batch_sampler="BatchSampler" \
++dataset_conf.batch_size=9000 \
++dataset_conf.sort_size=1024 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=0 \
++train_conf.max_epoch=100 \
++train_conf.log_interval=1 \
++train_conf.resume=true \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=3 \
++train_conf.avg_nbest_model=3 \
++train_conf.use_deepspeed=false \
++train_conf.use_fp16=false \
++optim=adamw \
++optim_conf.betas=[0.9,0.98] \
++optim_conf.weight_decay=0 \
++optim_conf.eps=1e-9 \
++train_conf.grad_clip=0.5 \
++scheduler_conf.warmup_steps=1000 \
++train_conf.deepspeed_config=${deepspeed_config} \
++optim_conf.lr=0.0003 \
++output_dir="${output_dir}" &> ${log_file}
# ++optim_conf.lr=0.0002 \
# ++dataset_conf.max_token_length=100 \
# ++train_conf.lr_scheduler="cosine" \
```
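Since the model trains with a CTC loss (`loss_ctc` in the log), one common source of `inf` loss and NaN gradients is samples whose target token length exceeds the encoder output length after subsampling, or whose target is empty. A rough scan of the training jsonl is sketched below; the field names (`source_len`, `target_len`, `key`) are assumed from FunASR's `scp2jsonl` output and the subsampling factor is an assumption, so adjust both to your data and model config:

```python
import json

SUBSAMPLE = 6  # assumed encoder downsampling factor; check your model config
JSONL = "data/scpllm_train.jsonl"  # path from the training script above

with open(JSONL, encoding="utf-8") as f:
    for lineno, raw in enumerate(f, 1):
        item = json.loads(raw)
        # source_len units depend on how the jsonl was generated;
        # treat this purely as a heuristic filter.
        in_len = item.get("source_len", 0) // SUBSAMPLE
        tgt_len = item.get("target_len", 0)
        if tgt_len == 0 or in_len < tgt_len:
            print(f"line {lineno}: key={item.get('key')} "
                  f"input~{in_len} vs target {tgt_len} tokens")
```

Dropping or fixing any flagged samples (or lowering `optim_conf.lr` / raising `scheduler_conf.warmup_steps`) is a common first mitigation. Since this runs on an Ascend NPU rather than CUDA, backend numerical differences are also worth considering, but ruling out a data issue first is cheap.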
What's your environment?
- OS (e.g., Linux): Linux, ARM 24 cores, 192 GB RAM, Ascend 1*ascend-snt9b1
- FunASR Version (e.g., 1.0.0): 1.2.7
- ModelScope Version (e.g., 1.11.0): 1.25.0
- PyTorch Version (e.g., 2.0.0): 2.3.1
- How you installed funasr (pip, source): source
- Python version: Python 3.10.0
- GPU (e.g., V100M32): ascend-snt9b1 NPU
- CUDA/cuDNN version (e.g., cuda11.7):
- Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1):
- Any other relevant information: