57 commits
- cca562d  migrate from speech llm (yuekaizhang, Feb 26, 2025)
- e6897b1  make asr decode results align (yuekaizhang, Feb 26, 2025)
- 6b69276  add training stage (yuekaizhang, Apr 11, 2025)
- 202d764  remove text norm (yuekaizhang, Apr 14, 2025)
- 1d11662  fix multi rounds data (yuekaizhang, Apr 14, 2025)
- 3ad075a  s2t training (yuekaizhang, Apr 15, 2025)
- 0c02da8  refine decoding method (yuekaizhang, Apr 15, 2025)
- 458d697  fix batch_size>1 decoding bug (yuekaizhang, Apr 15, 2025)
- bdb60f6  add codec lm (yuekaizhang, Apr 21, 2025)
- b305cda  fix padding side (yuekaizhang, Apr 21, 2025)
- 7db4005  add flash attn support (yuekaizhang, Apr 21, 2025)
- 09d81b4  change padding side name (yuekaizhang, Apr 21, 2025)
- 23fdef2  add codec decode (yuekaizhang, Apr 21, 2025)
- 478d56e  fix bugs when padding right (yuekaizhang, Apr 23, 2025)
- 2e9be46  debug (yuekaizhang, Apr 24, 2025)
- 3642dfd  refactor code (yuekaizhang, Apr 25, 2025)
- 6955639  add qwen omni web demo (yuekaizhang, Apr 25, 2025)
- 6ea7ec8  remove offline tab (yuekaizhang, Apr 25, 2025)
- 9a07363  remove unsed (yuekaizhang, Apr 25, 2025)
- 72addd4  change place (yuekaizhang, Apr 25, 2025)
- 47920c2  add gradio demo (yuekaizhang, Apr 25, 2025)
- 71a0a44  add history cache (yuekaizhang, Apr 25, 2025)
- d742043  refactor decode part (yuekaizhang, Apr 25, 2025)
- 448a4ee  update hf dataset loading into lhotse (yuekaizhang, Apr 29, 2025)
- 360f0aa  update README (yuekaizhang, Apr 29, 2025)
- 11bd3c9  lint (yuekaizhang, Apr 29, 2025)
- 08be51a  change pic (yuekaizhang, Apr 29, 2025)
- 2dd40b6  add vocalnet en data (yuekaizhang, May 8, 2025)
- 7cc366d  add en data, cosy2 token for training (yuekaizhang, May 8, 2025)
- e41c1ca  add dependency (yuekaizhang, May 8, 2025)
- 37db659  remove k2 dependency (May 8, 2025)
- bd2df57  add debug script (yuekaizhang, May 8, 2025)
- b20a0d0  add on the fly feature (yuekaizhang, May 9, 2025)
- 89781b9  add cosyvoice2 decode (yuekaizhang, May 12, 2025)
- cbf3af3  add voicebench eval (yuekaizhang, May 13, 2025)
- e657258  fix mmsu (yuekaizhang, May 13, 2025)
- f81363d  add speech continuation pretraining (yuekaizhang, May 15, 2025)
- bfb4ebe  remove triton (yuekaizhang, May 15, 2025)
- 0e8c1db  fix speed perturb issue (yuekaizhang, May 16, 2025)
- e52581e  support local_rank for multi-node (yuekaizhang, May 16, 2025)
- 4a29430  add loss type (yuekaizhang, May 19, 2025)
- 50fc1ab  add multi-node (yuekaizhang, May 19, 2025)
- 9cdd393  add server url (yuekaizhang, May 20, 2025)
- ca84aff  remove cosyvoice lib (yuekaizhang, May 20, 2025)
- 7aa6c80  add multi gpu processing (yuekaizhang, May 22, 2025)
- 7a12d88  update (yuekaizhang, May 22, 2025)
- 9fff18e  refactor code (yuekaizhang, May 23, 2025)
- dd858f0  support instruct s2s (yuekaizhang, May 23, 2025)
- e6e1f3f  add tts stage (yuekaizhang, May 23, 2025)
- 39700d5  refactor train to reuse code (yuekaizhang, May 27, 2025)
- 1281d7a  add tts training (yuekaizhang, May 27, 2025)
- 5a7c72c  add tts task decode (yuekaizhang, May 27, 2025)
- 49256fa  fix tts stage decode (yuekaizhang, May 28, 2025)
- 4c0396f  support text2speech ultrachat (yuekaizhang, Jun 3, 2025)
- 5becf69  remove concat three items (yuekaizhang, Jun 3, 2025)
- 80677a5  remove stats (yuekaizhang, Jun 3, 2025)
- 559f9e2  fix repeat bos and pad id (yuekaizhang, Jun 4, 2025)
egs/speech_llm/SPEECH2SPEECH/README.md (new file, +55 lines)

# Introduction

This recipe includes scripts for training speech2speech models.

# SPEECH2SPEECH

The following table lists the folders for different tasks.

| Recipe | Speech Input | Speech Output | Comment |
|--------|--------------|---------------|---------|
| Qwen-omni like | Continuous embeddings | CosyVoice 1 50 Hz single-codebook tokens | Text-driven; a Thinker LLM generates the text tokens and a small Talker LLM generates the speech tokens |

### [Qwen-omni like Speech2speech Recipe](./qwen_omni)

A [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni)-style model trained on the [worstchan/Belle_1.4M-SLAM-Omni](https://huggingface.co/datasets/worstchan/Belle_1.4M-SLAM-Omni) dataset.

<br>
<p align="center">
<img src="assets/framework.png" width="800"/>
</p>
<br>

The command for training is:
```bash
torchrun --nproc_per_node $ngpu ./qwen_omni/train.py \
--max-duration 50 \
--enable-musan False \
--exp-dir $exp_dir \
--speech-encoder-path-or-name models/whisper/v1.1/whisper-large-v2-multi-hans-zh-epoch-3-avg-10.pt \
--llm-path-or-name Qwen/Qwen2.5-0.5B-Instruct \
--manifest-dir data/fbank \
--deepspeed \
--deepspeed_config ./qwen_omni/ds_config_zero1.json \
--use-flash-attn True \
--use-lora True --unfreeze-llm True --unfreeze-speech-projector True --enable-speech-output True
```
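
If a run is interrupted, it can be resumed by passing the latest saved model and sampler state back to `train.py`, as `exp.sh` in this recipe does. A minimal sketch, assuming checkpoints are written as `checkpoint-<step>` directories under `$exp_dir` and that `$train_cmd_args` holds the training flags shown above:

```bash
# Resume from a saved checkpoint (flag names as used in exp.sh).
step=10000  # illustrative: replace with your latest saved step
torchrun --nproc_per_node $ngpu ./qwen_omni/train.py \
  $train_cmd_args \
  --pretrained-model-path $exp_dir/checkpoint-${step}/pytorch_model.bin \
  --sampler-state-dict-path $exp_dir/checkpoint-${step}/sampler.pt
```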

The command for decoding is:
```bash
python3 ./qwen_omni/decode.py \
--max-duration 1 \
--exp-dir $exp_dir \
--speech-encoder-path-or-name models/whisper/v1.1/whisper-large-v2-multi-hans-zh-epoch-3-avg-10.pt \
--llm-path-or-name models/Qwen2.5-0.5B-Instruct \
--epoch 999 --avg 1 \
--manifest-dir data/fbank \
--use-flash-attn True \
--method e2e-epoch10_speech2speech \
--enable-speech-output True \
--token2wav-path models/CosyVoice-300M-SFT \
--use-lora True
```
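
The `--token2wav-path` argument points to a local CosyVoice checkpoint. A sketch of one way to fetch it, assuming the model is mirrored on Hugging Face under `FunAudioLLM/CosyVoice-300M-SFT` (adjust the repo id to wherever you host it):

```bash
# Assumed repo id; downloads the CosyVoice token2wav model to the path used above.
pip install -U "huggingface_hub[cli]"
huggingface-cli download FunAudioLLM/CosyVoice-300M-SFT --local-dir models/CosyVoice-300M-SFT
```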

Please see [`prepare.sh`](./prepare.sh) for more details.
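
As in other icefall recipes, `prepare.sh` is organized into numbered stages. An illustrative invocation, assuming the usual `--stage`/`--stop-stage` options (check the script for the actual stage numbers):

```bash
# Illustrative: run only the first two data-preparation stages.
bash prepare.sh --stage 0 --stop-stage 1
```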
egs/speech_llm/SPEECH2SPEECH/exp.sh (new file, +234 lines)
#!/usr/bin/env bash

# fix segmentation fault reported in https://github.com/k2-fsa/icefall/issues/674
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
export PYTHONPATH=$PYTHONPATH:/workspace/CosyVoice
# export HF_HOME="/lustre/fsw/general_sa/yuekaiz/.cache/huggingface"
set -eou pipefail

stage=$1
stop_stage=$2
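
# Usage: bash exp.sh <stage> <stop_stage>
# e.g. "bash exp.sh 17 17" runs only stage 17 (illustrative invocation; this
# file currently defines stages 17-19, with stages 20-21 kept commented out).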


log() {
# This function is from espnet
local fname=${BASH_SOURCE[1]##*/}
echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}

if [ $stage -le 17 ] && [ $stop_stage -ge 17 ]; then
echo "cd /workspace && ln -s /lustre/fsw/general_sa/yuekaiz/s2s slam && cd -"
if [ ! -L "/workspace/slam" ]; then
cd /workspace && ln -s /lustre/fsw/general_sa/yuekaiz/s2s slam && cd -
fi
log "stage 17: Training Speech2Speech Model, full parameters"
exp_dir=./qwen_omni/exp_speech2text_first_multi_en_continuation_second_three_s2s
pretrained_dir=./qwen_omni/exp_speech2text
ngpu=4

latest_checkpoint_step=-1
# Check if exp_dir exists and is a directory
if [ -d "$exp_dir" ]; then
# List directories matching checkpoint-* and find the one with the largest step number
for checkpoint_dir in $(ls -d $exp_dir/checkpoint-*/ 2>/dev/null | sort -V); do
checkpoint_name=$(basename "$checkpoint_dir") # e.g., checkpoint-1000
# Extract step number using parameter expansion
current_step=${checkpoint_name#checkpoint-}
# Ensure current_step is a number
if [[ "$current_step" =~ ^[0-9]+$ ]] && [ "$current_step" -gt "$latest_checkpoint_step" ]; then
latest_checkpoint_step=$current_step
fi
done
fi

train_cmd_args="--max-duration 200 \
--enable-musan False \
--exp-dir $exp_dir \
--last-stage-model-path $pretrained_dir/checkpoint-58548/pytorch_model.bin \
--speech-encoder-path-or-name models/large-v2.pt \
--llm-path-or-name models/Qwen2.5-0.5B-Instruct \
--on-the-fly-feats True --on-the-fly-speed-perturb False \
--deepspeed \
--huggingface-dataset-path-or-name /lustre/fsw/general_sa/yuekaiz/s2s \
--deepspeed_config ./qwen_omni/ds_config_zero1.json \
--use-flash-attn True \
--dataset vocalnet_ultrachat_voiceassistant_instruct_s2s --num-epochs 10 \
--use-lora True --unfreeze-llm True --unfreeze-speech-projector True --enable-speech-output False"

if [ "$latest_checkpoint_step" -ge 0 ]; then
log "Continuing training from checkpoint-$latest_checkpoint_step"
step=$latest_checkpoint_step
train_cmd_args="$train_cmd_args --pretrained-model-path $exp_dir/checkpoint-${step}/pytorch_model.bin --sampler-state-dict-path $exp_dir/checkpoint-${step}/sampler.pt"
else
log "Starting training from scratch as no checkpoint was found in $exp_dir"
# No pretrained model or sampler state dict needed for the first run
fi

torchrun --nproc_per_node $ngpu --nnodes $SLURM_JOB_NUM_NODES --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d --rdzv_id $SLURM_JOBID ./qwen_omni/train.py \
$train_cmd_args
fi

if [ $stage -le 18 ] && [ $stop_stage -ge 18 ]; then
echo "cd /workspace && ln -s /lustre/fsw/general_sa/yuekaiz/s2s slam && cd -"
# check if the link exists, if not exist, create it
if [ ! -L "/workspace/slam" ]; then
cd /workspace && ln -s /lustre/fsw/general_sa/yuekaiz/s2s slam && cd -
fi
log "stage 17: Training Speech2Speech Model, full parameters"
exp_dir=./qwen_omni/exp_speech2text_first_multi_en_continuation_second_three_s2s_librispeech
pretrained_dir=./qwen_omni/exp_speech2text
ngpu=4

latest_checkpoint_step=-1
# Check if exp_dir exists and is a directory
if [ -d "$exp_dir" ]; then
# List directories matching checkpoint-* and find the one with the largest step number
for checkpoint_dir in $(ls -d $exp_dir/checkpoint-*/ 2>/dev/null | sort -V); do
checkpoint_name=$(basename "$checkpoint_dir") # e.g., checkpoint-1000
# Extract step number using parameter expansion
current_step=${checkpoint_name#checkpoint-}
# Ensure current_step is a number
if [[ "$current_step" =~ ^[0-9]+$ ]] && [ "$current_step" -gt "$latest_checkpoint_step" ]; then
latest_checkpoint_step=$current_step
fi
done
fi

train_cmd_args="--max-duration 200 \
--enable-musan False \
--exp-dir $exp_dir \
--last-stage-model-path $pretrained_dir/checkpoint-58548/pytorch_model.bin \
--speech-encoder-path-or-name models/large-v2.pt \
--llm-path-or-name models/Qwen2.5-0.5B-Instruct \
--on-the-fly-feats True --on-the-fly-speed-perturb False \
--deepspeed \
--huggingface-dataset-path-or-name /lustre/fsw/general_sa/yuekaiz/s2s \
--deepspeed_config ./qwen_omni/ds_config_zero1.json \
--use-flash-attn True \
--dataset vocalnet_ultrachat_voiceassistant_instruct_s2s_librispeech --num-epochs 10 \
--use-lora True --unfreeze-llm True --unfreeze-speech-projector True --enable-speech-output False"

if [ "$latest_checkpoint_step" -ge 0 ]; then
log "Continuing training from checkpoint-$latest_checkpoint_step"
step=$latest_checkpoint_step
train_cmd_args="$train_cmd_args --pretrained-model-path $exp_dir/checkpoint-${step}/pytorch_model.bin --sampler-state-dict-path $exp_dir/checkpoint-${step}/sampler.pt"
else
log "Starting training from scratch as no checkpoint was found in $exp_dir"
# No pretrained model or sampler state dict needed for the first run
fi

torchrun --nproc_per_node $ngpu --nnodes $SLURM_JOB_NUM_NODES --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d --rdzv_id $SLURM_JOBID ./qwen_omni/train.py \
$train_cmd_args
fi

if [ $stage -le 19 ] && [ $stop_stage -ge 19 ]; then
log "stage 19: Training TTS Model"
# Earlier experiment directories, kept for reference; only the last assignment is active.
# exp_dir=./qwen_omni/exp_tts_ultra_chat_voice_assistant
# exp_dir=./qwen_omni/exp_tts_emilia_en_tts_only_template
exp_dir=./qwen_omni/exp_tts_emilia_en_tts_three_concat
pretrained_dir=./qwen_omni/exp_speech2text
ngpu=4

latest_checkpoint_step=-1
# Check if exp_dir exists and is a directory
if [ -d "$exp_dir" ]; then
# List directories matching checkpoint-* and find the one with the largest step number
for checkpoint_dir in $(ls -d $exp_dir/checkpoint-*/ 2>/dev/null | sort -V); do
checkpoint_name=$(basename "$checkpoint_dir") # e.g., checkpoint-1000
# Extract step number using parameter expansion
current_step=${checkpoint_name#checkpoint-}
# Ensure current_step is a number
if [[ "$current_step" =~ ^[0-9]+$ ]] && [ "$current_step" -gt "$latest_checkpoint_step" ]; then
latest_checkpoint_step=$current_step
fi
done
fi
# --dataset ultra_chat_voice_assistant
train_cmd_args="--batch-size 30 \
--exp-dir $exp_dir \
--llm-path-or-name models/Qwen2.5-0.5B-Instruct \
--enable-speech-input False \
--deepspeed \
--dataset /lustre/fsw/general_sa/yuekaiz/s2s/VoxBox/manifests_emilia_en \
--deepspeed_config ./qwen_omni/ds_config_zero1.json \
--use-flash-attn True \
--num-epochs 3 \
--use-lora False --unfreeze-llm False --enable-speech-output True"

if [ "$latest_checkpoint_step" -ge 0 ]; then
log "Continuing training from checkpoint-$latest_checkpoint_step"
step=$latest_checkpoint_step
train_cmd_args="$train_cmd_args --pretrained-model-path $exp_dir/checkpoint-${step}/pytorch_model.bin --sampler-state-dict-path $exp_dir/checkpoint-${step}/sampler.pt"
else
log "Starting training from scratch as no checkpoint was found in $exp_dir"
# No pretrained model or sampler state dict needed for the first run
fi

torchrun --nproc_per_node $ngpu --nnodes $SLURM_JOB_NUM_NODES --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d --rdzv_id $SLURM_JOBID ./qwen_omni/train_tts.py \
$train_cmd_args
fi


# if [ $stage -le 20 ] && [ $stop_stage -ge 20 ]; then
# log "stage 20: Training TTS Model"
# echo "cd /workspace && ln -s /lustre/fsw/general_sa/yuekaiz/s2s slam && cd -"
# if [ ! -L "/workspace/slam" ]; then
# cd /workspace && ln -s /lustre/fsw/general_sa/yuekaiz/s2s slam && cd -
# fi
# exp_dir=./qwen_omni/exp_test
# ngpu=4

# latest_checkpoint_step=-1
# # Check if exp_dir exists and is a directory
# if [ -d "$exp_dir" ]; then
# # List directories matching checkpoint-* and find the one with the largest step number
# for checkpoint_dir in $(ls -d $exp_dir/checkpoint-*/ 2>/dev/null | sort -V); do
# checkpoint_name=$(basename "$checkpoint_dir") # e.g., checkpoint-1000
# # Extract step number using parameter expansion
# current_step=${checkpoint_name#checkpoint-}
# # Ensure current_step is a number
# if [[ "$current_step" =~ ^[0-9]+$ ]] && [ "$current_step" -gt "$latest_checkpoint_step" ]; then
# latest_checkpoint_step=$current_step
# fi
# done
# fi

# train_cmd_args="--max-duration 150 \
# --enable-musan False \
# --exp-dir $exp_dir \
# --speech-encoder-path-or-name models/large-v2.pt \
# --llm-path-or-name Qwen/Qwen2.5-0.5B-Instruct \
# --dataset vocalnet_ultrachat_voiceassistant \
# --manifest-dir data/fbank \
# --deepspeed \
# --deepspeed_config ./qwen_omni/ds_config_zero1.json \
# --use-flash-attn True --on-the-fly-feats True \
# --use-lora True --unfreeze-llm True --unfreeze-speech-projector True --enable-speech-output True"

# if [ "$latest_checkpoint_step" -ge 0 ]; then
# log "Continuing training from checkpoint-$latest_checkpoint_step"
# step=$latest_checkpoint_step
# train_cmd_args="$train_cmd_args --pretrained-model-path $exp_dir/checkpoint-${step}/pytorch_model.bin --sampler-state-dict-path $exp_dir/checkpoint-${step}/sampler.pt"
# else
# log "Starting training from scratch as no checkpoint was found in $exp_dir"
# # No pretrained model or sampler state dict needed for the first run
# fi

# torchrun --nproc_per_node $ngpu --nnodes $SLURM_JOB_NUM_NODES --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d --rdzv_id $SLURM_JOBID ./qwen_omni/train.py \
# $train_cmd_args
# fi


# if [ $stage -le 21 ] && [ $stop_stage -ge 21 ]; then
# log "stage 21: TTS Decoding Test Set"
# exp_dir=./qwen_omni/exp_tts
# torchrun --nproc_per_node=2 ./qwen_omni/decode_tts.py \
# --exp-dir $exp_dir \
# --speech-encoder-path-or-name models/large-v2.pt \
# --llm-path-or-name models/Qwen2.5-0.5B-Instruct \
# --pretrained-model-path $exp_dir/checkpoint-32001/pytorch_model.bin \
# --use-flash-attn True \
# --enable-speech-output True \
# --token2wav-path /workspace/CosyVoice2-0.5B \
# --use-lora True
# fi