Conversation

yuekaizhang (Contributor)
This PR adds support for the streaming DiT Token2Wav module from https://github.com/stepfun-ai/Step-Audio2.

The following results were obtained by decoding 26 sentences (172 seconds of audio in total) on a single L20 GPU with the yuekai/seed_tts_cosy2 dataset.

Offline TTS (Cosyvoice2 0.5B LLM + StepAudio2 DiT Token2Wav)

| Backend | Batch Size | llm_time_seconds | total_time_seconds | RTF |
| --- | --- | --- | --- | --- |
| TRTLLM | 16 | 2.01 | 5.03 | 0.0292 |
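For reference, the RTF column is total synthesis time divided by total audio duration: 5.03 s / 172 s ≈ 0.0292.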

Streaming TTS (Cosyvoice2 0.5B LLM + StepAudio2 DiT Token2Wav): First-Chunk Latency

| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
| --- | --- | --- | --- | --- | --- |
| 1 | 197.50 | 196.13 | 214.65 | 215.96 | 229.21 |
| 2 | 281.15 | 278.20 | 345.18 | 361.79 | 395.97 |
| 4 | 510.65 | 530.50 | 630.13 | 642.44 | 666.65 |
| 6 | 921.54 | 918.86 | 1079.97 | 1265.22 | 1524.41 |
| 8 | 1019.95 | 1085.26 | 1371.05 | 1402.24 | 1410.66 |
| 10 | 1214.98 | 1293.54 | 1575.36 | 1654.51 | 2161.76 |
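For context, first-chunk latency is the time from issuing a request to receiving the first audio chunk. The statistics above can be computed from per-request measurements along these lines (a minimal sketch, not the benchmark client from this PR; the chunk iterator is assumed to come from the streaming server):

```python
import time

import numpy as np


def first_chunk_latency_ms(chunk_iter):
    """Time from issuing the request until the first audio chunk arrives."""
    start = time.perf_counter()
    next(chunk_iter)  # blocks until the server yields the first chunk
    return (time.perf_counter() - start) * 1000.0


def summarize(latencies_ms):
    """Same statistics as the table above."""
    a = np.asarray(latencies_ms, dtype=float)
    return {
        "avg": a.mean(),
        "p50": np.percentile(a, 50),
        "p90": np.percentile(a, 90),
        "p95": np.percentile(a, 95),
        "p99": np.percentile(a, 99),
    }
```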

boji123 (Contributor) commented on Oct 13, 2025

A clever implementation: it reduces the first-chunk latency by setting inference priorities.
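(For readers unfamiliar with the trick: one common way to express such a priority on a single GPU is CUDA stream priorities, as in the PyTorch sketch below. This illustrates the general idea only and is not necessarily the exact mechanism used in this PR.)

```python
import torch

# CUDA streams with lower priority numbers are scheduled first (-1 beats 0),
# so latency-critical first-chunk work can jump ahead of bulk decoding.
high_prio = torch.cuda.Stream(priority=-1)  # first-chunk / streaming path
low_prio = torch.cuda.Stream(priority=0)    # remaining chunks / offline batches

x = torch.randn(16, 1024, device="cuda")

with torch.cuda.stream(high_prio):
    first_chunk = x @ x.T  # stand-in for the first Token2Wav chunk

with torch.cuda.stream(low_prio):
    rest = x @ x.T  # stand-in for lower-priority work

torch.cuda.synchronize()
```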

yuekaizhang (Contributor, Author)

Disaggregated deployment, using one L20 GPU for the LLM.

First-chunk latency:

| token2wav_num_gpu | concurrent_tasks_per_gpu | avg (ms) | p50 (ms) | p90 (ms) | p99 (ms) |
| --- | --- | --- | --- | --- | --- |
| 3 | 3.00 | 308.09 | 275.48 | 385.22 | 521.45 |
| 2 | 4.00 | 403.48 | 394.80 | 481.24 | 507.75 |
| 3 | 6.00 | 538.23 | 508.33 | 687.62 | 736.96 |
| 2 | 8.00 | 748.31 | 753.94 | 873.59 | 1007.14 |
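In this setup the LLM and the Token2Wav workers run on separate GPUs, so requests have to be fanned out across the Token2Wav pool. A minimal round-robin sketch of that dispatch, with hypothetical worker objects (`to_wav` is an assumed method, not the PR's actual API):

```python
import itertools


class Token2WavPool:
    """Round-robin dispatch across per-GPU Token2Wav workers (illustrative only)."""

    def __init__(self, workers):
        # One worker per GPU, i.e. token2wav_num_gpu entries.
        self._cycle = itertools.cycle(workers)

    def synthesize(self, speech_tokens):
        # Hypothetical worker API: each worker owns one GPU and converts
        # LLM speech tokens into an audio chunk.
        return next(self._cycle).to_wav(speech_tokens)
```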
