Description
Dear contributors,
When using the triton_trtllm deployment, numbers are frequently misread. The problem occurs with multi-digit numbers or when digits are combined with English letters, such as "5000" or "5g". The error rate in these cases is extremely high.
However, if Arabic numerals are converted into Chinese characters, the reading works correctly. For example, "5千" or "五g" are read properly.
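Based on the observation above, one possible stopgap is to rewrite digits as Chinese numerals before the text is sent to the server. The following is only a minimal sketch of that idea, not part of CosyVoice or the Triton client: the function names (`int_to_chinese`, `normalize_numbers`) are hypothetical, and it only handles plain runs of ASCII digits (no decimals, dates, or phone numbers). I have only confirmed that "5千" and "五g" are read correctly, so whether fully spelled-out forms behave the same is an assumption.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]   # units inside one 4-digit group
BIG_UNITS = ["", "万", "亿"]     # units between 4-digit groups


def _group_to_chinese(n: int) -> str:
    """Convert 0 < n < 10000, e.g. 5499 -> 五千四百九十九."""
    out = []
    pending_zero = False
    for pos in range(3, -1, -1):
        d = (n // 10 ** pos) % 10
        if d == 0:
            # Remember interior zeros; trailing zeros are never emitted.
            if out:
                pending_zero = True
            continue
        if pending_zero:
            out.append("零")
            pending_zero = False
        out.append(DIGITS[d] + UNITS[pos])
    return "".join(out)


def int_to_chinese(n: int) -> str:
    """Convert 0 <= n < 10**12 to its Chinese reading."""
    if n == 0:
        return "零"
    groups = []
    while n > 0:
        groups.append(n % 10000)
        n //= 10000
    out = []
    for i in range(len(groups) - 1, -1, -1):
        g = groups[i]
        if g == 0:
            continue
        # A group below 1000 that follows a higher group needs a 零.
        if out and g < 1000:
            out.append("零")
        out.append(_group_to_chinese(g) + BIG_UNITS[i])
    s = "".join(out)
    # 10..19 are usually read 十.. rather than 一十..
    return s[1:] if s.startswith("一十") else s


def normalize_numbers(text: str) -> str:
    """Replace each run of ASCII digits with Chinese numerals."""
    def repl(m: re.Match) -> str:
        run = m.group(0)
        # Leading zeros or very long runs: read digit by digit.
        if len(run) > 12 or (len(run) > 1 and run[0] == "0"):
            return "".join(DIGITS[int(c)] for c in run)
        return int_to_chinese(int(run))
    return re.sub(r"[0-9]+", repl, text)


if __name__ == "__main__":
    print(normalize_numbers("现在只要5499块钱就能买到最新的5g手机。"))
```

The normalized string would then be passed as --target-text in place of the original. This only works around the symptom; the server-side text normalization presumably still needs a fix.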
Steps to Reproduce:
Follow the instructions in README.md and use the Docker image (thanks to @yuekaizhang for providing the image):
soar97/triton-cosyvoice:25.06
1. Execute steps 1–3 in run.sh to successfully start triton_server.
2. In step 4 of run.sh, modify the following parameters:
--reference-audio → asset/zero_shot_prompt.wav (file in project root directory)
--reference-text → "希望你以后,能够做的比我过的还好呦!"
--target-text → "现在只要5499块钱就能买到最新的5g手机。"
(Ensure that the target text contains continuous numbers or a combination of numbers and English letters.)
3. Execute step 4 in run.sh to generate audio.
The issue should be reproducible under these conditions (output audio files attached):
output.wav (--target-text → "现在只要5499块钱就能买到最新的5g手机。")
output_2.wav (--target-text → "现在只要6000块钱就能买到最新的4g手机。")
Expected Behavior:
Numbers and combinations of numbers with English letters should be read out correctly, with no mispronunciations or errors.
Actual Behavior:
When reading multi-digit numbers (e.g., "5000") or digits combined with English letters (e.g., "5g"), the output very often contains mispronunciations or other errors.
Hardware Configuration:
CPU: Intel Core i5-12490F
GPU: NVIDIA GeForce RTX 4070 Ti SUPER
Environment:
OS: Ubuntu Server 22.04 LTS
NVIDIA Driver: 580.65.06
Thanks for any advice!