[CosyVoice2] LLM incorrectly synthesizes prompt_text in zero-shot mode #1597

## Bug Description
When calling CosyVoice2-0.5B's `inference_zero_shot()` with a `prompt_text` argument, the LLM incorrectly synthesizes the content of `prompt_text` as part of the speech output, instead of using it only as a conditioning signal for the voice style being cloned.

## Steps to Reproduce
1. Call `inference_zero_shot()` with:
   - `tts_text`: "我是通义实验室语音团队全新推出的生成式语音大模型，提供舒适自然的语音合成能力。" (37 characters)
   - `prompt_text`: "第四十万，嗯，不要小看我们直播间好吧..." (101 tokens)
   - Reference audio: 5.74 seconds
2. The generated audio is about 28.52 seconds long (expected: 10-12 seconds).
3. The audio contains the content of both `tts_text` and `prompt_text`.
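The duration in step 2 can be measured with a small helper along these lines. This is a minimal sketch, not the project's own tooling: `total_duration_seconds` is a hypothetical name, and it assumes that `inference_zero_shot()` yields dicts whose `'tts_speech'` entry has shape `(1, num_samples)` and that the model object exposes `sample_rate`, as in the project's README examples.

```python
def total_duration_seconds(model, tts_text, prompt_text, prompt_speech_16k):
    """Sum the length of every chunk yielded by inference_zero_shot(), in seconds.

    `model` is assumed to behave like a CosyVoice2 instance: it yields dicts
    with a 'tts_speech' tensor of shape (1, num_samples) and has `sample_rate`.
    """
    total_samples = 0
    for chunk in model.inference_zero_shot(
        tts_text, prompt_text, prompt_speech_16k, stream=False
    ):
        # Only the sample count is needed, so any object with .shape works here.
        total_samples += chunk["tts_speech"].shape[1]
    return total_samples / model.sample_rate


# Assumed usage, following the README (paths are placeholders):
# from cosyvoice.cli.cosyvoice import CosyVoice2
# from cosyvoice.utils.file_utils import load_wav
# model = CosyVoice2("pretrained_models/CosyVoice2-0.5B")
# prompt_speech_16k = load_wav("reference.wav", 16000)
# print(total_duration_seconds(model, tts_text, prompt_text, prompt_speech_16k))
```

Passing the model in as a parameter keeps the helper independent of how the model is loaded, so the same function can be reused for both the failing and the working configuration.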

## Expected Behavior
`prompt_text` should only guide the voice style and should not be synthesized as speech.

## Actual Behavior
The `llm_job()` method in `model.py` passes `prompt_text` to `llm.inference()`, which causes the LLM to treat it as text to be synthesized.

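## Comparison Test
- **Cross-lingual cloning mode** (`prompt_text` empty): generates 11.08 seconds of audio ✅ normal
- **3s fast cloning mode** (3s极速复刻, `prompt_text` has 101 tokens): generates 28.52 seconds of audio ❌ abnormal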
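To put numbers on the gap between the two modes, the durations can be compared directly. The sketch below is illustrative only: the helper names and the 1.5x threshold are assumptions chosen for this example, not values taken from CosyVoice.

```python
def duration_ratio(observed_seconds, baseline_seconds):
    """Return how many times longer the observed audio is than the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return observed_seconds / baseline_seconds


def is_duration_anomalous(observed_seconds, baseline_seconds, max_ratio=1.5):
    """Flag outputs much longer than the baseline (max_ratio is an assumed threshold)."""
    return duration_ratio(observed_seconds, baseline_seconds) > max_ratio


baseline = 11.08   # cross-lingual mode, prompt_text empty
observed = 28.52   # 3s fast cloning mode, prompt_text with 101 tokens

excess_seconds = observed - baseline          # about 17.44 s of extra audio
ratio = duration_ratio(observed, baseline)    # about 2.57x the baseline
print(f"excess={excess_seconds:.2f}s ratio={ratio:.2f} "
      f"anomalous={is_duration_anomalous(observed, baseline)}")
```

With the figures reported here, the zero-shot output is roughly 2.6 times the length of the cross-lingual output for the same `tts_text`.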

## Temporary Workaround
Add the following to the `llm_job()` method in `cosyvoice/cli/model.py`:

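```python
def llm_job(self, text, prompt_text, llm_prompt_speech_token, llm_embedding, uuid):
    # Temporary workaround: replace prompt_text with an empty tensor so the LLM never sees it.
    if prompt_text.shape[1] > 0:
        logging.info(f"[LLM Job] prompt_text detected (shape={prompt_text.shape}); forcibly cleared")
        # Match the dtype and device of the original tensor.
        prompt_text = torch.zeros(1, 0, dtype=prompt_text.dtype, device=prompt_text.device)

    # Continue with the original flow...
```

After the fix, the audio length returns to normal (11.08 seconds), and the RTF drops from 1.06 to 0.49.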
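For context on the RTF figures: the real-time factor is commonly defined as synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below assumes that definition and assumes the reported RTFs were measured against the 28.52 s and 11.08 s outputs; the function names are hypothetical.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def implied_synthesis_seconds(rtf, audio_seconds):
    """Invert the RTF definition to recover the wall-clock synthesis time."""
    return rtf * audio_seconds


# Assumption: each RTF was measured on the audio length reported for that run.
before_fix = implied_synthesis_seconds(1.06, 28.52)  # roughly 30 s of compute
after_fix = implied_synthesis_seconds(0.49, 11.08)   # roughly 5.4 s of compute
print(f"before: {before_fix:.1f}s, after: {after_fix:.1f}s")
```

Under those assumptions, the workaround reduces both the amount of audio generated and the compute spent per second of audio.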

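## Environment
- Model: CosyVoice2-0.5B
- Mode: Zero-shot voice cloning (3s fast cloning, 3s极速复刻)
- PyTorch: 2.5.0
- CUDA: 12.4
- OS: Windows

## Suggested Fix Direction
Check how `llm.inference()` handles the `prompt_text` parameter, and make sure it is used only as a conditioning signal and is not included in the generated sequence.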
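The distinction between "conditioning signal" and "generated sequence" can be illustrated with a toy autoregressive loop. This is a generic sketch and not CosyVoice's actual `llm.inference()` implementation: `generate_after_prefix` and the stand-in `next_token` callable are hypothetical. The point is only that prefix tokens influence every step through the context but never appear in the returned output.

```python
from typing import Callable, List, Sequence


def generate_after_prefix(
    prefix: Sequence[int],
    next_token: Callable[[List[int]], int],
    eos: int,
    max_new_tokens: int,
) -> List[int]:
    """Decode autoregressively, returning only tokens produced after the prefix."""
    context = list(prefix)      # the prefix conditions every decoding step
    generated: List[int] = []   # but only new tokens are collected here
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == eos:
            break
        context.append(token)
        generated.append(token)
    return generated


# Toy stand-in for a model: emits the current context length, then EOS (-1).
def toy_next_token(context: List[int]) -> int:
    return len(context) if len(context) < 5 else -1


output = generate_after_prefix([10, 11, 12], toy_next_token, eos=-1, max_new_tokens=20)
print(output)
```

In this toy setup the prefix tokens `10, 11, 12` shape what is generated but are absent from `output`, which is the behavior the issue expects for `prompt_text`.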


