
Streaming RPC inference test with TensorRT-LLM in Docker fails with CUDA handle / context errors #1599

@salier

Description

As the title says: WSL, Ubuntu 24.04, RTX 4090 24 GB, CUDA 12.9.
The code is unmodified. Running the streaming test over RPC fails with a CUDA handle / context error.
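
For context, the test drives Triton's gRPC streaming endpoint roughly as in the sketch below (a minimal illustration, assuming `tritonclient` is installed and the server from the log listens on localhost:8001; the model name "cosyvoice2" comes from the server's model table further down, but the input name and shape are placeholders, since the actual tensors are defined by the client script in runtime/triton_trtllm):

```python
# Minimal sketch of a Triton gRPC streaming request. Illustrative only:
# input name/shape are assumptions, not the repo's real client tensors.
import queue
import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Streamed partial results (or errors) arrive asynchronously here.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")

text = np.array([["你好,请问你叫什么?".encode("utf-8")]], dtype=np.object_)
inp = grpcclient.InferInput("text", text.shape, "BYTES")
inp.set_data_from_numpy(text)

client.start_stream(callback=callback)
client.async_stream_infer(model_name="cosyvoice2", inputs=[inp], request_id="task-0")
first = responses.get()   # with this bug, an InferenceServerException lands here
client.stop_stream()
print(first)
```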
Below is the initialization + server backend log:

(Cenv) serial@sariel:/workserver/TensorRTC$ docker compose up -d
WARN[0000] The "MODEL_ID" variable is not set. Defaulting to a blank string.
[+] Running 1/1
✔ Container tensorrtc-tts-1 Started 0.3s
(Cenv) serial@sariel:/workserver/TensorRTC$ docker compose up
WARN[0000] The "MODEL_ID" variable is not set. Defaulting to a blank string.
[+] Running 1/1
✔ Container tensorrtc-tts-1 Running 0.0s
Attaching to tts-1
tts-1 | Submodule 'third_party/Matcha-TTS' (https://github.com/shivammehta25/Matcha-TTS.git) registered for path 'third_party/Matcha-TTS'
tts-1 | Cloning into '/workspace/CosyVoice/third_party/Matcha-TTS'...
tts-1 | Submodule path 'third_party/Matcha-TTS': checked out 'dd9105b34bf2be2230f4aa1e4769fb586a3c824e'
tts-1 | Downloading CosyVoice2-0.5B
Fetching 10 files: 0%| | 0/10 [00:00<?, ?it/s]Downloading 'model.safetensors' to 'cosyvoice2_llm/.cache/huggingface/download/xGOKKLRSlIhH692hSVvI1-gpoa8=.91d53bdc8f6bf5752f40e1400305a64f6c8a8e335336ea0f5d5eaac8da974050.incomplete'
tts-1 | Downloading 'tokenizer.json' to 'cosyvoice2_llm/.cache/huggingface/download/HgM_lKo9sdSCfRtVg7MMFS7EKqo=.bf18af7d528d3912841a5cc768a7948f7c8565b2f6dbf8ee3d02e6cf58df98fc.incomplete'
tts-1 | Downloading 'merges.txt' to 'cosyvoice2_llm/.cache/huggingface/download/PtHk0z_I45atnj23IIRhTExwT3w=.6ed63830772e0c3879f54f26b056a9b2bf5ad8f4.incomplete'
tts-1 | Downloading 'generation_config.json' to 'cosyvoice2_llm/.cache/huggingface/download/3EVKVggOldJcKSsGjSdoUCN1AyQ=.2845e56f291aa8ab6c1d5db1509a78ebe5f809e5.incomplete'
tts-1 | Downloading 'special_tokens_map.json' to 'cosyvoice2_llm/.cache/huggingface/download/ahkChHUJFxEmOdq5GDFEmerRzCY=.b023bc245a90c287c1c2e3459d1c9f7e28eb1bea.incomplete'
tts-1 | Downloading 'added_tokens.json' to 'cosyvoice2_llm/.cache/huggingface/download/SeqzFlf9ZNZ3or_wZAOIdsM3Yxw=.969a9654d282db3f5763b5a8129cd6653428e486.incomplete'
tts-1 | Downloading '.gitattributes' to 'cosyvoice2_llm/.cache/huggingface/download/wPaCkH-WbT7GsmxMKKrNZTV4nSM=.52373fe24473b1aa44333d318f578ae6bf04b49b.incomplete'
tts-1 | Downloading 'config.json' to 'cosyvoice2_llm/.cache/huggingface/download/8_PA_wEVGiVa2goH2H4KQOQpvVY=.cfb66b8d1dd57d8f40492c0da59d727029542520.incomplete'
tts-1 | Download complete. Moving file to cosyvoice2_llm/.gitattributes
Fetching 10 files: 10%|█ | 1/10 [00:08<01:13, 8.11s/it]Download complete. Moving file to cosyvoice2_llm/special_tokens_map.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/config.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/added_tokens.json
Fetching 10 files: 20%|██ | 2/10 [00:08<00:27, 3.45s/it]Download complete. Moving file to cosyvoice2_llm/generation_config.json
Fetching 10 files: 40%|████ | 4/10 [00:08<00:08, 1.37s/it]Downloading 'vocab.json' to 'cosyvoice2_llm/.cache/huggingface/download/j3m-Hy6QvBddw8RXA1uSWl1AJ0c=.4783fe10ac3adce15ac8f358ef5462739852c569.incomplete'
tts-1 | Downloading 'tokenizer_config.json' to 'cosyvoice2_llm/.cache/huggingface/download/vzaExXFZNBay89bvlQv-ZcI6BTg=.6c4f39ba12fba8370691a7004688316d1829e158.incomplete'
tts-1 | Download complete. Moving file to cosyvoice2_llm/tokenizer_config.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/merges.txt
Fetching 10 files: 50%|█████ | 5/10 [00:09<00:05, 1.13s/it]Download complete. Moving file to cosyvoice2_llm/vocab.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/tokenizer.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/model.safetensors
Fetching 10 files: 100%|██████████| 10/10 [00:22<00:00, 2.24s/it]
tts-1 | /workspace/CosyVoice/runtime/triton_trtllm/cosyvoice2_llm
tts-1 |
tts-1 | [CosyVoice ASCII-art banner]
tts-1 |
tts-1 | Downloading Model from https://www.modelscope.cn to directory: /workspace/CosyVoice/runtime/triton_trtllm/CosyVoice2-0.5B
Downloading [CosyVoice-BlankEN/config.json]: 100%|██████████| 659/659 [00:03<00:00, 203B/s]
Downloading [configuration.json]: 100%|██████████| 47.0/47.0 [00:03<00:00, 13.9B/s]
Downloading [asset/dingding.png]: 100%|██████████| 94.1k/94.1k [00:03<00:00, 26.1kB/s]
Downloading [campplus.onnx]: 100%|██████████| 27.0M/27.0M [00:06<00:00, 4.16MB/s]
Downloading [cosyvoice2.yaml]: 100%|██████████| 7.16k/7.16k [00:08<00:00, 891B/s]
Downloading [CosyVoice-BlankEN/generation_config.json]: 100%|██████████| 242/242 [00:07<00:00, 31.8B/s]
Downloading [flow.encoder.fp16.zip]: 100%|██████████| 111M/111M [00:14<00:00, 7.80MB/s]
Downloading [CosyVoice-BlankEN/merges.txt]: 100%|██████████| 1.34M/1.34M [00:03<00:00, 373kB/s]
Downloading [hift.pt]: 100%|██████████| 79.5M/79.5M [00:13<00:00, 6.01MB/s]
Downloading [README.md]: 100%|██████████| 11.8k/11.8k [00:05<00:00, 2.05kB/s]
Downloading [CosyVoice-BlankEN/tokenizer_config.json]: 100%|██████████| 1.26k/1.26k [00:04<00:00, 310B/s]
Downloading [flow.encoder.fp32.zip]: 100%|██████████| 183M/183M [00:22<00:00, 8.63MB/s]
Downloading [CosyVoice-BlankEN/vocab.json]: 100%|██████████| 2.65M/2.65M [00:02<00:00, 1.24MB/s]
Downloading [flow.decoder.estimator.fp32.onnx]: 100%|██████████| 273M/273M [00:27<00:00, 10.5MB/s]
Downloading [flow.pt]: 100%|██████████| 430M/430M [00:29<00:00, 15.3MB/s]
Downloading [flow.cache.pt]: 100%|██████████| 430M/430M [00:41<00:00, 10.8MB/s]
Downloading [speech_tokenizer_v2.onnx]: 100%|██████████| 473M/473M [00:35<00:00, 14.1MB/s]
Downloading [CosyVoice-BlankEN/model.safetensors]: 100%|██████████| 942M/942M [00:46<00:00, 21.3MB/s]
Downloading [llm.pt]: 100%|██████████| 1.88G/1.88G [01:42<00:00, 19.8MB/s]
Processing 19 items: 100%|██████████| 19.0/19.0 [01:50<00:00, 5.82s/it]
tts-1 |
tts-1 | Successfully Downloaded from model iic/CosyVoice2-0.5B.
tts-1 |
tts-1 | --2025-10-09 13:54:57-- https://raw.githubusercontent.com/qi-hua/async_cosyvoice/main/CosyVoice2-0.5B/spk2info.pt
tts-1 | Connecting to 10.1.20.70:20809... connected.
tts-1 | Proxy request sent, awaiting response... 200 OK
tts-1 | Length: 180930 (177K) [application/octet-stream]
tts-1 | Saving to: './CosyVoice2-0.5B/spk2info.pt'
tts-1 |
tts-1 | 0K .......... .......... .......... .......... .......... 28% 149K 1s
tts-1 | 50K .......... .......... .......... .......... .......... 56% 435K 0s
tts-1 | 100K .......... .......... .......... .......... .......... 84% 715K 0s
tts-1 | 150K .......... .......... ...... 100% 3.28M=0.5s
tts-1 |
tts-1 | 2025-10-09 13:55:01 (334 KB/s) - './CosyVoice2-0.5B/spk2info.pt' saved [180930/180930]
tts-1 |
tts-1 | Converting checkpoint to TensorRT weights
tts-1 | :1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
tts-1 | :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
tts-1 | 2025-10-09 13:55:08,748 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
tts-1 | /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2330: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
tts-1 | If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
tts-1 | warnings.warn(
tts-1 | [TensorRT-LLM] TensorRT-LLM version: 0.20.0
tts-1 | 0.20.0
198it [00:00, 1866.69it/s]
tts-1 | Total time of converting checkpoints: 00:00:02
tts-1 | Building TensorRT engines
tts-1 | :1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
tts-1 | :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
tts-1 | 2025-10-09 13:55:15,704 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
tts-1 | /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2330: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
tts-1 | If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
tts-1 | warnings.warn(
tts-1 | [TensorRT-LLM] TensorRT-LLM version: 0.20.0
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set bert_attention_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set nccl_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set lora_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set dora_plugin to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set moe_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set context_fmha to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set remove_input_padding to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set norm_quant_fusion to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set reduce_fusion to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set user_buffer to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set tokens_per_block to 32.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set use_paged_context_fmha to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set use_fp8_context_fmha to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set fuse_fp4_quant to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set multiple_profiles to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set paged_state to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set streamingllm to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set use_fused_mlp to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set pp_reduce_scatter to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Compute capability: (8, 9)
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] SM count: 128
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] SM clock: 3105 MHz
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] int4 TFLOPS: 813
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] int8 TFLOPS: 406
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] fp8 TFLOPS: 406
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] float16 TFLOPS: 203
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] bfloat16 TFLOPS: 203
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] float32 TFLOPS: 101
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Total Memory: 23 GiB
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Memory clock: 10501 MHz
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Memory bus width: 384
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Memory bandwidth: 1008 GB/s
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] PCIe speed: 16000 Mbps
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] PCIe link width: 16
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] PCIe bandwidth: 32 GB/s
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Provided but not required tensors: {'rotary_inv_freq', 'embed_positions_for_gpt_attention', 'embed_positions'}
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set dtype to bfloat16.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set paged_kv_cache to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Overriding paged_state to False
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set paged_state to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
tts-1 |
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 32768
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] FP8 Context FMHA is disabled because it must be used together with the fp8 quantization workflow.
tts-1 | [10/09/2025-13:55:16] [TRT] [I] [MemUsageChange] Init CUDA: CPU +20, GPU +0, now: CPU 662, GPU 1571 (MiB)
tts-1 | [10/09/2025-13:55:19] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1746, GPU +8, now: CPU 2610, GPU 1579 (MiB)
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Set nccl_plugin to None.
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Total time of constructing network from module object 3.763633966445923 seconds
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Total optimization profiles added: 1
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
tts-1 | [10/09/2025-13:55:19] [TRT] [W] Unused Input: position_ids
tts-1 | [10/09/2025-13:55:21] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
tts-1 | [10/09/2025-13:55:21] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
tts-1 | [10/09/2025-13:55:21] [TRT] [I] Compiler backend is used during engine build.
tts-1 | [10/09/2025-13:55:23] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
tts-1 | [10/09/2025-13:55:23] [TRT] [I] Detected 17 inputs and 1 output network tensors.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Total Host Persistent Memory: 77248 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Total Device Persistent Memory: 0 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Max Scratch Memory: 140786688 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 402 steps to complete.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 10.9448ms to assign 18 blocks to 402 nodes requiring 1073777152 bytes.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Total Activation Memory: 1073776640 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Total Weights Memory: 1291661568 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Compiler backend is used during engine execution.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Engine generation completed in 3.35634 seconds.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 1232 MiB
tts-1 | [10/09/2025-13:55:25] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:06
tts-1 | [10/09/2025-13:55:25] [TRT] [I] Serialized 27 bytes of code generator cache.
tts-1 | [10/09/2025-13:55:25] [TRT] [I] Serialized 131417 bytes of compilation cache.
tts-1 | [10/09/2025-13:55:25] [TRT] [I] Serialized 9 timing cache entries
tts-1 | [10/09/2025-13:55:25] [TRT-LLM] [I] Timing cache serialized to model.cache
tts-1 | [10/09/2025-13:55:25] [TRT-LLM] [I] Build phase peak memory: 9575.57 MB, children: 16.25 MB
tts-1 | [10/09/2025-13:55:25] [TRT-LLM] [I] Serializing engine to ./trt_engines_bfloat16/rank0.engine...
tts-1 | [10/09/2025-13:55:26] [TRT-LLM] [I] Engine serialized. Total time: 00:00:00
tts-1 | [10/09/2025-13:55:26] [TRT-LLM] [I] Total time of building all engines: 00:00:11
tts-1 | Testing TensorRT engines
tts-1 | :1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
tts-1 | :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
tts-1 | 2025-10-09 13:55:31,682 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
tts-1 | /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2330: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
tts-1 | If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
tts-1 | warnings.warn(
tts-1 | [TensorRT-LLM] TensorRT-LLM version: 0.20.0
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [V] Input token ids (batch_size = 1):
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [V] Request 0: [158227, 56568, 52801, 37945, 56007, 56568, 99882, 99217, 81596, 11319, 158228]
tts-1 | [TensorRT-LLM][INFO] Engine version 0.20.0 found in the config file, assuming engine(s) built by new builder API.
tts-1 | [TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set dtype to bfloat16.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set bert_attention_plugin to auto.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set explicitly_disable_gemm_plugin to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set qserve_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set identity_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set nccl_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set lora_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set dora_plugin to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set smooth_quant_plugins to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set moe_plugin to auto.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set context_fmha to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set paged_kv_cache to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set remove_input_padding to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set norm_quant_fusion to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set reduce_fusion to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set user_buffer to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set tokens_per_block to 32.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set use_paged_context_fmha to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set fuse_fp4_quant to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set multiple_profiles to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set paged_state to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set streamingllm to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set manage_weights to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set use_fused_mlp to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set pp_reduce_scatter to False.
tts-1 | [TensorRT-LLM][INFO] Engine version 0.20.0 found in the config file, assuming engine(s) built by new builder API.
tts-1 | [TensorRT-LLM][INFO] Refreshed the MPI local session
tts-1 | [TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
tts-1 | [TensorRT-LLM][INFO] Rank 0 is using GPU 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 24
tts-1 | [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 32768
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 32767 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
tts-1 | [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
tts-1 | [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
tts-1 | [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
tts-1 | [TensorRT-LLM][INFO] Loaded engine size: 1237 MiB
tts-1 | [TensorRT-LLM][INFO] Engine load time 405 ms
tts-1 | [TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
tts-1 | [TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1024.03 MiB for execution context memory.
tts-1 | [TensorRT-LLM][INFO] gatherContextLogits: 0
tts-1 | [TensorRT-LLM][INFO] gatherGenerationLogits: 0
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1231 (MiB)
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.35 MB GPU memory for runtime buffers.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.98 MB GPU memory for decoder.
tts-1 | [TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.99 GiB, available: 20.21 GiB, extraCostMemory: 0.00 GiB
tts-1 | [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33110
tts-1 | [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
tts-1 | [TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
tts-1 | [TensorRT-LLM][INFO] Max KV cache pages per sequence: 1024 [window size=32768]
tts-1 | [TensorRT-LLM][INFO] Number of tokens per block: 32.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 12.13 GiB for max tokens in paged KV cache (1059520).
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Load engine takes: 0.9719598293304443 sec
tts-1 | Input [Text 0]: "<|sos|>你好,请问你叫什么?<|task_id|>"
tts-1 | Output [Text 0]: "<|s_2032|><|s_3914|><|s_539|><|s_4859|><|s_5112|><|s_6084|><|s_6329|><|s_4217|><|s_680|><|s_2666|><|s_1275|><|s_4433|><|s_2274|><|s_836|><|s_1397|><|s_3590|><|s_5087|><|s_4568|><|s_2292|><|s_3912|><|s_512|><|s_1295|><|s_546|><|s_2265|><|s_4538|><|s_5755|><|s_2867|><|s_2954|><|s_6048|><|s_846|><|s_1180|><|s_4315|><|s_2125|><|s_575|><|s_2678|><|s_5032|><|s_3411|>"
tts-1 | [10/09/2025-13:55:33] [TRT-LLM] [V] [153695, 155577, 152202, 156522, 156775, 157747, 157992, 155880, 152343, 154329, 152938, 156096, 153937, 152499, 153060, 155253, 156750, 156231, 153955, 155575, 152175, 152958, 152209, 153928, 156201, 157418, 154530, 154617, 157711, 152509, 152843, 155978, 153788, 152238, 154341, 156695, 155074]
tts-1 | [TensorRT-LLM][INFO] Refreshed the MPI local session
tts-1 | Creating model repository
tts-1 | Starting Triton server
tts-1 | I1009 13:55:37.095235 2432 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x204c00000' with size 268435456"
tts-1 | I1009 13:55:37.095499 2432 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
tts-1 | I1009 13:55:37.102713 2432 model_lifecycle.cc:473] "loading: audio_tokenizer:1"
tts-1 | I1009 13:55:37.102776 2432 model_lifecycle.cc:473] "loading: cosyvoice2:1"
tts-1 | I1009 13:55:37.102817 2432 model_lifecycle.cc:473] "loading: speaker_embedding:1"
tts-1 | I1009 13:55:37.102886 2432 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
tts-1 | I1009 13:55:37.102938 2432 model_lifecycle.cc:473] "loading: token2wav:1"
tts-1 | I1009 13:55:38.363103 2432 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
tts-1 | I1009 13:55:38.363189 2432 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
tts-1 | I1009 13:55:38.363197 2432 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
tts-1 | I1009 13:55:38.363203 2432 libtensorrtllm.cc:86] "backend configuration:\n{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}"
tts-1 | [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
tts-1 | [TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
tts-1 | I1009 13:55:38.375040 2432 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
tts-1 | [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
tts-1 | [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
tts-1 | [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
tts-1 | [TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
tts-1 | [TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
tts-1 | [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
tts-1 | [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
tts-1 | [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
tts-1 | [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
tts-1 | [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
tts-1 | [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
tts-1 | [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
tts-1 | [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
tts-1 | [TensorRT-LLM][INFO] num_nodes is not specified, will be set to 1
tts-1 | [TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
tts-1 | [TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
tts-1 | [TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
tts-1 | [TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
tts-1 | [TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
tts-1 | [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa, redrafter, lookahead, eagle}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
tts-1 | [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
tts-1 | [TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
tts-1 | [TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
tts-1 | [TensorRT-LLM][INFO] Engine version 0.20.0 found in the config file, assuming engine(s) built by new builder API.
tts-1 | [TensorRT-LLM][INFO] Initializing MPI with thread mode 3
tts-1 | [TensorRT-LLM][INFO] Initialized MPI
tts-1 | [TensorRT-LLM][INFO] Refreshed the MPI local session
tts-1 | [TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
tts-1 | [TensorRT-LLM][INFO] Rank 0 is using GPU 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2560) * 24
tts-1 | [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 32768
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 32767 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
tts-1 | [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
tts-1 | [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
tts-1 | [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
tts-1 | [TensorRT-LLM][INFO] Loaded engine size: 1237 MiB
tts-1 | [TensorRT-LLM][INFO] Engine load time 623 ms
tts-1 | [TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
tts-1 | [TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1024.03 MiB for execution context memory.
tts-1 | [TensorRT-LLM][INFO] gatherContextLogits: 0
tts-1 | [TensorRT-LLM][INFO] gatherGenerationLogits: 0
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1231 (MiB)
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 14.26 MB GPU memory for runtime buffers.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 47.66 MB GPU memory for decoder.
tts-1 | [TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.99 GiB, available: 20.09 GiB, extraCostMemory: 0.00 GiB
tts-1 | [TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
tts-1 | [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 80
tts-1 | [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
tts-1 | [TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
tts-1 | [TensorRT-LLM][INFO] Max KV cache pages per sequence: 1024 [window size=2560]
tts-1 | [TensorRT-LLM][INFO] Number of tokens per block: 32.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.03 GiB for max tokens in paged KV cache (2560).
tts-1 | [TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
tts-1 | [TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
tts-1 | I1009 13:55:39.395884 2432 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
tts-1 | I1009 13:55:39.396683 2432 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm'"
tts-1 | I1009 13:55:39.603512 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: speaker_embedding_0_0 (CPU device 0)"
tts-1 | I1009 13:55:39.615976 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: audio_tokenizer_0_0 (CPU device 0)"
tts-1 | 2025-10-09 13:55:40,479 INFO Converting onnx to trt...
tts-1 | [10/09/2025-13:55:40] [TRT] [I] [MemUsageChange] Init CUDA: CPU +29, GPU +0, now: CPU 150, GPU 1571 (MiB)
tts-1 | I1009 13:55:41.531841 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: token2wav_0_0 (CPU device 0)"
tts-1 | [10/09/2025-13:55:41] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1749, GPU +8, now: CPU 2101, GPU 1579 (MiB)
tts-1 | I1009 13:55:42.055643 2432 model_lifecycle.cc:849] "successfully loaded 'audio_tokenizer'"
tts-1 | 2025-10-09 13:55:42,470 INFO Initializing vocoder from ./CosyVoice2-0.5B on cuda:0
tts-1 | I1009 13:55:42.838400 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: cosyvoice2_0_0 (CPU device 0)"
tts-1 | I1009 13:55:42.838535 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: cosyvoice2_0_1 (CPU device 0)"
tts-1 | I1009 13:55:42.838745 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: cosyvoice2_0_2 (CPU device 0)"
tts-1 | I1009 13:55:42.838940 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: cosyvoice2_0_3 (CPU device 0)"
tts-1 | [10/09/2025-13:55:43] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
tts-1 | Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
tts-1 | I1009 13:55:46.207803 2432 model.py:68] "model_params:{'model_dir': './CosyVoice2-0.5B', 'llm_tokenizer_dir': './cosyvoice2_llm'}"
tts-1 | I1009 13:55:46.208030 2432 model.py:70] "Using dynamic chunk strategy: exponential"
tts-1 | I1009 13:55:46.235910 2432 model.py:68] "model_params:{'model_dir': './CosyVoice2-0.5B', 'llm_tokenizer_dir': './cosyvoice2_llm'}"
tts-1 | I1009 13:55:46.236112 2432 model.py:70] "Using dynamic chunk strategy: exponential"
tts-1 | I1009 13:55:46.677843 2432 model.py:68] "model_params:{'model_dir': './CosyVoice2-0.5B', 'llm_tokenizer_dir': './cosyvoice2_llm'}"
tts-1 | I1009 13:55:46.678040 2432 model.py:70] "Using dynamic chunk strategy: exponential"
tts-1 | /usr/local/lib/python3.12/dist-packages/diffusers/models/lora.py:393: FutureWarning: LoRACompatibleLinear is deprecated and will be removed in version 1.0.0. Use of LoRACompatibleLinear is deprecated. Please switch to PEFT backend by installing PEFT: pip install peft.
tts-1 | deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
tts-1 | I1009 13:55:47.228192 2432 model.py:68] "model_params:{'model_dir': './CosyVoice2-0.5B', 'llm_tokenizer_dir': './cosyvoice2_llm'}"
tts-1 | I1009 13:55:47.228357 2432 model.py:70] "Using dynamic chunk strategy: exponential"
tts-1 | 2025-10-09 13:55:47,523 INFO input frame rate=25
tts-1 | I1009 13:55:47.539043 2432 model_lifecycle.cc:849] "successfully loaded 'cosyvoice2'"
tts-1 | /usr/local/lib/python3.12/dist-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
tts-1 | WeightNorm.apply(module, name, dim)
tts-1 | 2025-10-09 13:55:49,329 INFO Converting onnx to trt...
tts-1 | [10/09/2025-13:55:49] [TRT] [I] [MemUsageChange] Init CUDA: CPU -2, GPU +0, now: CPU 3520, GPU 2085 (MiB)
tts-1 | [10/09/2025-13:55:50] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1168, GPU +8, now: CPU 4487, GPU 2093 (MiB)
tts-1 | [10/09/2025-13:55:53] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
tts-1 | [10/09/2025-13:55:57] [TRT] [I] Compiler backend is used during engine build.
tts-1 | [10/09/2025-13:56:14] [TRT] [I] Compiler backend is used during engine build.
tts-1 | [10/09/2025-13:56:28] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_1_0/block1/block/block_0/Pad_1290_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:29] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_2_0/block1/block/block_0/Pad_1865_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:31] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_3_0/block1/block/block_0/Pad_2440_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:33] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_4_0/block1/block/block_0/Pad_3015_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:34] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_5_0/block1/block/block_0/Pad_3590_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:36] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
tts-1 | [10/09/2025-13:56:36] [TRT] [I] Detected 1 inputs and 1 output network tensors.
tts-1 | [10/09/2025-13:56:36] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_6_0/block1/block/block_0/Pad_4165_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:37] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_7_0/block1/block/block_0/Pad_4740_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:38] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_8_0/block1/block/block_0/Pad_5315_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Total Host Persistent Memory: 1427600 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Total Device Persistent Memory: 0 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Max Scratch Memory: 9216000 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1485 steps to complete.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 111.264ms to assign 9 blocks to 1485 nodes requiring 66434048 bytes.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Total Activation Memory: 66433024 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Total Weights Memory: 27425024 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Compiler backend is used during engine execution.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Engine generation completed in 55.7007 seconds.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 370 MiB
tts-1 | [10/09/2025-13:56:39] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_9_0/block1/block/block_0/Pad_5890_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | 2025-10-09 13:56:39,330 INFO Succesfully convert onnx to trt...
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Loaded engine size: 30 MiB
tts-1 | [10/09/2025-13:56:39] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +2, GPU +63, now: CPU 2, GPU 89 (MiB)
tts-1 | I1009 13:56:39.660899 2432 model_lifecycle.cc:849] "successfully loaded 'speaker_embedding'"
tts-1 | [10/09/2025-13:56:40] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_10_0/block1/block/block_0/Pad_6465_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:41] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_11_0/block1/block/block_0/Pad_7040_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:42] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/up_blocks_0_0/block1/block/block_0/Pad_7653_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:57] [TRT] [I] Detected 6 inputs and 1 output network tensors.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Total Host Persistent Memory: 288976 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Total Device Persistent Memory: 2048 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Max Scratch Memory: 324924416 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 259 steps to complete.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 6.05917ms to assign 10 blocks to 259 nodes requiring 346475008 bytes.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Total Activation Memory: 346473984 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Total Weights Memory: 142611776 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Compiler backend is used during engine execution.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Engine generation completed in 66.5267 seconds.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 75 MiB, GPU 2308 MiB
tts-1 | 2025-10-09 13:56:59,696 INFO Succesfully convert onnx to trt...
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Loaded engine size: 161 MiB
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MS] Running engine with multi stream info
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MS] Number of aux streams is 1
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MS] Number of total worker streams is 2
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MS] The main stream provided by execute/enqueue calls is the first worker stream
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +330, now: CPU 0, GPU 466 (MiB)
tts-1 | 2025-10-09 13:57:00,615 INFO Token2Wav initialized successfully
tts-1 | I1009 13:57:00.618432 2432 model_lifecycle.cc:849] "successfully loaded 'token2wav'"
tts-1 | I1009 13:57:00.618974 2432 server.cc:611]
tts-1 | +------------------+------+
tts-1 | | Repository Agent | Path |
tts-1 | +------------------+------+
tts-1 | +------------------+------+
tts-1 |
tts-1 | I1009 13:57:00.619037 2432 server.cc:638]
tts-1 | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 | | Backend | Path | Config |
tts-1 | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 | | python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
tts-1 | | tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
tts-1 | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 |
tts-1 | I1009 13:57:00.619077 2432 server.cc:681]
tts-1 | +-------------------+---------+--------+
tts-1 | | Model | Version | Status |
tts-1 | +-------------------+---------+--------+
tts-1 | | audio_tokenizer | 1 | READY |
tts-1 | | cosyvoice2 | 1 | READY |
tts-1 | | speaker_embedding | 1 | READY |
tts-1 | | tensorrt_llm | 1 | READY |
tts-1 | | token2wav | 1 | READY |
tts-1 | +-------------------+---------+--------+
tts-1 |
tts-1 | I1009 13:57:00.676004 2432 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090"
tts-1 | I1009 13:57:00.678383 2432 metrics.cc:783] "Collecting CPU metrics"
tts-1 | I1009 13:57:00.678781 2432 tritonserver.cc:2598]
tts-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 | | Option | Value |
tts-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 | | server_id | triton |
tts-1 | | server_version | 2.59.0 |
tts-1 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
tts-1 | | model_repository_path[0] | ./model_repo_cosyvoice2 |
tts-1 | | model_control_mode | MODE_NONE |
tts-1 | | strict_model_config | 0 |
tts-1 | | model_config_name | |
tts-1 | | rate_limit | OFF |
tts-1 | | pinned_memory_pool_byte_size | 268435456 |
tts-1 | | cuda_memory_pool_byte_size{0} | 67108864 |
tts-1 | | min_supported_compute_capability | 6.0 |
tts-1 | | strict_readiness | 1 |
tts-1 | | exit_timeout | 30 |
tts-1 | | cache_enabled | 0 |
tts-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 |
tts-1 | I1009 13:57:00.692860 2432 grpc_server.cc:2562] "Started GRPCInferenceService at 0.0.0.0:8001"
tts-1 | I1009 13:57:00.693159 2432 http_server.cc:4832] "Started HTTPService at 0.0.0.0:8000"
tts-1 | I1009 13:57:00.734753 2432 http_server.cc:358] "Started Metrics Service at 0.0.0.0:8002"
tts-1 | W1009 13:57:30.688489 2432 python_be.cc:1759] "Failed to share CUDA memory pool with stub process: Failed to open the cudaIpcHandle. error: invalid resource handle. Will use CUDA IPC."
tts-1 | I1009 13:57:30.688588 2432 pb_stub.cc:1464] "Failed to initialize CUDA shared memory pool in Python stub: Failed to open the cudaIpcHandle. error: invalid resource handle"
tts-1 | E1009 13:57:30.688810 2432 pb_stub.cc:428] "An error occurred while trying to load GPU buffers in the Python backend stub: Failed to open the cudaIpcHandle. error: invalid resource handle\n"
tts-1 | W1009 13:57:30.837214 2432 python_be.cc:1759] "Failed to share CUDA memory pool with stub process: Failed to open the cudaIpcHandle. error: invalid resource handle. Will use CUDA IPC."
tts-1 | I1009 13:57:30.837186 2432 pb_stub.cc:1464] "Failed to initialize CUDA shared memory pool in Python stub: Failed to open the cudaIpcHandle. error: invalid resource handle"
tts-1 | E1009 13:57:30.837991 2432 pb_stub.cc:736] "Failed to process the request(s) for model 'cosyvoice2_0_1', message: TritonModelException: Failed to open the cudaIpcHandle. error: invalid resource handle\n\nAt:\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(193): forward_audio_tokenizer\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(336): execute\n"
tts-1 | E1009 13:57:34.597431 2432 pb_stub.cc:736] "Failed to process the request(s) for model 'audio_tokenizer_0_0', message: RuntimeError: CUDA error: invalid resource handle\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n\n\nAt:\n /usr/local/lib/python3.12/dist-packages/torch/nn/functional.py(5209): pad\n /usr/local/lib/python3.12/dist-packages/torch/functional.py(714): stft\n /usr/local/lib/python3.12/dist-packages/s3tokenizer/utils.py(258): log_mel_spectrogram\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/audio_tokenizer/1/model.py(82): execute\n"
tts-1 | E1009 13:57:34.597715 2432 pb_stub.cc:736] "Failed to process the request(s) for model 'cosyvoice2_0_3', message: TritonModelException: Failed to process the request(s) for model 'audio_tokenizer_0_0', message: RuntimeError: CUDA error: invalid resource handle\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n\n\nAt:\n /usr/local/lib/python3.12/dist-packages/torch/nn/functional.py(5209): pad\n /usr/local/lib/python3.12/dist-packages/torch/functional.py(714): stft\n /usr/local/lib/python3.12/dist-packages/s3tokenizer/utils.py(258): log_mel_spectrogram\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/audio_tokenizer/1/model.py(82): execute\n\n\nAt:\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(195): forward_audio_tokenizer\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(336): execute\n"
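
The first failure above is the Python backend stub failing to open a cudaIpcHandle. The cudaIpc* APIs are how Triton hands GPU buffers between the server and its Python stub processes, and CUDA IPC has historically been unsupported under WSL2, so this looks environment-related rather than a code bug. A quick standalone check of whether CUDA IPC works at all here (a minimal sketch, assuming torch is importable inside the container; torch.multiprocessing shares CUDA tensors through the same IPC handles):

```python
# ipc_check.py - passing a CUDA tensor to a child process requires
# cudaIpcGetMemHandle/cudaIpcOpenMemHandle. Where CUDA IPC is unsupported
# (e.g. under WSL2), this is expected to fail with an error much like the
# "invalid resource handle" seen in the server log.
import torch
import torch.multiprocessing as mp

def child(q):
    t = q.get()                      # opening the IPC handle happens here
    print("child received:", t.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue()
    p = mp.Process(target=child, args=(q,))
    p.start()
    q.put(torch.ones(4, device="cuda"))  # serialized as a CUDA IPC handle
    p.join()
```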

Below is the RPC error log:

Initializing gRPC client for streaming mode...
task-0: Initializing sync client for streaming...
task-0: Starting streaming processing for 1 items.
task-0: Processing item 0/1
Received InferenceServerException: Failed to process the request(s) for model 'cosyvoice2_0_3', message: TritonModelException: Failed to process the request(s) for model 'audio_tokenizer_0_0', message: RuntimeError: CUDA error: invalid resource handle
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

At:
/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py(5209): pad
/usr/local/lib/python3.12/dist-packages/torch/functional.py(714): stft
/usr/local/lib/python3.12/dist-packages/s3tokenizer/utils.py(258): log_mel_spectrogram
/workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/audio_tokenizer/1/model.py(82): execute

At:
/workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(195): forward_audio_tokenizer
/workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(336): execute

task-0: Item 0 failed.
task-0: Closing sync client...
task-0: Finished streaming processing. Total duration synthesized: 0.0000s
Total synthesized duration is zero. Cannot calculate RTF or latency percentiles.
Mode: streaming
RTF: inf
total_duration: 0.000 seconds (0.00 hours)
processing time: 0.035 seconds (0.00 hours)
No latency data collected.

Initializing temporary async client for fetching stats...
Fetching inference statistics...
Fetching model config...
Could not retrieve statistics or config: 'ns'
Closing temporary async stats client...
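
Both failing stacks bottom out in torch.stft / F.pad inside s3tokenizer's log_mel_spectrogram, i.e. the very first GPU kernels launched by the stub process. To separate a broken per-process CUDA context from a problem in the audio path itself, the equivalent op can be run directly inside the container (a sketch assuming the same torch install):

```python
# If plain torch.stft on the GPU succeeds in a shell inside the container
# but still fails inside the Triton Python stub, the problem is the stub's
# CUDA context / IPC setup, not the audio tokenizer code.
import torch

wav = torch.randn(16000, device="cuda")   # 1 s of synthetic 16 kHz audio
window = torch.hann_window(400, device="cuda")
spec = torch.stft(wav, n_fft=400, hop_length=160,
                  window=window, return_complex=True)
print(spec.shape)                          # torch.Size([201, 101])
```

The traceback's own suggestion of CUDA_LAUNCH_BLOCKING=1 can also be set in the compose environment to get an accurate failing line.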
