
Streaming RPC inference test with TensorRT-LLM in Docker fails with CUDA handle / context errors #1599

@salier

Description

As the title says: WSL, Ubuntu 24.04, RTX 4090 24 GB, CUDA 12.9.
The code is unmodified. Running the streaming test over RPC fails with a CUDA handle / context error.
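
For context, the test drives Triton's gRPC streaming endpoint roughly as in the sketch below (a minimal illustration, assuming `tritonclient` is installed and the server from the log listens on localhost:8001; the model name "cosyvoice2" comes from the server's model table further down, but the input name and shape are placeholders, since the actual tensors are defined by the client script in runtime/triton_trtllm):

```python
# Minimal sketch of a Triton gRPC streaming request. Illustrative only:
# input name/shape are assumptions, not the repo's real client tensors.
import queue
import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Streamed partial results (or errors) arrive asynchronously here.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")

text = np.array([["你好,请问你叫什么?".encode("utf-8")]], dtype=np.object_)
inp = grpcclient.InferInput("text", text.shape, "BYTES")
inp.set_data_from_numpy(text)

client.start_stream(callback=callback)
client.async_stream_infer(model_name="cosyvoice2", inputs=[inp], request_id="task-0")
first = responses.get()   # with this bug, an InferenceServerException lands here
client.stop_stream()
print(first)
```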
Below is the initialization + server backend log:

(Cenv) serial@sariel:/workserver/TensorRTC$ docker compose up -d
WARN[0000] The "MODEL_ID" variable is not set. Defaulting to a blank string.
[+] Running 1/1
✔ Container tensorrtc-tts-1 Started 0.3s
(Cenv) serial@sariel:/workserver/TensorRTC$ docker compose up
WARN[0000] The "MODEL_ID" variable is not set. Defaulting to a blank string.
[+] Running 1/1
✔ Container tensorrtc-tts-1 Running 0.0s
Attaching to tts-1
tts-1 | Submodule 'third_party/Matcha-TTS' (https://github.com/shivammehta25/Matcha-TTS.git) registered for path 'third_party/Matcha-TTS'
tts-1 | Cloning into '/workspace/CosyVoice/third_party/Matcha-TTS'...
tts-1 | Submodule path 'third_party/Matcha-TTS': checked out 'dd9105b34bf2be2230f4aa1e4769fb586a3c824e'
tts-1 | Downloading CosyVoice2-0.5B
Fetching 10 files: 0%| | 0/10 [00:00<?, ?it/s]Downloading 'model.safetensors' to 'cosyvoice2_llm/.cache/huggingface/download/xGOKKLRSlIhH692hSVvI1-gpoa8=.91d53bdc8f6bf5752f40e1400305a64f6c8a8e335336ea0f5d5eaac8da974050.incomplete'
tts-1 | Downloading 'tokenizer.json' to 'cosyvoice2_llm/.cache/huggingface/download/HgM_lKo9sdSCfRtVg7MMFS7EKqo=.bf18af7d528d3912841a5cc768a7948f7c8565b2f6dbf8ee3d02e6cf58df98fc.incomplete'
tts-1 | Downloading 'merges.txt' to 'cosyvoice2_llm/.cache/huggingface/download/PtHk0z_I45atnj23IIRhTExwT3w=.6ed63830772e0c3879f54f26b056a9b2bf5ad8f4.incomplete'
tts-1 | Downloading 'generation_config.json' to 'cosyvoice2_llm/.cache/huggingface/download/3EVKVggOldJcKSsGjSdoUCN1AyQ=.2845e56f291aa8ab6c1d5db1509a78ebe5f809e5.incomplete'
tts-1 | Downloading 'special_tokens_map.json' to 'cosyvoice2_llm/.cache/huggingface/download/ahkChHUJFxEmOdq5GDFEmerRzCY=.b023bc245a90c287c1c2e3459d1c9f7e28eb1bea.incomplete'
tts-1 | Downloading 'added_tokens.json' to 'cosyvoice2_llm/.cache/huggingface/download/SeqzFlf9ZNZ3or_wZAOIdsM3Yxw=.969a9654d282db3f5763b5a8129cd6653428e486.incomplete'
tts-1 | Downloading '.gitattributes' to 'cosyvoice2_llm/.cache/huggingface/download/wPaCkH-WbT7GsmxMKKrNZTV4nSM=.52373fe24473b1aa44333d318f578ae6bf04b49b.incomplete'
tts-1 | Downloading 'config.json' to 'cosyvoice2_llm/.cache/huggingface/download/8_PA_wEVGiVa2goH2H4KQOQpvVY=.cfb66b8d1dd57d8f40492c0da59d727029542520.incomplete'
tts-1 | Download complete. Moving file to cosyvoice2_llm/.gitattributes
Fetching 10 files: 10%|█ | 1/10 [00:08<01:13, 8.11s/it]Download complete. Moving file to cosyvoice2_llm/special_tokens_map.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/config.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/added_tokens.json
Fetching 10 files: 20%|██ | 2/10 [00:08<00:27, 3.45s/it]Download complete. Moving file to cosyvoice2_llm/generation_config.json
Fetching 10 files: 40%|████ | 4/10 [00:08<00:08, 1.37s/it]Downloading 'vocab.json' to 'cosyvoice2_llm/.cache/huggingface/download/j3m-Hy6QvBddw8RXA1uSWl1AJ0c=.4783fe10ac3adce15ac8f358ef5462739852c569.incomplete'
tts-1 | Downloading 'tokenizer_config.json' to 'cosyvoice2_llm/.cache/huggingface/download/vzaExXFZNBay89bvlQv-ZcI6BTg=.6c4f39ba12fba8370691a7004688316d1829e158.incomplete'
tts-1 | Download complete. Moving file to cosyvoice2_llm/tokenizer_config.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/merges.txt
Fetching 10 files: 50%|█████ | 5/10 [00:09<00:05, 1.13s/it]Download complete. Moving file to cosyvoice2_llm/vocab.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/tokenizer.json
tts-1 | Download complete. Moving file to cosyvoice2_llm/model.safetensors
Fetching 10 files: 100%|██████████| 10/10 [00:22<00:00, 2.24s/it]
tts-1 | /workspace/CosyVoice/runtime/triton_trtllm/cosyvoice2_llm
tts-1 |
tts-1 | [CosyVoice ASCII-art banner]
tts-1 |
tts-1 | Downloading Model from https://www.modelscope.cn to directory: /workspace/CosyVoice/runtime/triton_trtllm/CosyVoice2-0.5B
Downloading [CosyVoice-BlankEN/config.json]: 100%|██████████| 659/659 [00:03<00:00, 203B/s]
Downloading [configuration.json]: 100%|██████████| 47.0/47.0 [00:03<00:00, 13.9B/s]
Downloading [asset/dingding.png]: 100%|██████████| 94.1k/94.1k [00:03<00:00, 26.1kB/s]
Downloading [campplus.onnx]: 100%|██████████| 27.0M/27.0M [00:06<00:00, 4.16MB/s]
Downloading [cosyvoice2.yaml]: 100%|██████████| 7.16k/7.16k [00:08<00:00, 891B/s]
Downloading [CosyVoice-BlankEN/generation_config.json]: 100%|██████████| 242/242 [00:07<00:00, 31.8B/s]
Downloading [flow.encoder.fp16.zip]: 100%|██████████| 111M/111M [00:14<00:00, 7.80MB/s]
Downloading [CosyVoice-BlankEN/merges.txt]: 100%|██████████| 1.34M/1.34M [00:03<00:00, 373kB/s]
Downloading [hift.pt]: 100%|██████████| 79.5M/79.5M [00:13<00:00, 6.01MB/s]
Downloading [README.md]: 100%|██████████| 11.8k/11.8k [00:05<00:00, 2.05kB/s]
Downloading [CosyVoice-BlankEN/tokenizer_config.json]: 100%|██████████| 1.26k/1.26k [00:04<00:00, 310B/s]
Downloading [flow.encoder.fp32.zip]: 100%|██████████| 183M/183M [00:22<00:00, 8.63MB/s]
Downloading [CosyVoice-BlankEN/vocab.json]: 100%|██████████| 2.65M/2.65M [00:02<00:00, 1.24MB/s]
Downloading [flow.decoder.estimator.fp32.onnx]: 100%|██████████| 273M/273M [00:27<00:00, 10.5MB/s]
Downloading [flow.pt]: 100%|██████████| 430M/430M [00:29<00:00, 15.3MB/s]
Downloading [flow.cache.pt]: 100%|██████████| 430M/430M [00:41<00:00, 10.8MB/s]
Downloading [speech_tokenizer_v2.onnx]: 100%|██████████| 473M/473M [00:35<00:00, 14.1MB/s]
Downloading [CosyVoice-BlankEN/model.safetensors]: 100%|██████████| 942M/942M [00:46<00:00, 21.3MB/s]
Downloading [llm.pt]: 100%|██████████| 1.88G/1.88G [01:42<00:00, 19.8MB/s]
Processing 19 items: 100%|██████████| 19.0/19.0 [01:50<00:00, 5.82s/it]
tts-1 |
tts-1 | Successfully Downloaded from model iic/CosyVoice2-0.5B.
tts-1 |
tts-1 | --2025-10-09 13:54:57-- https://raw.githubusercontent.com/qi-hua/async_cosyvoice/main/CosyVoice2-0.5B/spk2info.pt
tts-1 | Connecting to 10.1.20.70:20809... connected.
tts-1 | Proxy request sent, awaiting response... 200 OK
tts-1 | Length: 180930 (177K) [application/octet-stream]
tts-1 | Saving to: './CosyVoice2-0.5B/spk2info.pt'
tts-1 |
tts-1 | 0K .......... .......... .......... .......... .......... 28% 149K 1s
tts-1 | 50K .......... .......... .......... .......... .......... 56% 435K 0s
tts-1 | 100K .......... .......... .......... .......... .......... 84% 715K 0s
tts-1 | 150K .......... .......... ...... 100% 3.28M=0.5s
tts-1 |
tts-1 | 2025-10-09 13:55:01 (334 KB/s) - './CosyVoice2-0.5B/spk2info.pt' saved [180930/180930]
tts-1 |
tts-1 | Converting checkpoint to TensorRT weights
tts-1 | :1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
tts-1 | :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
tts-1 | 2025-10-09 13:55:08,748 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
tts-1 | /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2330: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
tts-1 | If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
tts-1 | warnings.warn(
tts-1 | [TensorRT-LLM] TensorRT-LLM version: 0.20.0
tts-1 | 0.20.0
198it [00:00, 1866.69it/s]
tts-1 | Total time of converting checkpoints: 00:00:02
tts-1 | Building TensorRT engines
tts-1 | :1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
tts-1 | :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
tts-1 | 2025-10-09 13:55:15,704 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
tts-1 | /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2330: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
tts-1 | If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
tts-1 | warnings.warn(
tts-1 | [TensorRT-LLM] TensorRT-LLM version: 0.20.0
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set bert_attention_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set nccl_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set lora_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set dora_plugin to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set moe_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set context_fmha to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set remove_input_padding to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set norm_quant_fusion to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set reduce_fusion to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set user_buffer to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set tokens_per_block to 32.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set use_paged_context_fmha to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set use_fp8_context_fmha to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set fuse_fp4_quant to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set multiple_profiles to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set paged_state to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set streamingllm to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set use_fused_mlp to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set pp_reduce_scatter to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Compute capability: (8, 9)
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] SM count: 128
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] SM clock: 3105 MHz
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] int4 TFLOPS: 813
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] int8 TFLOPS: 406
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] fp8 TFLOPS: 406
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] float16 TFLOPS: 203
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] bfloat16 TFLOPS: 203
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] float32 TFLOPS: 101
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Total Memory: 23 GiB
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Memory clock: 10501 MHz
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Memory bus width: 384
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Memory bandwidth: 1008 GB/s
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] PCIe speed: 16000 Mbps
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] PCIe link width: 16
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] PCIe bandwidth: 32 GB/s
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Provided but not required tensors: {'rotary_inv_freq', 'embed_positions_for_gpt_attention', 'embed_positions'}
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set dtype to bfloat16.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set paged_kv_cache to True.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Overriding paged_state to False
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set paged_state to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 32768
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
tts-1 |
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 32768
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
tts-1 | [10/09/2025-13:55:15] [TRT-LLM] [W] FP8 Context FMHA is disabled because it must be used together with the fp8 quantization workflow.
tts-1 | [10/09/2025-13:55:16] [TRT] [I] [MemUsageChange] Init CUDA: CPU +20, GPU +0, now: CPU 662, GPU 1571 (MiB)
tts-1 | [10/09/2025-13:55:19] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1746, GPU +8, now: CPU 2610, GPU 1579 (MiB)
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Set nccl_plugin to None.
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Total time of constructing network from module object 3.763633966445923 seconds
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Total optimization profiles added: 1
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
tts-1 | [10/09/2025-13:55:19] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
tts-1 | [10/09/2025-13:55:19] [TRT] [W] Unused Input: position_ids
tts-1 | [10/09/2025-13:55:21] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
tts-1 | [10/09/2025-13:55:21] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
tts-1 | [10/09/2025-13:55:21] [TRT] [I] Compiler backend is used during engine build.
tts-1 | [10/09/2025-13:55:23] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
tts-1 | [10/09/2025-13:55:23] [TRT] [I] Detected 17 inputs and 1 output network tensors.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Total Host Persistent Memory: 77248 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Total Device Persistent Memory: 0 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Max Scratch Memory: 140786688 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 402 steps to complete.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 10.9448ms to assign 18 blocks to 402 nodes requiring 1073777152 bytes.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Total Activation Memory: 1073776640 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Total Weights Memory: 1291661568 bytes
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Compiler backend is used during engine execution.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] Engine generation completed in 3.35634 seconds.
tts-1 | [10/09/2025-13:55:24] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 1232 MiB
tts-1 | [10/09/2025-13:55:25] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:06
tts-1 | [10/09/2025-13:55:25] [TRT] [I] Serialized 27 bytes of code generator cache.
tts-1 | [10/09/2025-13:55:25] [TRT] [I] Serialized 131417 bytes of compilation cache.
tts-1 | [10/09/2025-13:55:25] [TRT] [I] Serialized 9 timing cache entries
tts-1 | [10/09/2025-13:55:25] [TRT-LLM] [I] Timing cache serialized to model.cache
tts-1 | [10/09/2025-13:55:25] [TRT-LLM] [I] Build phase peak memory: 9575.57 MB, children: 16.25 MB
tts-1 | [10/09/2025-13:55:25] [TRT-LLM] [I] Serializing engine to ./trt_engines_bfloat16/rank0.engine...
tts-1 | [10/09/2025-13:55:26] [TRT-LLM] [I] Engine serialized. Total time: 00:00:00
tts-1 | [10/09/2025-13:55:26] [TRT-LLM] [I] Total time of building all engines: 00:00:11
tts-1 | Testing TensorRT engines
tts-1 | :1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
tts-1 | :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
tts-1 | 2025-10-09 13:55:31,682 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
tts-1 | /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2330: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
tts-1 | If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
tts-1 | warnings.warn(
tts-1 | [TensorRT-LLM] TensorRT-LLM version: 0.20.0
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [V] Input token ids (batch_size = 1):
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [V] Request 0: [158227, 56568, 52801, 37945, 56007, 56568, 99882, 99217, 81596, 11319, 158228]
tts-1 | [TensorRT-LLM][INFO] Engine version 0.20.0 found in the config file, assuming engine(s) built by new builder API.
tts-1 | [TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set dtype to bfloat16.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set bert_attention_plugin to auto.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set explicitly_disable_gemm_plugin to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set qserve_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set identity_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set nccl_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set lora_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set dora_plugin to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set smooth_quant_plugins to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set moe_plugin to auto.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set context_fmha to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set paged_kv_cache to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set remove_input_padding to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set norm_quant_fusion to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set reduce_fusion to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set user_buffer to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set tokens_per_block to 32.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set use_paged_context_fmha to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set fuse_fp4_quant to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set multiple_profiles to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set paged_state to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set streamingllm to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set manage_weights to False.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set use_fused_mlp to True.
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Set pp_reduce_scatter to False.
tts-1 | [TensorRT-LLM][INFO] Engine version 0.20.0 found in the config file, assuming engine(s) built by new builder API.
tts-1 | [TensorRT-LLM][INFO] Refreshed the MPI local session
tts-1 | [TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
tts-1 | [TensorRT-LLM][INFO] Rank 0 is using GPU 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 24
tts-1 | [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 32768
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 32767 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
tts-1 | [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
tts-1 | [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
tts-1 | [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
tts-1 | [TensorRT-LLM][INFO] Loaded engine size: 1237 MiB
tts-1 | [TensorRT-LLM][INFO] Engine load time 405 ms
tts-1 | [TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
tts-1 | [TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1024.03 MiB for execution context memory.
tts-1 | [TensorRT-LLM][INFO] gatherContextLogits: 0
tts-1 | [TensorRT-LLM][INFO] gatherGenerationLogits: 0
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1231 (MiB)
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.35 MB GPU memory for runtime buffers.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.98 MB GPU memory for decoder.
tts-1 | [TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.99 GiB, available: 20.21 GiB, extraCostMemory: 0.00 GiB
tts-1 | [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 33110
tts-1 | [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
tts-1 | [TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
tts-1 | [TensorRT-LLM][INFO] Max KV cache pages per sequence: 1024 [window size=32768]
tts-1 | [TensorRT-LLM][INFO] Number of tokens per block: 32.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 12.13 GiB for max tokens in paged KV cache (1059520).
tts-1 | [10/09/2025-13:55:32] [TRT-LLM] [I] Load engine takes: 0.9719598293304443 sec
tts-1 | Input [Text 0]: "<|sos|>你好,请问你叫什么?<|task_id|>"
tts-1 | Output [Text 0]: "<|s_2032|><|s_3914|><|s_539|><|s_4859|><|s_5112|><|s_6084|><|s_6329|><|s_4217|><|s_680|><|s_2666|><|s_1275|><|s_4433|><|s_2274|><|s_836|><|s_1397|><|s_3590|><|s_5087|><|s_4568|><|s_2292|><|s_3912|><|s_512|><|s_1295|><|s_546|><|s_2265|><|s_4538|><|s_5755|><|s_2867|><|s_2954|><|s_6048|><|s_846|><|s_1180|><|s_4315|><|s_2125|><|s_575|><|s_2678|><|s_5032|><|s_3411|>"
tts-1 | [10/09/2025-13:55:33] [TRT-LLM] [V] [153695, 155577, 152202, 156522, 156775, 157747, 157992, 155880, 152343, 154329, 152938, 156096, 153937, 152499, 153060, 155253, 156750, 156231, 153955, 155575, 152175, 152958, 152209, 153928, 156201, 157418, 154530, 154617, 157711, 152509, 152843, 155978, 153788, 152238, 154341, 156695, 155074]
tts-1 | [TensorRT-LLM][INFO] Refreshed the MPI local session
tts-1 | Creating model repository
tts-1 | Starting Triton server
tts-1 | I1009 13:55:37.095235 2432 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x204c00000' with size 268435456"
tts-1 | I1009 13:55:37.095499 2432 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
tts-1 | I1009 13:55:37.102713 2432 model_lifecycle.cc:473] "loading: audio_tokenizer:1"
tts-1 | I1009 13:55:37.102776 2432 model_lifecycle.cc:473] "loading: cosyvoice2:1"
tts-1 | I1009 13:55:37.102817 2432 model_lifecycle.cc:473] "loading: speaker_embedding:1"
tts-1 | I1009 13:55:37.102886 2432 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
tts-1 | I1009 13:55:37.102938 2432 model_lifecycle.cc:473] "loading: token2wav:1"
tts-1 | I1009 13:55:38.363103 2432 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
tts-1 | I1009 13:55:38.363189 2432 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
tts-1 | I1009 13:55:38.363197 2432 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
tts-1 | I1009 13:55:38.363203 2432 libtensorrtllm.cc:86] "backend configuration:\n{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}"
tts-1 | [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
tts-1 | [TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
tts-1 | I1009 13:55:38.375040 2432 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
tts-1 | [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
tts-1 | [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
tts-1 | [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
tts-1 | [TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
tts-1 | [TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
tts-1 | [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
tts-1 | [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
tts-1 | [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
tts-1 | [TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
tts-1 | [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
tts-1 | [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
tts-1 | [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
tts-1 | [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
tts-1 | [TensorRT-LLM][INFO] num_nodes is not specified, will be set to 1
tts-1 | [TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
tts-1 | [TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
tts-1 | [TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
tts-1 | [TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
tts-1 | [TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
tts-1 | [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa, redrafter, lookahead, eagle}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
tts-1 | [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
tts-1 | [TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
tts-1 | [TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
tts-1 | [TensorRT-LLM][INFO] Engine version 0.20.0 found in the config file, assuming engine(s) built by new builder API.
tts-1 | [TensorRT-LLM][INFO] Initializing MPI with thread mode 3
tts-1 | [TensorRT-LLM][INFO] Initialized MPI
tts-1 | [TensorRT-LLM][INFO] Refreshed the MPI local session
tts-1 | [TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
tts-1 | [TensorRT-LLM][INFO] Rank 0 is using GPU 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2560) * 24
tts-1 | [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
tts-1 | [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 32768
tts-1 | [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 32767 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
tts-1 | [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
tts-1 | [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
tts-1 | [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
tts-1 | [TensorRT-LLM][INFO] Loaded engine size: 1237 MiB
tts-1 | [TensorRT-LLM][INFO] Engine load time 623 ms
tts-1 | [TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
tts-1 | [TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1024.03 MiB for execution context memory.
tts-1 | [TensorRT-LLM][INFO] gatherContextLogits: 0
tts-1 | [TensorRT-LLM][INFO] gatherGenerationLogits: 0
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1231 (MiB)
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 14.26 MB GPU memory for runtime buffers.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 47.66 MB GPU memory for decoder.
tts-1 | [TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 23.99 GiB, available: 20.09 GiB, extraCostMemory: 0.00 GiB
tts-1 | [TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
tts-1 | [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 80
tts-1 | [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
tts-1 | [TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
tts-1 | [TensorRT-LLM][INFO] Max KV cache pages per sequence: 1024 [window size=2560]
tts-1 | [TensorRT-LLM][INFO] Number of tokens per block: 32.
tts-1 | [TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.03 GiB for max tokens in paged KV cache (2560).
tts-1 | [TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
tts-1 | [TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
tts-1 | I1009 13:55:39.395884 2432 libtensorrtllm.cc:184] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_0_0"
tts-1 | I1009 13:55:39.396683 2432 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm'"
tts-1 | I1009 13:55:39.603512 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: speaker_embedding_0_0 (CPU device 0)"
tts-1 | I1009 13:55:39.615976 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: audio_tokenizer_0_0 (CPU device 0)"
tts-1 | 2025-10-09 13:55:40,479 INFO Converting onnx to trt...
tts-1 | [10/09/2025-13:55:40] [TRT] [I] [MemUsageChange] Init CUDA: CPU +29, GPU +0, now: CPU 150, GPU 1571 (MiB)
tts-1 | I1009 13:55:41.531841 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: token2wav_0_0 (CPU device 0)"
tts-1 | [10/09/2025-13:55:41] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1749, GPU +8, now: CPU 2101, GPU 1579 (MiB)
tts-1 | I1009 13:55:42.055643 2432 model_lifecycle.cc:849] "successfully loaded 'audio_tokenizer'"
tts-1 | 2025-10-09 13:55:42,470 INFO Initializing vocoder from ./CosyVoice2-0.5B on cuda:0
tts-1 | I1009 13:55:42.838400 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: cosyvoice2_0_0 (CPU device 0)"
tts-1 | I1009 13:55:42.838535 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: cosyvoice2_0_1 (CPU device 0)"
tts-1 | I1009 13:55:42.838745 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: cosyvoice2_0_2 (CPU device 0)"
tts-1 | I1009 13:55:42.838940 2432 python_be.cc:2289] "TRITONBACKEND_ModelInstanceInitialize: cosyvoice2_0_3 (CPU device 0)"
tts-1 | [10/09/2025-13:55:43] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
tts-1 | Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
tts-1 | I1009 13:55:46.207803 2432 model.py:68] "model_params:{'model_dir': './CosyVoice2-0.5B', 'llm_tokenizer_dir': './cosyvoice2_llm'}"
tts-1 | I1009 13:55:46.208030 2432 model.py:70] "Using dynamic chunk strategy: exponential"
tts-1 | I1009 13:55:46.235910 2432 model.py:68] "model_params:{'model_dir': './CosyVoice2-0.5B', 'llm_tokenizer_dir': './cosyvoice2_llm'}"
tts-1 | I1009 13:55:46.236112 2432 model.py:70] "Using dynamic chunk strategy: exponential"
tts-1 | I1009 13:55:46.677843 2432 model.py:68] "model_params:{'model_dir': './CosyVoice2-0.5B', 'llm_tokenizer_dir': './cosyvoice2_llm'}"
tts-1 | I1009 13:55:46.678040 2432 model.py:70] "Using dynamic chunk strategy: exponential"
tts-1 | /usr/local/lib/python3.12/dist-packages/diffusers/models/lora.py:393: FutureWarning: LoRACompatibleLinear is deprecated and will be removed in version 1.0.0. Use of LoRACompatibleLinear is deprecated. Please switch to PEFT backend by installing PEFT: pip install peft.
tts-1 | deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
tts-1 | I1009 13:55:47.228192 2432 model.py:68] "model_params:{'model_dir': './CosyVoice2-0.5B', 'llm_tokenizer_dir': './cosyvoice2_llm'}"
tts-1 | I1009 13:55:47.228357 2432 model.py:70] "Using dynamic chunk strategy: exponential"
tts-1 | 2025-10-09 13:55:47,523 INFO input frame rate=25
tts-1 | I1009 13:55:47.539043 2432 model_lifecycle.cc:849] "successfully loaded 'cosyvoice2'"
tts-1 | /usr/local/lib/python3.12/dist-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
tts-1 | WeightNorm.apply(module, name, dim)
tts-1 | 2025-10-09 13:55:49,329 INFO Converting onnx to trt...
tts-1 | [10/09/2025-13:55:49] [TRT] [I] [MemUsageChange] Init CUDA: CPU -2, GPU +0, now: CPU 3520, GPU 2085 (MiB)
tts-1 | [10/09/2025-13:55:50] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1168, GPU +8, now: CPU 4487, GPU 2093 (MiB)
tts-1 | [10/09/2025-13:55:53] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
tts-1 | [10/09/2025-13:55:57] [TRT] [I] Compiler backend is used during engine build.
tts-1 | [10/09/2025-13:56:14] [TRT] [I] Compiler backend is used during engine build.
tts-1 | [10/09/2025-13:56:28] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_1_0/block1/block/block_0/Pad_1290_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:29] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_2_0/block1/block/block_0/Pad_1865_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:31] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_3_0/block1/block/block_0/Pad_2440_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:33] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_4_0/block1/block/block_0/Pad_3015_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:34] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_5_0/block1/block/block_0/Pad_3590_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:36] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
tts-1 | [10/09/2025-13:56:36] [TRT] [I] Detected 1 inputs and 1 output network tensors.
tts-1 | [10/09/2025-13:56:36] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_6_0/block1/block/block_0/Pad_4165_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:37] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_7_0/block1/block/block_0/Pad_4740_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:38] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_8_0/block1/block/block_0/Pad_5315_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Total Host Persistent Memory: 1427600 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Total Device Persistent Memory: 0 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Max Scratch Memory: 9216000 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1485 steps to complete.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 111.264ms to assign 9 blocks to 1485 nodes requiring 66434048 bytes.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Total Activation Memory: 66433024 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Total Weights Memory: 27425024 bytes
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Compiler backend is used during engine execution.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Engine generation completed in 55.7007 seconds.
tts-1 | [10/09/2025-13:56:39] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 370 MiB
tts-1 | [10/09/2025-13:56:39] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_9_0/block1/block/block_0/Pad_5890_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | 2025-10-09 13:56:39,330 INFO Succesfully convert onnx to trt...
tts-1 | [10/09/2025-13:56:39] [TRT] [I] Loaded engine size: 30 MiB
tts-1 | [10/09/2025-13:56:39] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +2, GPU +63, now: CPU 2, GPU 89 (MiB)
tts-1 | I1009 13:56:39.660899 2432 model_lifecycle.cc:849] "successfully loaded 'speaker_embedding'"
tts-1 | [10/09/2025-13:56:40] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_10_0/block1/block/block_0/Pad_6465_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:41] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/mid_blocks_11_0/block1/block/block_0/Pad_7040_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:42] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [api_compile.cpp:validate_copy_operation:4682] Slice operation "/up_blocks_0_0/block1/block/block_0/Pad_7653_slice" has incorrect fill value type, slice op requires fill value type to be same as its input.
tts-1 | [10/09/2025-13:56:57] [TRT] [I] Detected 6 inputs and 1 output network tensors.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Total Host Persistent Memory: 288976 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Total Device Persistent Memory: 2048 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Max Scratch Memory: 324924416 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 259 steps to complete.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 6.05917ms to assign 10 blocks to 259 nodes requiring 346475008 bytes.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Total Activation Memory: 346473984 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Total Weights Memory: 142611776 bytes
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Compiler backend is used during engine execution.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Engine generation completed in 66.5267 seconds.
tts-1 | [10/09/2025-13:56:59] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 75 MiB, GPU 2308 MiB
tts-1 | 2025-10-09 13:56:59,696 INFO Succesfully convert onnx to trt...
tts-1 | [10/09/2025-13:56:59] [TRT] [I] Loaded engine size: 161 MiB
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MS] Running engine with multi stream info
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MS] Number of aux streams is 1
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MS] Number of total worker streams is 2
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MS] The main stream provided by execute/enqueue calls is the first worker stream
tts-1 | [10/09/2025-13:57:00] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +330, now: CPU 0, GPU 466 (MiB)
tts-1 | 2025-10-09 13:57:00,615 INFO Token2Wav initialized successfully
tts-1 | I1009 13:57:00.618432 2432 model_lifecycle.cc:849] "successfully loaded 'token2wav'"
tts-1 | I1009 13:57:00.618974 2432 server.cc:611]
tts-1 | +------------------+------+
tts-1 | | Repository Agent | Path |
tts-1 | +------------------+------+
tts-1 | +------------------+------+
tts-1 |
tts-1 | I1009 13:57:00.619037 2432 server.cc:638]
tts-1 | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 | | Backend | Path | Config |
tts-1 | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 | | python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
tts-1 | | tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
tts-1 | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 |
tts-1 | I1009 13:57:00.619077 2432 server.cc:681]
tts-1 | +-------------------+---------+--------+
tts-1 | | Model | Version | Status |
tts-1 | +-------------------+---------+--------+
tts-1 | | audio_tokenizer | 1 | READY |
tts-1 | | cosyvoice2 | 1 | READY |
tts-1 | | speaker_embedding | 1 | READY |
tts-1 | | tensorrt_llm | 1 | READY |
tts-1 | | token2wav | 1 | READY |
tts-1 | +-------------------+---------+--------+
tts-1 |
tts-1 | I1009 13:57:00.676004 2432 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090"
tts-1 | I1009 13:57:00.678383 2432 metrics.cc:783] "Collecting CPU metrics"
tts-1 | I1009 13:57:00.678781 2432 tritonserver.cc:2598]
tts-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 | | Option | Value |
tts-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 | | server_id | triton |
tts-1 | | server_version | 2.59.0 |
tts-1 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
tts-1 | | model_repository_path[0] | ./model_repo_cosyvoice2 |
tts-1 | | model_control_mode | MODE_NONE |
tts-1 | | strict_model_config | 0 |
tts-1 | | model_config_name | |
tts-1 | | rate_limit | OFF |
tts-1 | | pinned_memory_pool_byte_size | 268435456 |
tts-1 | | cuda_memory_pool_byte_size{0} | 67108864 |
tts-1 | | min_supported_compute_capability | 6.0 |
tts-1 | | strict_readiness | 1 |
tts-1 | | exit_timeout | 30 |
tts-1 | | cache_enabled | 0 |
tts-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
tts-1 |
tts-1 | I1009 13:57:00.692860 2432 grpc_server.cc:2562] "Started GRPCInferenceService at 0.0.0.0:8001"
tts-1 | I1009 13:57:00.693159 2432 http_server.cc:4832] "Started HTTPService at 0.0.0.0:8000"
tts-1 | I1009 13:57:00.734753 2432 http_server.cc:358] "Started Metrics Service at 0.0.0.0:8002"
tts-1 | W1009 13:57:30.688489 2432 python_be.cc:1759] "Failed to share CUDA memory pool with stub process: Failed to open the cudaIpcHandle. error: invalid resource handle. Will use CUDA IPC."
tts-1 | I1009 13:57:30.688588 2432 pb_stub.cc:1464] "Failed to initialize CUDA shared memory pool in Python stub: Failed to open the cudaIpcHandle. error: invalid resource handle"
tts-1 | E1009 13:57:30.688810 2432 pb_stub.cc:428] "An error occurred while trying to load GPU buffers in the Python backend stub: Failed to open the cudaIpcHandle. error: invalid resource handle\n"
tts-1 | W1009 13:57:30.837214 2432 python_be.cc:1759] "Failed to share CUDA memory pool with stub process: Failed to open the cudaIpcHandle. error: invalid resource handle. Will use CUDA IPC."
tts-1 | I1009 13:57:30.837186 2432 pb_stub.cc:1464] "Failed to initialize CUDA shared memory pool in Python stub: Failed to open the cudaIpcHandle. error: invalid resource handle"
tts-1 | E1009 13:57:30.837991 2432 pb_stub.cc:736] "Failed to process the request(s) for model 'cosyvoice2_0_1', message: TritonModelException: Failed to open the cudaIpcHandle. error: invalid resource handle\n\nAt:\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(193): forward_audio_tokenizer\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(336): execute\n"
tts-1 | E1009 13:57:34.597431 2432 pb_stub.cc:736] "Failed to process the request(s) for model 'audio_tokenizer_0_0', message: RuntimeError: CUDA error: invalid resource handle\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n\n\nAt:\n /usr/local/lib/python3.12/dist-packages/torch/nn/functional.py(5209): pad\n /usr/local/lib/python3.12/dist-packages/torch/functional.py(714): stft\n /usr/local/lib/python3.12/dist-packages/s3tokenizer/utils.py(258): log_mel_spectrogram\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/audio_tokenizer/1/model.py(82): execute\n"
tts-1 | E1009 13:57:34.597715 2432 pb_stub.cc:736] "Failed to process the request(s) for model 'cosyvoice2_0_3', message: TritonModelException: Failed to process the request(s) for model 'audio_tokenizer_0_0', message: RuntimeError: CUDA error: invalid resource handle\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with TORCH_USE_CUDA_DSA to enable device-side assertions.\n\n\nAt:\n /usr/local/lib/python3.12/dist-packages/torch/nn/functional.py(5209): pad\n /usr/local/lib/python3.12/dist-packages/torch/functional.py(714): stft\n /usr/local/lib/python3.12/dist-packages/s3tokenizer/utils.py(258): log_mel_spectrogram\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/audio_tokenizer/1/model.py(82): execute\n\n\nAt:\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(195): forward_audio_tokenizer\n /workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(336): execute\n"
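
The first failure above is the Python backend stub failing to open a cudaIpcHandle. The cudaIpc* APIs are how Triton hands GPU buffers between the server and its Python stub processes, and CUDA IPC has historically been unsupported under WSL2, so this looks environment-related rather than a code bug. A quick standalone check of whether CUDA IPC works at all here (a minimal sketch, assuming torch is importable inside the container; torch.multiprocessing shares CUDA tensors through the same IPC handles):

```python
# ipc_check.py - passing a CUDA tensor to a child process requires
# cudaIpcGetMemHandle/cudaIpcOpenMemHandle. Where CUDA IPC is unsupported
# (e.g. under WSL2), this is expected to fail with an error much like the
# "invalid resource handle" seen in the server log.
import torch
import torch.multiprocessing as mp

def child(q):
    t = q.get()                      # opening the IPC handle happens here
    print("child received:", t.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue()
    p = mp.Process(target=child, args=(q,))
    p.start()
    q.put(torch.ones(4, device="cuda"))  # serialized as a CUDA IPC handle
    p.join()
```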

Below is the RPC error log:

Initializing gRPC client for streaming mode...
task-0: Initializing sync client for streaming...
task-0: Starting streaming processing for 1 items.
task-0: Processing item 0/1
Received InferenceServerException: Failed to process the request(s) for model 'cosyvoice2_0_3', message: TritonModelException: Failed to process the request(s) for model 'audio_tokenizer_0_0', message: RuntimeError: CUDA error: invalid resource handle
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

At:
/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py(5209): pad
/usr/local/lib/python3.12/dist-packages/torch/functional.py(714): stft
/usr/local/lib/python3.12/dist-packages/s3tokenizer/utils.py(258): log_mel_spectrogram
/workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/audio_tokenizer/1/model.py(82): execute

At:
/workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(195): forward_audio_tokenizer
/workspace/CosyVoice/runtime/triton_trtllm/model_repo_cosyvoice2/cosyvoice2/1/model.py(336): execute

task-0: Item 0 failed.
task-0: Closing sync client...
task-0: Finished streaming processing. Total duration synthesized: 0.0000s
Total synthesized duration is zero. Cannot calculate RTF or latency percentiles.
Mode: streaming
RTF: inf
total_duration: 0.000 seconds (0.00 hours)
processing time: 0.035 seconds (0.00 hours)
No latency data collected.

Initializing temporary async client for fetching stats...
Fetching inference statistics...
Fetching model config...
Could not retrieve statistics or config: 'ns'
Closing temporary async stats client...
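
Both failing stacks bottom out in torch.stft / F.pad inside s3tokenizer's log_mel_spectrogram, i.e. the very first GPU kernels launched by the stub process. To separate a broken per-process CUDA context from a problem in the audio path itself, the equivalent op can be run directly inside the container (a sketch assuming the same torch install):

```python
# If plain torch.stft on the GPU succeeds in a shell inside the container
# but still fails inside the Triton Python stub, the problem is the stub's
# CUDA context / IPC setup, not the audio tokenizer code.
import torch

wav = torch.randn(16000, device="cuda")   # 1 s of synthetic 16 kHz audio
window = torch.hann_window(400, device="cuda")
spec = torch.stft(wav, n_fft=400, hop_length=160,
                  window=window, return_complex=True)
print(spec.shape)                          # torch.Size([201, 101])
```

The traceback's own suggestion of CUDA_LAUNCH_BLOCKING=1 can also be set in the compose environment to get an accurate failing line.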
