Below are example B200 serving configurations for both min-latency and max-throughput in FP4 and FP8. If you want to explore configurations, see the [blog](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md). **Treat these as starting points—tune for your model and workload to achieve the best performance.**

To serve the model using `trtllm-serve`:

#### B200 FP4 min-latency config
```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
stream_interval: 10
EOF
```

#### B200 FP4 max-throughput config
```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  batch_sizes:
  - 1024
  - 896
  - 512
  - 256
  - 128
  - 64
  - 32
  - 16
  - 8
  - 4
  - 2
  - 1
kv_cache_config:
  dtype: fp8
stream_interval: 10
enable_attention_dp: true
EOF
```

#### B200 FP8 min-latency config
```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
moe_config:
  backend: DEEPGEMM
  max_num_tokens: 37376
EOF
```

#### B200 FP8 max-throughput config
```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: 512
enable_attention_dp: true
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
moe_config:
  backend: DEEPGEMM
EOF
```

#### Launch trtllm-serve OpenAI-compatible API server
You may see OOM issues with some configs. As a workaround, consider reducing `kv_cache_free_gpu_mem_fraction` to a smaller value; we are investigating and working to address the problem. If you are using a max-throughput config, also reduce `max_num_tokens` to `3072` to avoid OOM issues.
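
For reference, a typical launch points `trtllm-serve` at the model checkpoint and passes the YAML above via `--extra_llm_api_options`. The command below is only a sketch: the parallelism, batching, and memory values are illustrative assumptions rather than the tuned settings from the blog, and available flags can vary between TensorRT-LLM versions.

```bash
# Illustrative launch sketch only: adjust the model name, parallelism,
# and limits for your GPUs and workload.
trtllm-serve nvidia/DeepSeek-R1-FP4-v2 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --tp_size 8 \
    --ep_size 8 \
    --max_batch_size 1024 \
    --max_num_tokens 3072 \
    --kv_cache_free_gpu_memory_fraction 0.8 \
    --extra_llm_api_options ./extra-llm-api-config.yml
```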
To query the server, you can start with a `curl` command:
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
```
For DeepSeek-R1 FP4, use the model name `nvidia/DeepSeek-R1-FP4-v2`.
For DeepSeek-V3, use the model name `deepseek-ai/DeepSeek-V3`.
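
The server is OpenAI-compatible, so the chat endpoint works as well; a minimal sketch (same host and port as above, with the model name matching your checkpoint):

```bash
# Chat-style request against the same trtllm-serve instance.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Where is New York?"}],
        "max_tokens": 32
    }'
```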