
Commit 9f2dc30

[None] [doc] Update DeepSeek example doc (#7358)
Signed-off-by: jiahanc <[email protected]>
1 parent b3c57a7 commit 9f2dc30

File tree

1 file changed (+78, -18 lines)


examples/models/core/deepseek_v3/README.md

Lines changed: 78 additions & 18 deletions
@@ -28,6 +28,11 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [Evaluation](#evaluation)
 - [Serving](#serving)
   - [trtllm-serve](#trtllm-serve)
+    - [B200 FP4 min-latency config](#b200-fp4-min-latency-config)
+    - [B200 FP4 max-throughput config](#b200-fp4-max-throughput-config)
+    - [B200 FP8 min-latency config](#b200-fp8-min-latency-config)
+    - [B200 FP8 max-throughput config](#b200-fp8-max-throughput-config)
+    - [Launch trtllm-serve OpenAI-compatible API server](#launch-trtllm-serve-openai-compatible-api-server)
   - [Disaggregated Serving](#disaggregated-serving)
   - [Dynamo](#dynamo)
   - [tensorrtllm\_backend for triton inference server (Prototype)](#tensorrtllm_backend-for-triton-inference-server-prototype)
@@ -228,56 +233,111 @@ trtllm-eval --model <YOUR_MODEL_DIR> \
 ## Serving
 ### trtllm-serve
 
-Take max-throughput scenario on B200 as an example, the settings are extracted from the [blog](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#b200-max-throughput). **For users' own models and cases, the specific settings could be different to get best performance.**
+Below are example B200 serving configurations for both min-latency and max-throughput in FP4 and FP8. If you want to explore configurations, see the [blog](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md). **Treat these as starting points; tune for your model and workload to achieve the best performance.**
 
 To serve the model using `trtllm-serve`:
 
+#### B200 FP4 min-latency config
+```bash
+cat >./extra-llm-api-config.yml <<EOF
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 1024
+enable_attention_dp: false
+kv_cache_config:
+  dtype: fp8
+stream_interval: 10
+EOF
+```
+
+#### B200 FP4 max-throughput config
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
   enable_padding: true
   batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
-print_iter_log: true
+  - 1024
+  - 896
+  - 512
+  - 256
+  - 128
+  - 64
+  - 32
+  - 16
+  - 8
+  - 4
+  - 2
+  - 1
+kv_cache_config:
+  dtype: fp8
+stream_interval: 10
 enable_attention_dp: true
 EOF
+```
+
+#### B200 FP8 min-latency config
+```bash
+cat >./extra-llm-api-config.yml <<EOF
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 1024
+enable_attention_dp: false
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+max_num_tokens: 37376
+EOF
+```
 
+#### B200 FP8 max-throughput config
+```bash
+cat >./extra-llm-api-config.yml <<EOF
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 512
+enable_attention_dp: true
+kv_cache_config:
+  dtype: fp8
+  free_gpu_memory_fraction: 0.8
+stream_interval: 10
+moe_config:
+  backend: DEEPGEMM
+EOF
+```
+#### Launch trtllm-serve OpenAI-compatible API server
+```bash
 trtllm-serve \
-  deepseek-ai/DeepSeek-V3 \
+  deepseek-ai/DeepSeek-R1 \
   --host localhost \
   --port 8000 \
   --backend pytorch \
-  --max_batch_size 384 \
-  --max_num_tokens 1536 \
+  --max_batch_size 1024 \
+  --max_num_tokens 8192 \
   --tp_size 8 \
   --ep_size 8 \
   --pp_size 1 \
-  --kv_cache_free_gpu_memory_fraction 0.85 \
+  --kv_cache_free_gpu_memory_fraction 0.9 \
   --extra_llm_api_options ./extra-llm-api-config.yml
 ```
+It is possible to see OOM issues with some configs. As a workaround, consider reducing `kv_cache_free_gpu_memory_fraction` to a smaller value; we are investigating and working to address the problem. If you are using the max-throughput config, reduce `max_num_tokens` to `3072` to avoid OOM issues.
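A minimal sketch of that workaround, applied to the max-throughput launch above: `3072` is the value suggested in the note, while `0.8` is only an illustrative lower fraction, not a recommended setting.

```bash
# Relaunch with reduced memory pressure: lower max_num_tokens per the note above,
# and a smaller KV-cache memory fraction (0.8 here is an example value only).
trtllm-serve \
  deepseek-ai/DeepSeek-R1 \
  --host localhost \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 1024 \
  --max_num_tokens 3072 \
  --tp_size 8 \
  --ep_size 8 \
  --pp_size 1 \
  --kv_cache_free_gpu_memory_fraction 0.8 \
  --extra_llm_api_options ./extra-llm-api-config.yml
```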
 
 To query the server, you can start with a `curl` command:
 ```bash
 curl http://localhost:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{
-    "model": "deepseek-ai/DeepSeek-V3",
+    "model": "deepseek-ai/DeepSeek-R1",
     "prompt": "Where is New York?",
     "max_tokens": 16,
     "temperature": 0
   }'
 ```
 
-For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.
+For DeepSeek-R1 FP4, use the model name `nvidia/DeepSeek-R1-FP4-v2`.
+For DeepSeek-V3, use the model name `deepseek-ai/DeepSeek-V3`.
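For example, assuming the server was launched with the FP4 checkpoint instead of `deepseek-ai/DeepSeek-R1`, only the `model` field in the request body changes:

```bash
# Same completion request as above, pointed at the FP4 checkpoint name.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/DeepSeek-R1-FP4-v2",
    "prompt": "Where is New York?",
    "max_tokens": 16,
    "temperature": 0
  }'
```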
 
 ### Disaggregated Serving
 