Below are example B200 serving configurations for both min-latency and max-throughput in FP4 and FP8. If you want to explore configurations, see the [blog](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md). **Treat these as starting points—tune for your model and workload to achieve the best performance.**

To serve the model using `trtllm-serve`:

#### B200 FP4 min-latency config
```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
stream_interval: 10
EOF
```

#### B200 FP4 max-throughput config
```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  batch_sizes:
  - 1024
  - 896
  - 512
  - 256
  - 128
  - 64
  - 32
  - 16
  - 8
  - 4
  - 2
  - 1
kv_cache_config:
  dtype: fp8
stream_interval: 10
enable_attention_dp: true
EOF
```

#### B200 FP8 min-latency config
```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
enable_attention_dp: false
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
moe_config:
  backend: DEEPGEMM
  max_num_tokens: 37376
EOF
```

#### B200 FP8 max-throughput config
```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: 512
enable_attention_dp: true
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
moe_config:
  backend: DEEPGEMM
EOF
```

#### Launch trtllm-serve OpenAI-compatible API server
You may see OOM issues with some configs. As a workaround, consider reducing `kv_cache_free_gpu_mem_fraction` to a smaller value; we are investigating and working to address the problem. If you are using a max-throughput config, also reduce `max_num_tokens` to `3072` to avoid OOM issues.
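
For reference, a typical launch points `trtllm-serve` at the model checkpoint and passes the YAML above via `--extra_llm_api_options`. The command below is only a sketch: the parallelism, batching, and memory values are illustrative assumptions rather than the tuned settings from the blog, and available flags can vary between TensorRT-LLM versions.

```bash
# Illustrative launch sketch only: adjust the model name, parallelism,
# and limits for your GPUs and workload.
trtllm-serve nvidia/DeepSeek-R1-FP4-v2 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --tp_size 8 \
    --ep_size 8 \
    --max_batch_size 1024 \
    --max_num_tokens 3072 \
    --kv_cache_free_gpu_memory_fraction 0.8 \
    --extra_llm_api_options ./extra-llm-api-config.yml
```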
To query the server, you can start with a `curl` command:
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
```
For DeepSeek-R1 FP4, use the model name `nvidia/DeepSeek-R1-FP4-v2`.
For DeepSeek-V3, use the model name `deepseek-ai/DeepSeek-V3`.
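
The server is OpenAI-compatible, so the chat endpoint works as well; a minimal sketch (same host and port as above, with the model name matching your checkpoint):

```bash
# Chat-style request against the same trtllm-serve instance.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Where is New York?"}],
        "max_tokens": 32
    }'
```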