nvidia/DeepSeek-R1-0528-FP4

Text Generation

Model Optimizer


meenchen committed on Jun 9

Commit b1c27cd · 1 Parent(s): 42d5f6f

update for min latency server

Files changed (1)

README.md +46 -1


@@ -88,7 +88,7 @@ This model was obtained by quantizing the weights and activations of DeepSeek R1
 To deploy the quantized FP4 checkpoint with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) LLM API, follow the sample codes below (you need 8xB200 GPU and TensorRT LLM built from source with the latest main branch):
-* LLM API sample usage:
 ```
 from tensorrt_llm import SamplingParams
 from tensorrt_llm._torch import LLM
@@ -120,6 +120,51 @@ if __name__ == '__main__':
 ```
 ### Evaluation
 The accuracy benchmark results are presented in the table below:
 <table>

 To deploy the quantized FP4 checkpoint with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) LLM API, follow the sample codes below (you need 8xB200 GPU and TensorRT LLM built from source with the latest main branch):
+#### LLM API sample usage:
 ```
 from tensorrt_llm import SamplingParams
 from tensorrt_llm._torch import LLM
 ```
+#### Minimum Latency Server Deployment
+**Step 1: Create configuration file (`args.yaml`)**
+```yaml
+moe_backend: TRTLLM
+use_cuda_graph: true
+speculative_config:
+  decoding_type: MTP
+  num_nextn_predict_layers: 3
+  use_relaxed_acceptance_for_thinking: true
+  relaxed_topk: 10
+  relaxed_delta: 0.6
+```
+**Step 2: Start the TensorRT-LLM server**
+```bash
+trtllm-serve nvidia/DeepSeek-R1-0528-FP4 \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --backend pytorch \
+  --max_batch_size 4 \
+  --tp_size 8 \
+  --ep_size 2 \
+  --max_num_tokens 32768 \
+  --trust_remote_code \
+  --extra_llm_api_options args.yaml \
+  --kv_cache_free_gpu_memory_fraction 0.75
+```
+**Step 3: Send an example query**
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nvidia/DeepSeek-R1-0528-FP4",
+    "messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
+    "max_tokens": 1024
+  }'
+```
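The Step 3 query can also be sent from Python. The following is a minimal sketch using only the standard library, not part of the commit itself. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the sketch assumes the server from Step 2 is reachable at `http://localhost:8000` and returns an OpenAI-style chat completion body.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       base_url="http://localhost:8000",
                       model="nvidia/DeepSeek-R1-0528-FP4",
                       max_tokens=1024):
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and decode the JSON response body.

    Assumption: the server is already running and answers with JSON.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example usage (requires the server from Step 2 to be running):
#   reply = send_chat_request(build_chat_request("Why is NVIDIA a great company?"))
#   print(reply["choices"][0]["message"]["content"])  # assumes an OpenAI-style response shape
```

Building the request separately from sending it makes it possible to inspect the payload without a running server.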
 ### Evaluation
 The accuracy benchmark results are presented in the table below:
 <table>