Commit b1c27cd by meenchen · Parent: 42d5f6f

update for min latency server

Files changed (1): README.md (+46 -1)
README.md CHANGED
@@ -88,7 +88,7 @@ This model was obtained by quantizing the weights and activations of DeepSeek R1
 
 To deploy the quantized FP4 checkpoint with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) LLM API, follow the sample code below (you need 8xB200 GPUs and TensorRT-LLM built from source with the latest main branch):
 
- * LLM API sample usage:
+ #### LLM API sample usage:
 ```
 from tensorrt_llm import SamplingParams
 from tensorrt_llm._torch import LLM
@@ -120,6 +120,51 @@ if __name__ == '__main__':
 
 ```
 
+
+ #### Minimum Latency Server Deployment
+
+
+ **Step 1: Create configuration file (`args.yaml`)**
+
+ ```yaml
+ moe_backend: TRTLLM
+ use_cuda_graph: true
+ speculative_config:
+   decoding_type: MTP
+   num_nextn_predict_layers: 3
+   use_relaxed_acceptance_for_thinking: true
+   relaxed_topk: 10
+   relaxed_delta: 0.6
+ ```
+
+ **Step 2: Start the TensorRT-LLM server**
+
+ ```bash
+ trtllm-serve nvidia/DeepSeek-R1-0528-FP4 \
+   --host 0.0.0.0 \
+   --port 8000 \
+   --backend pytorch \
+   --max_batch_size 4 \
+   --tp_size 8 \
+   --ep_size 2 \
+   --max_num_tokens 32768 \
+   --trust_remote_code \
+   --extra_llm_api_options args.yaml \
+   --kv_cache_free_gpu_memory_fraction 0.75
+ ```
+
+ **Step 3: Send an example query**
+
+ ```bash
+ curl localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "nvidia/DeepSeek-R1-0528-FP4",
+     "messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
+     "max_tokens": 1024
+   }'
+ ```
+
 ### Evaluation
 The accuracy benchmark results are presented in the table below:
 <table>
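
Most of the LLM API sample sits outside the diff context above, so only its opening imports and the closing `if __name__ == '__main__':` guard are visible. For orientation, here is a minimal sketch of a complete script in that style; everything beyond the two visible imports (the prompt, the `SamplingParams` arguments, the `LLM` constructor arguments, the `main` helper) is an assumption, not the README's actual code:

```python
# Hypothetical sketch, not the README's elided code.
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM


def main():
    # Assumed example prompt and generation settings.
    prompts = ["Why is NVIDIA a great company?"]
    sampling_params = SamplingParams(max_tokens=1024)

    # tensor_parallel_size=8 matches the stated 8xB200 requirement.
    llm = LLM(model="nvidia/DeepSeek-R1-0528-FP4", tensor_parallel_size=8)

    # generate() returns one result per prompt; print the generated text.
    for output in llm.generate(prompts, sampling_params):
        print(output.outputs[0].text)


if __name__ == '__main__':
    main()
```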
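
On the Step 1 configuration: `decoding_type: MTP` enables speculative decoding with DeepSeek's Multi-Token Prediction heads, and `num_nextn_predict_layers: 3` drafts three tokens per decoding step. `use_relaxed_acceptance_for_thinking` with `relaxed_topk: 10` and `relaxed_delta: 0.6` loosens draft-token acceptance during the model's thinking phase (roughly: a draft token can be accepted if it falls within the target model's top-10 candidates inside a log-probability margin of 0.6), trading strict token-by-token equivalence for a higher acceptance rate. CUDA graphs and the `TRTLLM` MoE backend reduce kernel-launch overhead, which dominates at the small batch sizes (`--max_batch_size 4`) this minimum-latency recipe targets.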
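
On Step 2: with `--tp_size 8` the model shards across all eight GPUs, and loading the FP4 checkpoint can take several minutes before the port accepts requests. A small sketch for waiting on readiness, assuming the `/health` and `/v1/models` endpoints of the OpenAI-compatible `trtllm-serve` server:

```bash
# Poll until the server reports healthy, then list the served model id.
# Assumes trtllm-serve's OpenAI-compatible /health and /v1/models endpoints.
until curl -sf localhost:8000/health > /dev/null; do
  echo "waiting for server..."
  sleep 10
done
curl -s localhost:8000/v1/models
```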
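
Since `trtllm-serve` exposes the OpenAI chat-completions protocol, the Step 3 query can also be issued from Python. A minimal sketch with the `openai` client package; the `api_key` value is a placeholder that the local server does not check:

```python
from openai import OpenAI

# Point the client at the local trtllm-serve endpoint from Step 2.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="nvidia/DeepSeek-R1-0528-FP4",
    messages=[{"role": "user", "content": "Why is NVIDIA a great company?"}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```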