Update README.md

README.md CHANGED:

````diff
@@ -248,13 +248,10 @@ print(outputs[0].outputs[0].text)
 
 ### **Use it with vLLM**
 
-The snippet below shows how to use this model with vLLM. Use the
+The snippet below shows how to use this model with vLLM. Use the latest version of vLLM and follow these instructions to build and install vLLM.
 
 ```shell
-
-cd vllm
-git checkout bf756321c72340466911b64602e88013d0210c1c
-VLLM_USE_PRECOMPILED=1 pip install -e .
+pip install -U "vllm>=0.10.1"
 ```
 
 Now you can run the server with:
@@ -265,6 +262,7 @@ vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
     --mamba_ssm_cache_dtype float32
 ```
 
+
 Note: Remember to add `--mamba_ssm_cache_dtype float32` to preserve accuracy. Without this option, the model's accuracy may degrade.
 
 #### Using Budget Control with a vLLM Server
````
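Once `vllm serve` is running, it exposes an OpenAI-compatible HTTP API. The sketch below builds a chat-completions request body for that API; the endpoint URL assumes vLLM's default port 8000, and the prompt text is illustrative.

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The model name matches the one passed to `vllm serve` above.
payload = {
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64,
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the server started above to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```

Any OpenAI-compatible client library can be pointed at the same endpoint instead of building requests by hand.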