Text Generation · Transformers · Safetensors · PyTorch · nvidia · conversational
suhara committed
Commit bd0d6d5 · verified · 1 parent: a550406

Update README.md

Files changed (1):
  1. README.md (+3 −5)
README.md CHANGED

@@ -248,13 +248,10 @@ print(outputs[0].outputs[0].text)
 
 ### **Use it with vLLM**
 
-The snippet below shows how to use this model with vLLM. Use the following [commit](https://github.com/vllm-project/vllm/commit/75531a6c134282f940c86461b3c40996b4136793) and follow these instructions to build and install vLLM in a docker container.
+The snippet below shows how to use this model with vLLM. Use the latest version of vLLM and follow these instructions to build and install vLLM.
 
 ```shell
-git clone https://github.com/vllm-project/vllm.git
-cd vllm
-git checkout bf756321c72340466911b64602e88013d0210c1c
-VLLM_USE_PRECOMPILED=1 pip install -e .
+pip install -U "vllm>=0.10.1"
 ```
 
 Now you can run the server with:
@@ -265,6 +262,7 @@ vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
 --mamba_ssm_cache_dtype float32
 ```
 
+
 Note: Remember to add `--mamba_ssm_cache_dtype float32` for accurate quality. Without this option, the model’s accuracy may degrade.
 
 #### Using Budget Control with a vLLM Server
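Taken together, the two hunks reduce the setup to the commands below. This is a sketch assembled from the diff: the `vllm serve` invocation is taken from the second hunk's header line, and the full README may pass additional flags not shown in this commit.

```shell
# Install a released vLLM build (this commit replaces the previous
# clone-and-build-from-source instructions with a pip install).
pip install -U "vllm>=0.10.1"

# Start the server. --mamba_ssm_cache_dtype float32 is required for
# accurate quality; without it the model's accuracy may degrade.
vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
  --mamba_ssm_cache_dtype float32
```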