Minor fixes in example code snippets and chat template description
README.md (CHANGED)
````diff
@@ -385,7 +385,7 @@ git clone https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2
 
 vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
 --trust-remote-code \
---mamba_ssm_cache_dtype float32
+--mamba_ssm_cache_dtype float32 \
 --enable-auto-tool-choice \
 --tool-parser-plugin "NVIDIA-Nemotron-Nano-9B-v2/nemotron_toolcall_parser_no_streaming.py" \
 --tool-call-parser "nemotron_json"
````
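This hunk restores the trailing `\` after `--mamba_ssm_cache_dtype float32`; without it the shell ends the command at that flag and the tool-calling options below it are never passed to `vllm serve`. As a quick sanity check, a minimal sketch of calling the resulting OpenAI-compatible endpoint is shown below; the `http://localhost:8000/v1` address, the dummy API key, and the `get_weather` tool schema are illustrative assumptions, not part of the model card.

```python
# Minimal sketch: exercise the tool-calling server started by the command above.
# Assumptions: vLLM's default OpenAI-compatible address http://localhost:8000/v1,
# a dummy API key, and an illustrative "get_weather" tool schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
)

# With --enable-auto-tool-choice and the nemotron_json parser enabled, tool
# invocations come back as structured tool_calls instead of raw text.
print(response.choices[0].message.tool_calls)
```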
````diff
@@ -479,7 +479,7 @@ Okay, let's see. The user has a bill of $100 and wants to know the amount for an
 
 ## Prompt Format
 
-We follow the jinja chat template provided below. This template conditionally adds `<think>\n` to the start of the Assistant response if `/think` is found in the system prompt or
+We follow the jinja chat template provided below. This template conditionally adds `<think>\n` to the start of the Assistant response if `/think` is found in either the system prompt or any user message. If no reasoning signal is added, the model defaults to reasoning "on" mode. The chat template adds `<think></think>` to the start of the Assistant response if `/no_think` is found in the system prompt, thus enforcing reasoning on/off behavior.
 
 ```
 {%- set ns = namespace(enable_thinking = true) %}
````
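Since this hunk documents the `/think` and `/no_think` controls, a minimal sketch of inspecting the rendered prompt with `transformers` is shown below; the exact tail of the rendered string depends on the chat template shipped with the checkpoint, so the expected `<think>` markers are stated as assumptions rather than verified output.

```python
# Minimal sketch: render the chat template with and without the reasoning
# markers described above. Assumes the template bundled with the checkpoint
# behaves as documented; the user question is arbitrary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "nvidia/NVIDIA-Nemotron-Nano-9B-v2", trust_remote_code=True
)

def render(system_text: str) -> str:
    messages = [
        {"role": "system", "content": system_text},
        {"role": "user", "content": "What is 2 + 2?"},
    ]
    return tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

# Reasoning on: the generation prompt is expected to end with an open "<think>\n".
print(repr(render("/think")[-40:]))

# Reasoning off: the template is expected to append an empty "<think></think>".
print(repr(render("/no_think")[-40:]))

# No marker at all: reasoning defaults to "on" per the updated description.
print(repr(render("You are a helpful assistant.")[-40:]))
```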