nvidia/NVIDIA-Nemotron-Nano-9B-v2

Text Generation


Instructions for using nvidia/NVIDIA-Nemotron-Nano-9B-v2 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use nvidia/NVIDIA-Nemotron-Nano-9B-v2 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/NVIDIA-Nemotron-Nano-9B-v2", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-9B-v2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-9B-v2", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Apply the chat template and tokenize in one step, then move the tensors to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens by slicing off the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
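Given a list of chat messages, the text-generation pipeline returns (in recent Transformers releases) a list with one entry per input, whose `generated_text` field holds the full conversation ending with the new assistant turn. The helper below is a minimal sketch that assumes this output shape. `last_assistant_reply` is a hypothetical name and is not part of Transformers.

```python
from typing import Any


def last_assistant_reply(pipeline_output: list[dict[str, Any]]) -> str:
    """Return the content of the final assistant message.

    Assumes the chat-style output shape of the text-generation pipeline:
    [{"generated_text": [{"role": ..., "content": ...}, ...]}]
    """
    if not pipeline_output:
        raise ValueError("pipeline returned no outputs")
    conversation = pipeline_output[0]["generated_text"]
    # Walk backwards so the newest assistant turn is picked.
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message.get("content", "")
    raise ValueError("no assistant message found in pipeline output")


# Usage, with `pipe` and `messages` from the snippet above:
# print(last_assistant_reply(pipe(messages)))
```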

Notebooks: Google Colab, Kaggle
Local Apps

How to use nvidia/NVIDIA-Nemotron-Nano-9B-v2 with vLLM:

Install from pip and serve model

```shell
# Install vLLM from pip:
pip install vllm
# Start the vLLM server (this keeps running in the foreground):
vllm serve "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
# From another terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
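The curl call above can also be made from Python. Below is a minimal sketch that uses only the standard library. It assumes the vLLM server started above is listening on its default `http://localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field. `build_chat_request`, `extract_reply`, and `chat` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed defaults, matching the vLLM curl example above.
BASE_URL = "http://localhost:8000"
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt: str, base_url: str = BASE_URL, model: str = MODEL) -> str:
    """Send one chat request and return the assistant's reply."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=300) as response:
        return extract_reply(json.load(response))


# Usage, with the vLLM server running:
# print(chat("What is the capital of France?"))
```

Keeping request construction and response parsing in separate functions lets both be checked without a running server.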

Use Docker

```shell
docker model run hf.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2
```

How to use nvidia/NVIDIA-Nemotron-Nano-9B-v2 with SGLang:

Install from pip and serve model

```shell
# Install SGLang from pip:
pip install sglang
# Start the SGLang server (this keeps running in the foreground):
python3 -m sglang.launch_server \
    --model-path "nvidia/NVIDIA-Nemotron-Nano-9B-v2" \
    --host 0.0.0.0 \
    --port 30000
# From another terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
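OpenAI-compatible chat endpoints also accept `"stream": true` in the request body. In that case the reply arrives incrementally as server-sent events. The sketch below assumes SGLang follows OpenAI's chat streaming format: lines of the form `data: {json}`, where `choices[0].delta.content` carries the next text fragment, ending with `data: [DONE]`. `iter_stream_text` and `stream_chat` are hypothetical names. The URL matches the `--port 30000` used above.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines.

    Assumes each event line is 'data: {json}' and the stream ends with
    'data: [DONE]'.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators between events, keep-alive comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a chunk that only carries usage statistics
        content = (choices[0].get("delta") or {}).get("content")
        if content:
            yield content


def stream_chat(prompt: str, base_url: str = "http://localhost:30000") -> Iterator[str]:
    """Send a streaming chat request and yield the reply piece by piece."""
    payload = {
        "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    request = urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        # Iterating an HTTP response yields raw byte lines.
        yield from iter_stream_text(line.decode("utf-8") for line in response)


# Usage, with the SGLang server running:
# for piece in stream_chat("What is the capital of France?"):
#     print(piece, end="", flush=True)
```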

Use Docker images

```shell
# Replace <secret> with your Hugging Face access token.
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/NVIDIA-Nemotron-Nano-9B-v2" \
        --host 0.0.0.0 \
        --port 30000
# From another terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
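When the server runs in a container, it may need several minutes to download and load the weights before the curl call succeeds. A minimal readiness check, sketched below, polls `GET /v1/models` until it answers with HTTP 200 or a timeout expires. That endpoint is part of the OpenAI-compatible API and is assumed here to be served by this SGLang image. `http_probe` and `wait_until_ready` are hypothetical helpers. The probe, sleep, and clock functions are injectable so the polling loop can be tested without a server.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on url answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError, refused connections, and socket timeouts all derive from OSError.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() every interval_s seconds until it succeeds or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage, after the `docker run ...` command above has been started:
# ready = wait_until_ready(lambda: http_probe("http://localhost:30000/v1/models"))
# print("server ready" if ready else "timed out waiting for the server")
```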

Docker Model Runner
How to use nvidia/NVIDIA-Nemotron-Nano-9B-v2 with Docker Model Runner:
```
docker model run hf.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2
```


Commit History

| Message | Commit | Author | Date | Verified |
|---|---|---|---|---|
| Update modeling_nemotron_h.py (#36) | 6533e8d | dmax123 | Mar 5 | |
| Update config.json (#35) | a4fb579 | suhara | Jan 8 | yes |
| Update README.md | bce37e2 | suhara | Dec 5, 2025 | yes |
| Fixing nested JSON args parsing for tool-calls in streaming (#32) | 7d4e437 | ameyasunilm | Nov 25, 2025 | yes |
| Updating streaming tool-call parser to return ChoiceDeltaToolCall (#31) | b90f131 | ameyasunilm | Nov 19, 2025 | yes |
| Upload streaming tool call parser python file for vLLM (#30) | 579351d | ameyasunilm | Nov 10, 2025 | yes |
| Update modeling_nemotron_h.py | dbe2b5b | suhara | Nov 4, 2025 | yes |
| Update README.md | d97784d | suhara | Oct 15, 2025 | yes |
| Update README.md | c9beb84 | suhara | Oct 9, 2025 | yes |
| Update README.md | e5610bb | suhara | Oct 2, 2025 | yes |
| Update README.md | dc376c2 | suhara | Aug 30, 2025 | yes |
| Use remaining_tokens for max_tokens in vLLM token budget demo | 41409e7 | suhara | Aug 27, 2025 | yes |
| Update README.md | b5c2277 | suhara | Aug 26, 2025 | yes |
| Update README.md | 3298c11 | suhara | Aug 26, 2025 | yes |
| Update README.md (#13) | ecde253 | igitman | Aug 25, 2025 | yes |
| Fix model_max_length (#11) | d99d974 | suhara | Aug 21, 2025 | yes |
| Upload modeling_nemotron_h.py (#10) | 1370501 | suhara | Aug 21, 2025 | yes |
| Update README.md | a180ba8 | suhara | Aug 21, 2025 | yes |
| Update config.json | 7659b75 | suhara | Aug 21, 2025 | yes |
| Update README.md | 20547a9 | suhara | Aug 20, 2025 | yes |
| Updating evaluation details for RULER (reasoning off) (#6) | 4a28fbc | ameyasunilm | Aug 20, 2025 | yes |
| Update README.md | bd0d6d5 | suhara | Aug 19, 2025 | yes |
| Update README.md | a550406 | suhara | Aug 19, 2025 | yes |
| Update README.md | 60cfbe0 | Sharath Turuvekere Sreenivas | Aug 18, 2025 | yes |
| Update README.md | d566bdf | suhara | Aug 18, 2025 | yes |
| Update README.md | 39d09ce | suhara | Aug 18, 2025 | yes |
| Update README.md | 786a9ff | suhara | Aug 18, 2025 | yes |
| Update README.md | 4f47b96 | suhara | Aug 18, 2025 | yes |
| Update README.md | 18c4545 | suhara | Aug 18, 2025 | yes |
| Minor fixes in example code snippets and chat template description (#2) | 92f6429 | ameyasunilm | Aug 18, 2025 | yes |
| Update README.md | 3fa6107 | Sharath Turuvekere Sreenivas | Aug 18, 2025 | yes |
| Upload accuracy_chart.png | 3ce744b | suhara | Aug 18, 2025 | yes |
| Update README.md | dd33c82 | suhara | Aug 18, 2025 | yes |
| Upload accuracy_chart.png | 2f9072b | suhara | Aug 18, 2025 | yes |
| Update README.md | 571bcd3 | suhara | Aug 18, 2025 | yes |
| Update README.md | 148df52 | suhara | Aug 18, 2025 | yes |
| Update README.md | 5d3206d | suhara | Aug 18, 2025 | yes |
| Upload acc-vs-budget.png | c3c61c7 | suhara | Aug 18, 2025 | yes |
| Upload 5 files | 0a9d93a | suhara | Aug 18, 2025 | yes |
| Update chat template for adding sys header even when no sys message, and allowing assistant response pre-fixing (#1) | b27e387 | ameyasunilm | Aug 15, 2025 | yes |
| Upload nemotron_toolcall_parser_no_streaming.py | 56ed81d | suhara | Aug 13, 2025 | yes |
| Upload folder using huggingface_hub | 8550779 | suhara | Aug 12, 2025 | yes |
| initial commit | 64917be | suhara | Aug 12, 2025 | yes |