Cosmos-Reason2-2B-NVFP4A16

A quantized version of nvidia/Cosmos-Reason2-2B, optimized for reduced GPU memory usage and improved inference efficiency while maintaining high-quality multimodal reasoning performance.

This model was created by quantizing the base model to FP4 weights while keeping activations in FP16 precision. The model preserves the reasoning capabilities of the original Cosmos-Reason2-2B model while significantly reducing the memory footprint of model weights.
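
As a rough illustration of the weight-memory saving, the sketch below compares 16-bit weights with NVFP4 weights. It assumes the usual NVFP4 layout (4-bit values with one 8-bit scale shared by each block of 16 values), ignores the small per-tensor scale, and uses a nominal, hypothetical count of 2 billion quantized parameters. Layers of this checkpoint that remain in higher precision are not accounted for, so the real saving may be smaller.

```python
# Rough weight-memory estimate: 16-bit weights vs. NVFP4 weights.
# Assumptions (not taken from the model card): NVFP4 stores 4-bit values
# plus one 8-bit scale per block of 16 values; NUM_PARAMS is a nominal,
# hypothetical figure, and layers left unquantized are ignored.

NVFP4_VALUE_BITS = 4
NVFP4_SCALE_BITS = 8
NVFP4_BLOCK_SIZE = 16
NUM_PARAMS = 2_000_000_000  # hypothetical count of quantized parameters


def bits_per_weight_nvfp4() -> float:
    """Effective bits per weight, including the shared block scale."""
    return NVFP4_VALUE_BITS + NVFP4_SCALE_BITS / NVFP4_BLOCK_SIZE


def weight_gib(num_params: int, bits_per_weight: float) -> float:
    """Weight storage in GiB for a given per-weight bit width."""
    return num_params * bits_per_weight / 8 / 2**30


bf16_gib = weight_gib(NUM_PARAMS, 16)
nvfp4_gib = weight_gib(NUM_PARAMS, bits_per_weight_nvfp4())
print(f"16-bit weights: {bf16_gib:.2f} GiB")
print(f"NVFP4 weights:  {nvfp4_gib:.2f} GiB")
print(f"Reduction:      {bf16_gib / nvfp4_gib:.2f}x")
```

Under these assumptions the effective cost is 4.5 bits per weight, so the saving applies to the weights only; activations and the KV cache still use 16-bit precision.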

Embedl’s proprietary optimizations and architectural enhancements, which enable more efficient inference, require patching vLLM. These patches will be released at a later date. Until then, the model can be used with vLLM through the NVIDIA Jetson container.

NVFP4 quantization is a Blackwell-generation feature (e.g., supported in Jetson AGX Thor). Earlier architectures (Ampere, Ada, Hopper) do not provide native hardware support for NVFP4 tensor core acceleration. For systems not supporting FP4, we recommend using embedl/Cosmos-Reason2-2B-W4A16.
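
The choice between the two checkpoints can be sketched as a check on the GPU's CUDA compute capability. This is a minimal sketch, not an official selection rule: the threshold `MIN_NVFP4_MAJOR` is an assumption (Blackwell-generation devices are assumed to report a major compute capability of 10 or higher, while Ampere, Ada, and Hopper report 8.x or 9.x), and the helper names are hypothetical.

```python
# Pick a checkpoint from the GPU's CUDA compute capability.
# Assumption (not from the model card): Blackwell-generation devices report
# a major compute capability of 10 or higher. Verify this against NVIDIA's
# documentation for your device before relying on it.
from typing import Optional, Tuple

NVFP4_REPO = "embedl/Cosmos-Reason2-2B-NVFP4A16"
W4A16_REPO = "embedl/Cosmos-Reason2-2B-W4A16"
MIN_NVFP4_MAJOR = 10  # assumed threshold for native FP4 tensor cores


def pick_checkpoint(capability: Tuple[int, int]) -> str:
    """Return the NVFP4 repo on Blackwell-class GPUs, else the W4A16 repo."""
    major, _minor = capability
    return NVFP4_REPO if major >= MIN_NVFP4_MAJOR else W4A16_REPO


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return (major, minor) of GPU 0, or None if PyTorch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    capability = detect_capability()
    if capability is None:
        print("No CUDA GPU detected")
    else:
        print(pick_checkpoint(capability))
```

Even on supported hardware, the installed vLLM build still has to provide NVFP4 kernels, so this check is necessary rather than sufficient.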

Whether NVFP4 acceleration is actually used depends on the inference backend: not all versions of vLLM provide full or optimized support for NVFP4.

Output Example

Real output captured from NVIDIA Jetson AGX Thor

system: You are a helpful assistant.

user: Describe this video in detail: https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4

assistant: The video captures a scene of a highway winding through a wooded area. The road is divided into two lanes, with vehicles traveling in both directions. Cars of various colors and models are seen moving along the highway, some maintaining a steady pace while others navigate curves with apparent ease. Drivers appear attentive, adhering to traffic regulations and ensuring safe navigation through the bends. Amidst this routine movement, a lone individual in a red jacket stands out, seemingly lost in thought or perhaps enjoying the solitude of the surroundings.

On the right side of the highway, there is a pedestrian path that runs parallel to the road. This path is bordered by leafless trees, suggesting it might be late autumn or early spring. Sparse vegetation lines the edge of the path, blending seamlessly with the natural landscape. The trees stand tall and bare, their branches stark against the clear sky.

Prominent in the foreground on the right side of the frame is a distinctive yellow and black striped pole, likely a warning or safety marker for drivers. The pole stands out due to its bright colors, contrasting sharply with the muted tones of the surrounding environment. [...]

Model Details

Field	Value
Base Model	nvidia/Cosmos-Reason2-2B
Input / Output	Text + Image / Video → Text
Release Date	2026-02-24
Version	1.0
Optimizations	Quantization (NVFP4A16)
Developers	Embedl
Licenses	Upstream: NVIDIA Open Model License. Additional Information: Apache License 2.0. Optimized components: Embedl Models Community Licence v1.0 (no redistribution)
Intended Use	Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose NLP on NVIDIA GPUs

Optimizations

Quantization (NVFP4A16): weights are stored in NVFP4 while activations stay in 16-bit precision, giving a large reduction in memory footprint and latency.

Accuracy

For comparative evaluation, we report benchmark scores on the Physical AI Bench Reason Task.

We have not been able to reproduce the baseline scores reported for nvidia/Cosmos-Reason2-2B on the Physical AI Bench Leaderboard; see the related issue: https://github.com/nvidia-cosmos/cosmos-reason2/issues/52

Overall + Category Scores

Model	Overall	Embodied Reasoning	Common Sense
nvidia/Cosmos-Reason2-2B	50.60	53.93	47.19
embedl/Cosmos-Reason2-2B-NVFP4A16	49.84	50.16	49.50
embedl/Cosmos-Reason2-2B-W4A16	48.68	50.49	46.85
embedl/Cosmos-Reason2-2B-W4A16-Edge2	50.58	53.61	47.52
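
To make the comparison easier to read, the sketch below computes each quantized variant's score difference from the baseline row. The numbers are copied from the table above; the helper name `deltas_vs_baseline` is hypothetical.

```python
# Score deltas relative to the baseline row, using the values from the
# "Overall + Category Scores" table.

BASELINE = "nvidia/Cosmos-Reason2-2B"
COLUMNS = ("Overall", "Embodied Reasoning", "Common Sense")
SCORES = {
    "nvidia/Cosmos-Reason2-2B": (50.60, 53.93, 47.19),
    "embedl/Cosmos-Reason2-2B-NVFP4A16": (49.84, 50.16, 49.50),
    "embedl/Cosmos-Reason2-2B-W4A16": (48.68, 50.49, 46.85),
    "embedl/Cosmos-Reason2-2B-W4A16-Edge2": (50.58, 53.61, 47.52),
}


def deltas_vs_baseline(scores, baseline):
    """Return per-column differences (model minus baseline), rounded to 2 dp."""
    base = scores[baseline]
    return {
        model: tuple(round(v - b, 2) for v, b in zip(values, base))
        for model, values in scores.items()
        if model != baseline
    }


for model, delta in deltas_vs_baseline(SCORES, BASELINE).items():
    cells = ", ".join(f"{c}: {d:+.2f}" for c, d in zip(COLUMNS, delta))
    print(f"{model}: {cells}")
```

For this model the overall difference is -0.76 points, made up of a drop in Embodied Reasoning (-3.77) and a gain in Common Sense (+2.31).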

Subcategory Scores

Model	AV	Physical World	Time	Space	Agibot	HoloAssist	RoboFail	RoboVQA	BridgeData V2
nvidia/Cosmos-Reason2-2B	44.00	46.90	45.30	55.00	34.00	60.00	49.00	90.91	42.00
embedl/Cosmos-Reason2-2B-NVFP4A16	44.00	45.13	52.01	52.50	28.00	58.00	51.00	84.55	32.00
embedl/Cosmos-Reason2-2B-W4A16	36.00	47.79	44.30	53.75	36.00	61.00	42.00	80.91	44.00
embedl/Cosmos-Reason2-2B-W4A16-Edge2	45.00	44.25	48.66	52.50	32.00	59.00	54.00	85.45	43.00

Performance

On-device performance benchmarks can be explored on embedl/Edge-Inference-Benchmarks.

Usage Examples

vLLM Video Inference

vLLM image: NVIDIA vLLM 26.01

Test Hardware: NVIDIA Jetson AGX Thor

--gpu-memory-utilization and --max-model-len should be adapted to system specifications (i.e., available RAM).

docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "embedl/Cosmos-Reason2-2B-NVFP4A16" \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9
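
Once the server is running, it can be queried over HTTP. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at http://localhost:8000 and exposes vLLM's OpenAI-compatible /v1/chat/completions route; the helper names and the max_tokens value are illustrative choices, not part of the original setup.

```python
# Minimal client for the server started above, standard library only.
# Assumptions: the server listens on http://localhost:8000 and serves the
# OpenAI-compatible /v1/chat/completions route; max_tokens=256 is an
# arbitrary illustrative value.
import json
import urllib.request

BASE_URL = "http://localhost:8000"
MODEL = "embedl/Cosmos-Reason2-2B-NVFP4A16"
VIDEO_URL = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"


def build_payload(video_url: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completions request body with one video and one text part."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            },
        ],
    }


def describe_video(video_url: str, prompt: str) -> str:
    """POST the request to the running server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(video_url, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

After the container reports that the server is ready, calling `describe_video(VIDEO_URL, "Describe this video in detail.")` returns the model's description as a string.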

gpu_memory_utilization, max_model_len, and (if set) max_num_seqs should be adapted to system specifications (i.e., available RAM).

from vllm import LLM, SamplingParams

if __name__ == "__main__":

    model = "embedl/Cosmos-Reason2-2B-NVFP4A16"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": video_url, "fps": 4},
                },
                {
                    "type": "text",
                    "text": "Describe this video in detail.",
                },
            ],
        },
    ]

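    llm = LLM(
        model=model,
        # Allow one video per prompt and no images or audio.
        limit_mm_per_prompt={
            "video": {
                "count": 1,
                "num_frames": 12,
                "width": 1920,
                "height": 1080,
            },
            "image": 0,
            "audio": 0,
        },
        # Load every frame of the video rather than a fixed number.
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=16384,
        # Do not truncate the processed multimodal prompt.
        mm_processor_kwargs={"truncation": False},
        disable_log_stats=False,
        gpu_memory_utilization=0.9,
    )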

    output = llm.chat(
        messages,
        sampling_params=SamplingParams(max_tokens=256),
    )
    print(output[0].outputs[0].text)

Transformers Inference

Test Hardware: NVIDIA H200 GPU

Adapted from nvidia/Cosmos-Reason2-2B.

import torch
import transformers

if __name__ == "__main__":
    model_name = "embedl/Cosmos-Reason2-2B-NVFP4A16"
    model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
        model_name,
        device_map="auto",
        attn_implementation="sdpa",
        dtype="bfloat16",
    )
    processor: transformers.Qwen3VLProcessor = (
        transformers.AutoProcessor.from_pretrained(model_name)
    )
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    video_messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_url, "fps": 4},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    # Process inputs
    inputs = processor.apply_chat_template(
        video_messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        truncation=False,
        fps=4,
    )
    inputs = inputs.to(model.device)

    # Run inference
    generated_ids = model.generate(**inputs, max_new_tokens=8192)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(
            inputs.input_ids, generated_ids, strict=False
        )
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(output_text[0])

License

Built on NVIDIA Cosmos

This model is a derivative of nvidia/Cosmos-Reason2-2B.

Licensed by NVIDIA Corporation under the NVIDIA Open Model License

Upstream: NVIDIA Open Model License
Additional Information: Apache License 2.0
Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)