+---
+library_name: vllm
+license: apache-2.0
+language:
+  - en
+  - fr
+  - es
+  - it
+  - pt
+  - zh
+  - ar
+  - ru
+base_model:
+  - HuggingFaceTB/SmolLM3-3B
+tags:
+- neuralmagic
+- redhat
+- llmcompressor
+- fp8
+- quantized
+---
+## Model Overview
+- **Model Architecture:** SmolLM3-3B
+  - **Input:** Text
+  - **Output:** Text
+- **Model Optimizations:**
+  - **Weight quantization:** FP8
+  - **Activation quantization:** FP8
+- **Release Date:** 07/28/2025
+- **Version:** 1.0
+- **License(s):** Apache-2.0
+- **Model Developers:** Red Hat (Neural Magic)
+### Model Optimizations
+This model was obtained by quantizing the activations and weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to the FP8 data type.
+This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
+Weight quantization also reduces disk size requirements by approximately 50%.
+Only the weights and activations of the linear operators within transformer blocks are quantized.
+Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
+The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
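
To make the two schemes above concrete, here is a minimal pure-Python sketch of how symmetric scales could be computed. It is an illustration, not the llm-compressor implementation: it assumes the FP8 E4M3 variant (largest finite value 448), the helper names are hypothetical, and the final rounding onto the FP8 grid is omitted.

```python
# Illustrative sketch only; not the llm-compressor implementation.
FP8_E4M3_MAX = 448.0  # largest finite value of FP8 E4M3 (assumed FP8 variant)


def symmetric_scale(values):
    """Scale that maps the largest magnitude in `values` onto the FP8 range."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero vector so later divisions stay defined.
    return max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0


def per_channel_weight_scales(weight):
    """Static scheme: one scale per output channel (row), computed once offline."""
    return [symmetric_scale(row) for row in weight]


def per_token_activation_scales(activations):
    """Dynamic scheme: one scale per token (row), recomputed for every input."""
    return [symmetric_scale(token) for token in activations]


def scale_into_fp8_range(values, scale):
    """Divide by the scale and clamp; rounding to the FP8 grid is omitted."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


weight = [[0.5, -2.0, 1.0], [0.1, 0.05, -0.2]]       # 2 output channels x 3 inputs
activations = [[3.0, -1.5, 0.5], [0.25, 0.5, -1.0]]  # 2 tokens x 3 features

w_scales = per_channel_weight_scales(weight)
a_scales = per_token_activation_scales(activations)
scaled_row = scale_into_fp8_range(weight[0], w_scales[0])
print(w_scales, a_scales, scaled_row)
```

Because the scheme is symmetric, no zero point is stored: each weight row and each token carries a single scale. Weight scales can be saved with the checkpoint, while activation scales depend on the input and are computed at inference time.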
+## Deployment
+This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+```python
+from vllm import LLM, SamplingParams
+from transformers import AutoTokenizer
+model_id = "RedHatAI/SmolLM3-3B-FP8-dynamic"
+number_gpus = 1
+sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+messages = [
+    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+    {"role": "user", "content": "Who are you?"},
+]
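+# add_generation_prompt=True appends the assistant turn header so the model starts a reply
+prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)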
+llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+outputs = llm.generate(prompts, sampling_params)
+generated_text = outputs[0].outputs[0].text
+print(generated_text)
+```
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
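
As a rough sketch of the OpenAI-compatible route, the snippet below builds a chat-completions request using only the Python standard library. It assumes a server was started separately (for example with `vllm serve RedHatAI/SmolLM3-3B-FP8-dynamic`) and is listening on vLLM's default `localhost:8000`; `build_chat_request` and `chat` are hypothetical helper names, and the sampling values simply mirror the offline example above.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/SmolLM3-3B-FP8-dynamic"
BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port


def build_chat_request(messages, max_tokens=256, temperature=0.7, top_p=0.8):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request([{"role": "user", "content": "Who are you?"}])
print(request.full_url)
# Uncomment once the server is running:
# print(chat([{"role": "user", "content": "Who are you?"}]))
```

The server applies the model's chat template itself, so the client sends plain role/content messages rather than a pre-formatted prompt.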
+## Creation
+<details>
+  <summary>Creation details</summary>
+  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
+  ```python
+  from transformers import AutoModelForCausalLM, AutoTokenizer
+  from llmcompressor.modifiers.quantization import QuantizationModifier
+  from llmcompressor.transformers import oneshot
+  # Load model
+  model_stub = "HuggingFaceTB/SmolLM3-3B"
+  model_name = model_stub.split("/")[-1]
+  tokenizer = AutoTokenizer.from_pretrained(model_stub)
+  model = AutoModelForCausalLM.from_pretrained(
+      model_stub,
+      device_map="auto",
+      torch_dtype="auto",
+  )
+  # Configure the quantization algorithm and scheme
+  recipe = QuantizationModifier(
+      targets="Linear",
+      scheme="FP8_dynamic",
+      ignore=["lm_head"],
+  )
+  # Apply quantization
+  oneshot(
+      model=model,
+      recipe=recipe,
+  )
+  # Save to disk in compressed-tensors format
+  save_path = model_name + "-FP8-dynamic"
+  model.save_pretrained(save_path)
+  tokenizer.save_pretrained(save_path)
+  print(f"Model and tokenizer saved to: {save_path}")
+  ```
+</details>
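
One way to sanity-check the saved directory is to inspect its `config.json`. The sketch below assumes the compressed-tensors export records a `quantization_config` entry there; the exact fields may differ between llm-compressor versions. The helper name is hypothetical, and the demo runs against a stand-in file rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def read_quantization_config(save_path):
    """Return the `quantization_config` entry of config.json, or None if absent."""
    config = json.loads((Path(save_path) / "config.json").read_text())
    return config.get("quantization_config")


# Self-contained demo on a stand-in config.json (not a real checkpoint).
with tempfile.TemporaryDirectory() as tmp_dir:
    stand_in = {"quantization_config": {"quant_method": "compressed-tensors"}}
    (Path(tmp_dir) / "config.json").write_text(json.dumps(stand_in))
    demo_config = read_quantization_config(tmp_dir)

print(demo_config)
```

For the real checkpoint, pass the `save_path` from the creation script. A `None` result would suggest the model was saved without quantization metadata.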
+## Evaluation
+This model was evaluated on the well-known reasoning tasks AIME24, MATH-500, and GPQA-Diamond.
+In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine, and evaluations were collected with the [LightEval](https://github.com/huggingface/lighteval) library.
+<details>
+  <summary>Evaluation details</summary>
+  ```bash
+    export VLLM_WORKER_MULTIPROC_METHOD=spawn
+    export MODEL="RedHatAI/SmolLM3-3B-FP8-dynamic"
+    export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+    export TASK=aime24 # {aime24, math_500, gpqa:diamond}
+    lighteval vllm $MODEL_ARGS "lighteval|${TASK}|0|0" \
+        --use-chat-template \
+        --output-dir out_dir
+  ```
+</details>
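
The benchmark labels in the accuracy table, such as "pass@1:64", are read here as pass@1 estimated from 64 sampled generations per problem. That reading is an assumption about the naming, and the code below is not LightEval's implementation. Under it, the standard unbiased pass@k estimator reduces to the fraction of correct samples when k = 1, as this sketch with hypothetical per-problem counts shows.

```python
from math import comb


def pass_at_k(num_samples, num_correct, k):
    """Unbiased pass@k estimate from `num_samples` generations, `num_correct` correct."""
    if num_samples - num_correct < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


# Hypothetical counts of correct generations (out of 64) for four problems.
correct_counts = [64, 32, 0, 16]
per_problem = [pass_at_k(64, c, 1) for c in correct_counts]
benchmark_score = 100 * sum(per_problem) / len(per_problem)
print(benchmark_score)  # 43.75
```

Averaging over many samples matters at temperature 0.6, since a single sampled generation per problem gives a noisy score on small benchmarks.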
+### Accuracy
+<table>
+  <tr>
+   <th>Category
+   </th>
+   <th>Benchmark
+   </th>
+   <th>HuggingFaceTB/SmolLM3-3B
+   </th>
+   <th>RedHatAI/SmolLM3-3B-FP8-dynamic<br>(this model)
+   </th>
+   <th>Recovery
+   </th>
+  </tr>
+  <tr>
+   <td rowspan="5" ><strong>Reasoning</strong>
+   </td>
+   <td>AIME24 (pass@1:64)
+   </td>
+   <td>45.31
+   </td>
+   <td>47.50
+   </td>
+   <td>104.83%
+   </td>
+  </tr>
+  <tr>
+   <td>MATH-500 (pass@1:4)
+   </td>
+   <td>89.30
+   </td>
+   <td>88.30
+   </td>
+   <td>98.88%
+   </td>
+  </tr>
+  <tr>
+   <td>GPQA-Diamond (pass@1:8)
+   </td>
+   <td>41.22
+   </td>
+   <td>40.91
+   </td>
+   <td>99.25%
+   </td>
+  </tr>
+  <tr>
+   <td>GSM-8K (CoT, 8-shot, strict-match)
+   </td>
+   <td>94.16
+   </td>
+   <td>94.92
+   </td>
+   <td>100.8%
+   </td>
+  </tr>
+  <tr>
+   <td><strong>Average</strong>
+   </td>
+   <td><strong>58.61</strong>
+   </td>
+   <td><strong>58.90</strong>
+   </td>
+   <td><strong>100.5%</strong>
+   </td>
+  </tr>
+</table>
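
The Recovery column can be reproduced as the quantized score divided by the baseline score. The sketch below recomputes it from the table values. The reported averages (58.61 and 58.90) match the mean over AIME24, MATH-500, and GPQA-Diamond, so the sketch averages only those three tasks.

```python
# Scores copied from the accuracy table: (baseline, quantized).
scores = {
    "AIME24": (45.31, 47.50),
    "MATH-500": (89.30, 88.30),
    "GPQA-Diamond": (41.22, 40.91),
    "GSM-8K": (94.16, 94.92),
}


def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


for name, (baseline, quantized) in scores.items():
    print(f"{name}: {recovery(baseline, quantized):.2f}%")

# The reported averages correspond to these three tasks.
averaged_tasks = ["AIME24", "MATH-500", "GPQA-Diamond"]
avg_baseline = sum(scores[t][0] for t in averaged_tasks) / len(averaged_tasks)
avg_quantized = sum(scores[t][1] for t in averaged_tasks) / len(averaged_tasks)
print(f"Average: {avg_baseline:.2f} -> {avg_quantized:.2f} "
      f"({recovery(avg_baseline, avg_quantized):.1f}%)")
```

A recovery above 100% (as on AIME24) most likely reflects sampling variance rather than a genuine improvement from quantization.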