Updated model card
Browse files

README.md CHANGED

@@ -11,25 +11,23 @@ tags:
 - smollm
 ---
 
-Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), you can directly download the `*.pte` and tokenizer file and run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
-
-```Py
-python install_dev.py
-```
-
+[HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) is quantized using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamically quantized activations with 4-bit weights in the linear layers (`8da4w`). It is then lowered to [ExecuTorch](https://github.com/pytorch/executorch) with several optimizations (custom SDPA, custom KV cache, and parallel prefill) to achieve high performance on the CPU backend, making it well-suited for mobile deployment.
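+
+For reference, below is a minimal sketch of applying the 8da4w linear quantization with torchao's `quantize_` API. The `group_size=32` and float32 load dtype are illustrative assumptions rather than the exact recipe behind this checkpoint, and the 8-bit embedding quantization is omitted:
+
+```Py
+# Sketch only: 8-bit dynamic activations + 4-bit grouped weights ("8da4w")
+# applied to the linear layers. group_size=32 is an assumed value.
+import torch
+from torchao.quantization import Int8DynamicActivationInt4WeightConfig, quantize_
+from transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained(
+    "HuggingFaceTB/SmolLM3-3B", torch_dtype=torch.float32
+)
+# Quantize every nn.Linear in place with the 8da4w scheme.
+quantize_(model, Int8DynamicActivationInt4WeightConfig(group_size=32))
+```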
+
+We provide the [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-8da4w/blob/main/smollm3-3b-8da4w.pte) for direct use in ExecuTorch. *(The provided `.pte` file is exported with the default `max_seq_length`/`max_context_length` of 2k.)*
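+
+Because the repo and file names are fixed, the `.pte` (and the upstream tokenizer) can be fetched with `huggingface_hub`; the `tokenizer.json` filename is an assumption based on the usual layout of the source repo:
+
+```Py
+# Download the published .pte and a tokenizer for use with an ExecuTorch runtime.
+from huggingface_hub import hf_hub_download
+
+pte_path = hf_hub_download(
+    repo_id="pytorch/SmolLM3-3B-8da4w",
+    filename="smollm3-3b-8da4w.pte",
+)
+tokenizer_path = hf_hub_download(  # assumed standard tokenizer location
+    repo_id="HuggingFaceTB/SmolLM3-3B",
+    filename="tokenizer.json",
+)
+print(pte_path, tokenizer_path)
+```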
+
+# Running in a mobile app
+The [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-8da4w/blob/main/smollm3-3b-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the instructions for doing this on [iOS](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) and [Android](https://docs.pytorch.org/executorch/main/llm/llama-demo-android.html).
+
+On Google's Pixel 8 Pro, the model runs at 12.7 tokens/s.
+
+# Running with ExecuTorch’s sample runner
+You can also run this model with ExecuTorch’s sample runner by following [Steps 3 and 4 of these instructions](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-3-run-on-your-computer-to-validate).
+
+# Export Recipe
+You can re-create the `.pte` file from the eager model using this export recipe.
+
+First install `optimum-executorch` by following these [instructions](https://github.com/huggingface/optimum-executorch?tab=readme-ov-file#-quick-installation); then use `optimum-cli` to export the model to ExecuTorch:
 ```Shell
 optimum-cli export executorch \
   --model HuggingFaceTB/SmolLM3-3B \

