metascroy committed
Commit 0fc34f7 · verified · Parent(s): 375bcf6

Update README.md

Files changed (1):
  1. README.md +6 -6

README.md CHANGED
@@ -21,11 +21,11 @@ pipeline_tag: text-generation
  [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (8da4w).
  The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
 
- We provide the [quantized pte](TODO: ADD LINK) for direct use in ExecuTorch.
+ We provide the [quantized pte](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/qwen3-4b-1024-ctx.pte) for direct use in ExecuTorch.
  (The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
 
  # Running in a mobile app
- The [pte file](TODO: ADD LINK) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+ The [pte file](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/qwen3-4b-1024-ctx.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
  On iPhone 15 Pro, the model runs at [TODO: ADD] tokens/sec and uses [TODO: ADD] MB of memory.
 
  [TODO: ADD SCREENSHOT]
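For reference, a minimal sketch of how 8da4w quantization looks with torchao's config API. This is illustrative only, not necessarily the recipe used to produce this checkpoint: the config class and `group_size=32` are assumptions, and the released model additionally quantizes embeddings to 8 bits.

```Python
# Sketch: 8da4w quantization of Qwen3-4B with torchao (assumes a recent torchao
# with the config-object API; group_size is an assumed value).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype=torch.bfloat16)

# int8 dynamic activations + grouped int4 weights on linear layers (8da4w).
quantize_(model, Int8DynamicActivationInt4WeightConfig(group_size=32))

# Quick smoke test of the quantized model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
inputs = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```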
@@ -174,12 +174,12 @@ Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness
 
  ## baseline
  ```Shell
- lm_eval --model hf --model_args pretrained=Qwen3/Qwen3-4B --tasks hellaswag --device cuda:0 --batch_size auto
+ lm_eval --model hf --model_args pretrained=Qwen/Qwen3-4B --tasks mmlu --device cuda:0 --batch_size auto
  ```
 
  ## int8 dynamic activation and int4 weight quantization (8da4w)
  ```Shell
- lm_eval --model hf --model_args pretrained=TODO:ADD LINK --tasks hellaswag --device cuda:0 --batch_size auto
+ lm_eval --model hf --model_args pretrained=pytorch/Qwen3-4B-8da4w --tasks mmlu --device cuda:0 --batch_size auto
  ```
 
  | Benchmark | | |
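The same evaluation can be driven from Python instead of the CLI; a sketch, assuming lm-eval 0.4.x installed from source as noted in the hunk above (the result keys depend on the harness version):

```Python
# Sketch: evaluate the quantized checkpoint on MMLU via the lm-eval Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Qwen3-4B-8da4w",
    tasks=["mmlu"],
    device="cuda:0",
    batch_size="auto",
)
print(results["results"]["mmlu"])  # per-task metrics dict
```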
@@ -205,8 +205,8 @@ lm_eval --model hf --model_args pretrained=TODO:ADD LINK --tasks hellaswag --device cuda:0 --batch_size auto
  We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
  Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
 
- We first convert the [quantized checkpoint](TODO: ADD LINK) to one ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
- The following script does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](TODO: ADD LINK) for convenience.
+ We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/pytorch_model.bin) to the format ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
+ The following command does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/pytorch_model_converted.bin) for convenience.
  ```Shell
  python -m executorch.examples.models.qwen3.convert_weights pytorch_model.bin pytorch_model_converted.bin
  ```
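Conceptually, the conversion is just a key-renaming pass over the state dict. The sketch below shows the general shape of such a pass; the actual mapping lives in `executorch.examples.models.qwen3.convert_weights`, and the rename rule shown here is hypothetical, not the one ExecuTorch applies.

```Python
# Illustration only: a generic checkpoint key-renaming pass.
import torch

# weights_only=False because the torchao checkpoint contains tensor subclasses.
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)

converted = {}
for key, value in state_dict.items():
    # Hypothetical rule for illustration: strip a leading "model." prefix.
    converted[key.replace("model.", "", 1)] = value

torch.save(converted, "pytorch_model_converted.bin")
```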
 