Update README.md
README.md CHANGED
@@ -23,7 +23,7 @@ The model is suitable for mobile deployment with [ExecuTorch](https://github.com
 See [Exporting to ExecuTorch](#exporting-to-executorch) for exporting the quantized model to an ExecuTorch pte file. We also provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) for direct use.
 
 # Running in a mobile app
-The [
+The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 MB of memory.
 
 (image)
@@ -37,7 +37,7 @@ pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/c
 ```
 
 ## Untie Embedding Weights
-
+We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
 
 ```Py
 from transformers import (
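The iOS demo app linked above is the intended way to run the pte, but the exported file can also be smoke-tested on a desktop with ExecuTorch's Python runtime. This is a minimal sketch, not part of the commit: the file name is assumed, and ptes exported for the llama runner may depend on its custom kernels, so this only verifies that the program and its forward method load.

```Py
# Desktop load check for the exported pte (assumed to be in the
# working directory); does not run generation.
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("phi4-mini-8da4w.pte")
method = program.load_method("forward")
print("pte loaded; forward method ready")
```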
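The untie step added in the second hunk continues into a code block that the diff truncates at `from transformers import (`. For context, here is a minimal sketch of untying tied embedding/lm_head weights in transformers, assuming the Phi-4-mini base checkpoint and an output directory named purely for illustration; the README's actual code may differ:

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Give lm_head its own copy of the weight so it no longer shares
# storage with the input embedding table.
model.lm_head.weight = torch.nn.Parameter(model.lm_head.weight.clone())

# Record the untying so save_pretrained/from_pretrained do not re-tie.
model.config.tie_word_embeddings = False
model._tied_weights_keys = []

model.save_pretrained("phi4-mini-untied")  # assumed output directory
tokenizer.save_pretrained("phi4-mini-untied")
```

Once the parameters are separate, the embedding and lm_head can be given different quantization configs before export.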