Update README.md
README.md CHANGED
@@ -23,7 +23,7 @@ The model is suitable for mobile deployment with [ExecuTorch](https://github.com
 See [Exporting to ExecuTorch](#exporting-to-executorch) for exporting the quantized model to an ExecuTorch pte file. We also provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) for direct use.
 
 # Running in a mobile app
-The [
+The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 MB of memory.
 
 (image)
@@ -37,7 +37,7 @@ pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/c
 ```
 
 ## Untie Embedding Weights
-
+We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
 
 ```Py
 from transformers import (
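The iOS demo app linked above is the intended way to run the pte, but the exported file can also be smoke-tested on a desktop with ExecuTorch's Python runtime. This is a minimal sketch, not part of the commit: the file name is assumed, and ptes exported for the llama runner may depend on its custom kernels, so this only verifies that the program and its forward method load.

```Py
# Desktop load check for the exported pte (assumed to be in the
# working directory); does not run generation.
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("phi4-mini-8da4w.pte")
method = program.load_method("forward")
print("pte loaded; forward method ready")
```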
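The untie step added in the second hunk continues into a code block that the diff truncates at `from transformers import (`. For context, here is a minimal sketch of untying tied embedding/lm_head weights in transformers, assuming the Phi-4-mini base checkpoint and an output directory named purely for illustration; the README's actual code may differ:

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Give lm_head its own copy of the weight so it no longer shares
# storage with the input embedding table.
model.lm_head.weight = torch.nn.Parameter(model.lm_head.weight.clone())

# Record the untying so save_pretrained/from_pretrained do not re-tie.
model.config.tie_word_embeddings = False
model._tied_weights_keys = []

model.save_pretrained("phi4-mini-untied")  # assumed output directory
tokenizer.save_pretrained("phi4-mini-untied")
```

Once the parameters are separate, the embedding and lm_head can be given different quantization configs before export.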