Qwen3-1.7B — Apple Core AI Export (iPhone GPU)
Pre-converted Apple Core AI (.aimodelc) bundle of Qwen/Qwen3-1.7B, produced
with the coreai-models export recipe and published unmodified. Hashes are
embedded so the artifact serves as a reproducible reference point.
This fills the missing 1.7B rung in the dense Qwen3 Core AI line
(0.6B, 4B and 8B already exist), and 1.7B is a meaningful rung: it is the
largest dense Qwen3 that still invokes under LiteRT-LM on iOS, where it was
measured alongside this bundle in a neutral cross-runtime benchmark.
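Because the hashes are what make this artifact a reference point, a consumer would typically re-hash the bundle files and compare. The sketch below does that with streaming SHA-256. It assumes the expected hashes are available as a hypothetical `{relative_path: sha256_hex}` mapping; where and in what format the real hashes are embedded is not specified here, and `sha256_file` / `verify_bundle` are illustrative names, not part of any published tool.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight blobs never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose hash does not match.

    `manifest` is a hypothetical {relative_path: sha256_hex} mapping; the real
    location and format of the embedded hashes is an assumption.
    """
    mismatches = []
    for rel_path, expected in sorted(manifest.items()):
        file_path = bundle_dir / rel_path
        if not file_path.is_file() or sha256_file(file_path) != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```

An empty list from `verify_bundle(Path("ios-gpu"), manifest)` would mean every listed file matches.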
Why GPU-only (no ANE bundle)
The 0.6B repo ships both an ANE bundle (static-shape, palettized) and a GPU bundle (dynamic INT4). At 1.7B the ANE export is omitted on purpose: the static-shape ANE bundle loads but fails to invoke on iOS 27 (a full benchmark run window produced no output), which is the same ANE invoke ceiling the 4B static export hits. Rather than ship a bundle that does not run, only the GPU (dynamic INT4) export is published here; it invokes and decodes cleanly on device. Core AI's GPU path is unaffected and runs 0.6B, 1.7B and 4B on iPhone; it is the ANE path that tops out below 1.7B.
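The "loads but fails to invoke" criterion amounts to a time-boxed smoke test: start a generation and check whether any token appears within the run window. A minimal sketch, assuming the runtime is wrapped in a hypothetical `generate()` callable that returns a token iterator; the window length and wrapper are assumptions, not the companion benchmark's actual harness.

```python
import queue
import threading
from typing import Callable, Iterable


def invokes_within(generate: Callable[[], Iterable[str]], window_s: float) -> bool:
    """Return True if `generate` yields at least one token before `window_s` elapses.

    `generate` is a hypothetical stand-in for a runtime's streaming decode call.
    No token inside the window (a hang, an empty stream, or an exception)
    counts as "fails to invoke".
    """
    result: queue.Queue = queue.Queue(maxsize=1)

    def worker() -> None:
        emitted = False
        try:
            for _ in generate():
                emitted = True
                break  # one token is enough to prove the model invokes
        except Exception:
            emitted = False  # a runtime error also means no usable output
        result.put(emitted)

    # Daemon thread so a hung runtime cannot keep the harness process alive.
    threading.Thread(target=worker, daemon=True).start()
    try:
        return result.get(timeout=window_s)
    except queue.Empty:
        return False  # the window elapsed with no output
```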
Bundle
| Path | Target | Compute unit | Quant | On disk |
|---|---|---|---|---|
| ios-gpu/ | iPhone (h18p) | GPU (coreai-pipelined) | dynamic INT4 | 939 MB |
The Qwen/Qwen3-1.7B tokenizer is embedded, and the maximum context is 40,960 tokens.
The iOS bundle is already AOT-compiled (.aimodelc) for the iPhone 17 Pro GPU target.
Measured — iPhone 17 Pro (iPhone18,1 · iOS 27.0)
Greedy decoding, with a 128-token budget for the short-chat prompts (n=3, iso-cold) and 256 tokens for the quality set. Every figure traces back to raw JSONL in the companion benchmark.
| Metric | Value |
|---|---|
| Decode | 44.7 tok/s cold → ~66 tok/s warm |
| TTFT (warm) | ~29 ms |
| Prefill | ~750 tok/s |
| Peak RAM | 248 MB |
| Quality (8 checkable Qs) | 8 / 8, not degenerate |
The kernel cache persists across launches, so only the first launch after a fresh install is genuinely cold (44.7 tok/s). Once primed, decode holds at ~66 tok/s, which is the steady state a user actually sees. That matches MLX Q4 (~66 tok/s at 1095 MB) as the fastest of the runtimes measured at this size, and is roughly double LiteRT-LM int8 (~30 tok/s at 512 MB).
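Since every figure traces to raw JSONL, here is a minimal sketch of how cold/warm decode, TTFT and prefill could be derived from such records. The field names (`phase`, `prompt_tokens`, `generated_tokens`, `ttft_ms`, `total_ms`) are a hypothetical schema, not the companion benchmark's actual format. The metric definitions are also assumptions: decode counts tokens after the first over the time after TTFT, and prefill treats the whole prompt as processed within TTFT.

```python
import json
from statistics import mean


def summarize(jsonl_text: str) -> dict[str, float]:
    """Summarize throughput from benchmark JSONL (hypothetical schema).

    Each record is assumed to carry a "phase" of "cold" or "warm"; at least
    one record of each phase is required.
    """
    runs = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    cold = [r for r in runs if r["phase"] == "cold"]
    warm = [r for r in runs if r["phase"] == "warm"]

    def decode_tps(r: dict) -> float:
        # The first token arrives at TTFT; the remaining tokens are pure decode.
        return (r["generated_tokens"] - 1) / ((r["total_ms"] - r["ttft_ms"]) / 1000)

    def prefill_tps(r: dict) -> float:
        # Approximation: the whole prompt is processed before the first token.
        return r["prompt_tokens"] / (r["ttft_ms"] / 1000)

    return {
        "decode_cold_tps": mean(decode_tps(r) for r in cold),
        "decode_warm_tps": mean(decode_tps(r) for r in warm),
        "ttft_warm_ms": mean(r["ttft_ms"] for r in warm),
        "prefill_warm_tps": mean(prefill_tps(r) for r in warm),
    }
```

Separating phases explicitly, rather than assuming the first record is cold, keeps the summary valid whichever order the runs were logged in.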
Usage
The iOS bundle is AOT-compiled and side-loaded into the app container; load it through Core AI
with the embedded tokenizer and the coreai-pipelined GPU engine. See the coreai-models
recipe, and CoreAIChatMac for an interactive harness.
Provenance
Converted from Qwen/Qwen3-1.7B (Apache-2.0).
Quant: dynamic INT4 (linear). Producer: coreai-build-3600.67.5.8.1.
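As a generic illustration of linear INT4 weight quantization (reading "(linear)" as the quantization mode, which is an assumption; it may instead refer to the linear layers), the sketch below maps each group of weights onto the signed 4-bit range with one scale per group. The group size of 32, symmetric rounding and the function names are all assumptions, not the coreai-models recipe or Core AI's actual kernel.

```python
def quantize_int4(weights: list[float], group_size: int = 32) -> tuple[list[int], list[float]]:
    """Symmetric linear INT4 quantization with one scale per group (illustrative only)."""
    codes: list[int] = []
    scales: list[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map [-max_abs, max_abs] onto [-7, 7]; an all-zero group gets a dummy scale.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(-8, min(7, q)))  # clamp to the signed 4-bit range [-8, 7]
    return codes, scales


def dequantize_int4(codes: list[int], scales: list[float], group_size: int = 32) -> list[float]:
    """Reconstruct approximate weights: each code times its group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]
```

Under these assumptions, each reconstructed weight is within half a scale step of the original, and storing one FP16 scale per 32 weights adds 16 / 32 = 0.5 bits, so about 4.5 bits per weight in total.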