How a 400B MoE Model Runs on iPhone 17 Pro with Flash‑MoE
The article details how the open‑source Flash‑MoE engine enables the 400B‑parameter Qwen3.5‑397B‑A17B mixture‑of‑experts model to run on an iPhone 17 Pro, achieving about 0.6 tokens per second through a custom Metal pipeline, GCD‑driven SSD streaming, and aggressive caching strategies.
The demo shows a 400 billion‑parameter mixture‑of‑experts (MoE) model, Qwen3.5‑397B‑A17B, executing on an iPhone 17 Pro’s A19 Pro chip at roughly 0.6 tokens per second, a result the author describes as “incredible.”
The demonstration originates from the open‑source Flash‑MoE project (https://github.com/Anemll/flash-moe/tree/iOS-App), which implements the model inference engine without any Python dependencies, using only Objective‑C and C.
Flash‑MoE replaces modern AI frameworks with a handcrafted Metal shader pipeline. It introduces a three‑command‑buffer GPU pipeline that eliminates CPU‑GPU synchronization overhead.
To feed the model, the engine uses Apple’s Grand Central Dispatch (GCD) to issue many concurrent pread() calls, streaming data from the SSD at about 17.5 GB/s. That figure was measured on an M3 Max, not on the iPhone.
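The concurrent-read idea can be sketched in portable C. This is an illustration only, not Flash‑MoE's code: it swaps GCD for POSIX threads so it builds outside Apple platforms, and the names read_job and read_experts_concurrently, the MAX_JOBS limit, and the one-thread-per-read design are all assumptions made for the example. The point it shows is that pread() takes an explicit offset, so several reads against one file descriptor can be in flight without contending for a shared file cursor.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>      /* open(), used by callers to obtain the descriptor */
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_JOBS 16     /* hypothetical cap on reads in flight */

/* One positional read request (hypothetical structure). */
typedef struct {
    int     fd;      /* shared, read-only file descriptor */
    off_t   offset;  /* byte offset of the data in the weights file */
    size_t  length;  /* number of bytes to read */
    void   *dst;     /* destination buffer, at least `length` bytes */
    ssize_t result;  /* bytes actually read, or -1 on error */
} read_job;

/* Thread body: keep calling pread() until the range is complete,
 * the file ends, or an error occurs. */
static void *run_read_job(void *arg) {
    read_job *job = arg;
    size_t done = 0;
    while (done < job->length) {
        ssize_t n = pread(job->fd, (char *)job->dst + done,
                          job->length - done, job->offset + (off_t)done);
        if (n < 0) { job->result = -1; return NULL; }
        if (n == 0) break;  /* reached end of file early */
        done += (size_t)n;
    }
    job->result = (ssize_t)done;
    return NULL;
}

/* Start every read at once, then wait for all of them.
 * Returns 0 only if each job read its full length, otherwise -1. */
int read_experts_concurrently(read_job *jobs, size_t count) {
    assert(count <= MAX_JOBS);
    pthread_t threads[MAX_JOBS];
    size_t started = 0;
    int ok = 1;

    for (; started < count; started++) {
        if (pthread_create(&threads[started], NULL,
                           run_read_job, &jobs[started]) != 0) {
            ok = 0;
            break;
        }
    }
    for (size_t i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }
    for (size_t i = 0; i < count; i++) {
        if (i >= started || jobs[i].result != (ssize_t)jobs[i].length) {
            ok = 0;
        }
    }
    return ok ? 0 : -1;
}
```

A caller would open the weights file once, fill one read_job per needed expert, and call read_experts_concurrently before launching the GPU work that consumes the buffers. On older toolchains the sketch may need the -pthread compiler flag.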
The full model occupies 209 GB (120 GB after 2‑bit expert quantization). By streaming from flash storage, only 5.5 GB of weights reside in RAM at any moment.
Key innovations include:
Three‑command‑buffer GPU pipeline that removes CPU‑GPU sync costs.
BLAS‑accelerated linear attention for the Gated‑DeltaNet layer.
A counter‑intuitive caching strategy that disables application‑level caches and leaves expert data entirely to the macOS page cache, which reduces memory thrashing and yields a 38% speedup.
On an Apple M3 Max, the system sustains 5.74 tokens/s and peaks above 7 tokens/s, marking the first demonstration that a model more than four times larger than DRAM can run at interactive speeds on consumer hardware.
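The figures quoted so far allow a rough back-of-envelope check. The sketch below is illustrative arithmetic, not something taken from the project: the function names are invented for the example, and the first result is only a ceiling, since it assumes decoding is limited purely by storage reads at the quoted bandwidth and ignores anything served from the page cache.

```c
#include <assert.h>

/* Ceiling on how many gigabytes can be streamed from storage per
 * generated token, given a sustained read bandwidth and a decode rate.
 * Real traffic is lower whenever reads hit the page cache. */
double max_streamed_gb_per_token(double read_gb_per_s, double tokens_per_s) {
    assert(read_gb_per_s > 0.0 && tokens_per_s > 0.0);
    return read_gb_per_s / tokens_per_s;
}

/* How many times larger the on-disk model is than the weights
 * kept resident in RAM. */
double disk_to_resident_ratio(double model_gb, double resident_gb) {
    assert(model_gb > 0.0 && resident_gb > 0.0);
    return model_gb / resident_gb;
}
```

With the M3 Max numbers above, max_streamed_gb_per_token(17.5, 5.74) comes to about 3.05 GB per token, and disk_to_resident_ratio(209.0, 5.5) is 38, meaning the resident weights are roughly one thirty-eighth of the full model.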
The work builds on Apple’s 2023 research paper “LLM in a flash: Efficient Large Language Model Inference with Limited Memory” (arXiv 2312.11514) and the Flash‑MoE paper “Streaming a 397B Parameter Mixture‑of‑Experts Model from NVMe at 5.7 Tokens/Second on Consumer Hardware.” Both describe storing model parameters on flash and paging them in as needed, exploiting active expert sparsity.
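Expert sparsity is what makes paging practical: for each token the router picks only a few experts per layer, so only their weights have to be fetched. The sketch below shows a generic top-k selection over router scores and a byte-offset lookup. It is an assumption-laden illustration, not the project's code: the function names, the flat file layout (layers stored back to back, experts of equal size), and the sizes used in the usage note are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the k highest-scoring experts from the router scores.
 * Indices are written to `chosen` in descending score order. */
void select_top_k_experts(const float *scores, size_t num_experts,
                          size_t k, size_t *chosen) {
    assert(k <= num_experts);
    for (size_t slot = 0; slot < k; slot++) {
        size_t best = num_experts;  /* sentinel: nothing picked yet */
        for (size_t e = 0; e < num_experts; e++) {
            int taken = 0;
            for (size_t j = 0; j < slot; j++) {
                if (chosen[j] == e) { taken = 1; break; }
            }
            if (taken) continue;
            if (best == num_experts || scores[e] > scores[best]) {
                best = e;
            }
        }
        chosen[slot] = best;
    }
}

/* Byte offset of one expert's weights, assuming a hypothetical flat
 * layout with equally sized experts stored layer by layer. */
size_t expert_offset(size_t layer, size_t expert,
                     size_t experts_per_layer, size_t expert_bytes) {
    assert(expert < experts_per_layer);
    return (layer * experts_per_layer + expert) * expert_bytes;
}
```

For example, with scores {0.1, 0.7, 0.2, 0.9, 0.05} and k = 2, the selection yields experts 3 and 1, and under the assumed layout expert_offset(1, 3, 8, 100) is 1100. Those offsets are what a positional read such as pread() would then use to fetch just the selected experts.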
Although the iPhone 17 Pro has only 12 GB of RAM, the active portion of the model fits in memory, so inference works, albeit slowly and with occasional stutter. The achievement brings the vision of running massive LLMs locally a step closer.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
