Running a 400B Mixture‑of‑Experts LLM on iPhone 17 Pro: Inside Flash‑MoE
This article describes how the open‑source Flash‑MoE engine runs a 400‑billion‑parameter Mixture‑of‑Experts language model on an iPhone 17 Pro by eliminating Python dependencies, using a custom Metal pipeline, and streaming weights directly from SSD. On Apple Silicon Macs the same engine reaches interactive token throughput.
Flash‑MoE Engine
Flash‑MoE is an open‑source inference engine written entirely in Objective‑C and C, eliminating Python runtimes and heavyweight frameworks such as PyTorch. The source tree for the iOS demo is at https://github.com/Anemll/flash-moe/tree/iOS-App; the main repository is https://github.com/danveloper/flash-moe?tab=readme-ov-file.
Target Model and Hardware
The engine runs Qwen3.5‑397B‑A17B, a Mixture‑of‑Experts model with roughly 400 B parameters, on the A19 Pro chip of the iPhone 17 Pro. The same code also runs on Apple Silicon Macs (e.g., M3 Max), where SSD read bandwidth reaches approximately 17.5 GB/s.
Memory Management
The model occupies 209 GB on disk; with 2‑bit expert quantisation it shrinks to 120 GB. Flash‑MoE streams parameters from NVMe storage with many parallel pread() calls, keeping only 5.5 GB of weights resident in RAM at any moment.
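The streaming read path described above can be sketched as follows. This is a minimal illustration in Python, not the engine's Objective‑C/C code: the helper name read_spans, the (offset, length) span layout, and the worker count are all hypothetical. It issues positional reads with os.pread (available on Unix‑like systems) from a thread pool, so every read carries its own offset and no shared file cursor has to be locked.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_spans(path, spans, workers=8):
    """Read (offset, length) byte spans from one file concurrently.

    Hypothetical helper: a real engine would map expert IDs to spans
    using its own weight-file layout.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        def read_one(span):
            offset, length = span
            parts = []
            # pread may return fewer bytes than requested, so loop.
            while length > 0:
                chunk = os.pread(fd, length, offset)
                if not chunk:
                    raise EOFError("span extends past end of file")
                parts.append(chunk)
                offset += len(chunk)
                length -= len(chunk)
            return b"".join(parts)

        # Each pread carries its own offset, so threads can share one fd.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(read_one, spans))
    finally:
        os.close(fd)
```

The function keeps no cache of its own. If the same span is requested again, the read is served from the operating system's page cache whenever those pages are still resident.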
Key Engineering Innovations
Three‑command‑buffer GPU pipeline: hand‑written Metal shaders are submitted through a three‑command‑buffer pipeline, which removes CPU‑GPU synchronization overhead.
BLAS‑accelerated linear attention: the linear attention of the Gated‑DeltaNet layers runs through BLAS routines, improving compute efficiency.
Page‑cache‑only caching strategy: all caching of model data is delegated to the macOS page cache. Dropping application‑level caches reduces memory‑compressor thrashing and yields a 38% speedup.
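To make the linear‑attention item more concrete, here is a small sketch of one recurrent step of the gated delta rule, in the form it is commonly written for Gated DeltaNet layers. It is an assumption that the model's layers use exactly this form, and the function names are hypothetical; Flash‑MoE's actual kernels are not reproduced here. The step consists of a matrix‑vector product followed by a rank‑1 update, which are the kinds of operations BLAS exposes as GEMV and GER.

```python
def gated_delta_step(state, k, v, alpha, beta):
    """One recurrent step of the gated delta rule (illustrative sketch).

    state: d_v x d_k matrix as a list of rows; k: length d_k; v: length d_v.
    alpha in [0, 1] decays the old state; beta in [0, 1] is the write strength.
    Returns S' = alpha*S + beta * (v - alpha*S k) k^T as a new matrix.
    """
    decayed = [[alpha * s for s in row] for row in state]
    # Matrix-vector product: what the decayed memory currently predicts for k.
    predicted = [sum(s * kj for s, kj in zip(row, k)) for row in decayed]
    # Rank-1 update that moves the prediction for k toward v.
    return [
        [s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
        for row, vi, pi in zip(decayed, v, predicted)
    ]


def read_out(state, q):
    """Output for query q: o = S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in state]
```

The state has a fixed size regardless of sequence length, so this kind of layer needs no growing key‑value cache.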
Performance Results
On an Apple M3 Max the engine sustains 5.74 tokens/second and peaks above 7 tokens/second. This demonstrates that a model whose parameter footprint is more than four times the machine's DRAM capacity can run at interactive speed on consumer hardware.
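A back‑of‑the‑envelope calculation relates these throughput figures to the SSD bandwidth quoted earlier. The sketch below assumes, purely for illustration, that SSD reads are the only limit on speed, which the source does not claim; the function name is hypothetical. Under that assumption, it gives an upper bound on how much data could be streamed per generated token.

```python
def max_gb_per_token(read_bandwidth_gb_s, tokens_per_s):
    """Upper bound on GB readable from SSD per generated token,
    assuming reads run at full bandwidth for the whole token time."""
    return read_bandwidth_gb_s / tokens_per_s


# Figures from the article: ~17.5 GB/s SSD reads, 5.74 tokens/second.
budget = max_gb_per_token(17.5, 5.74)
print(f"{budget:.2f} GB per token at most")  # prints "3.05 GB per token at most"
```

Since compute also takes part of each token's time, the amount actually read per token is presumably lower than this bound.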
Underlying Research
The approach builds on the arXiv paper “LLM in a flash: Efficient Large Language Model Inference with Limited Memory” (2023) (https://arxiv.org/abs/2312.11514), which proposes storing model parameters on flash storage and loading them into DRAM on demand. The sparsity of MoE models is what makes this workable within the iPhone 17 Pro’s 12 GB of RAM: each token activates only a small subset of experts (about 17 B of the 397 B parameters, per the A17B suffix in the model name), so only those weights have to be read.
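The sizes quoted in this article can be turned into rough per‑parameter figures. The sketch below assumes decimal gigabytes and reads the parameter counts off the model name (397 B total, 17 B active per token); the helper names are hypothetical. The results are averages over the whole file, including non‑expert weights and any metadata, so they are not exact bit widths of any single tensor.

```python
GB = 10**9  # assumption: sizes are decimal gigabytes


def bits_per_param(size_gb, params_billion):
    """Average bits stored per parameter for a given on-disk size."""
    return size_gb * GB * 8 / (params_billion * 10**9)


def active_fraction(active_billion, total_billion):
    """Share of parameters used for one token in a sparse MoE model."""
    return active_billion / total_billion


print(f"{bits_per_param(209, 397):.2f} bits/param at 209 GB")  # 4.21
print(f"{bits_per_param(120, 397):.2f} bits/param at 120 GB")  # 2.42
print(f"{active_fraction(17, 397):.1%} of parameters active per token")  # 4.3%
```

At about 4.2 bits per parameter, the 209 GB file is consistent with weights that are already quantized to roughly 4 bits rather than stored at full precision.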
Observations
On the phone, inference remains slow and occasionally jittery, but the successful execution of a 400 B model on a mobile device marks a concrete step toward locally‑run large language models.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
