How a 400B Mixture‑of‑Experts Model Runs on the iPhone 17 Pro

The article details the Flash‑MoE project that streams the 400 billion‑parameter Qwen3.5‑397B‑A17B mixture‑of‑experts model on an iPhone 17 Pro, achieving up to 0.6 tokens per second with a custom Metal‑GPU pipeline, zero‑Python code, and SSD‑backed weight streaming that keeps only 5.5 GB in RAM.

Machine Heart
Machine Heart
Machine Heart
How a 400B Mixture‑of‑Experts Model Runs on the iPhone 17 Pro

The author introduces a surprising demo where a 400‑billion‑parameter mixture‑of‑experts (MoE) language model, Qwen3.5‑397B‑A17B, runs directly on the Apple iPhone 17 Pro (A19 Pro chip). Although the output speed is only 0.6 tokens / s, the achievement demonstrates that consumer‑grade silicon can host models far larger than its RAM.

The implementation originates from the open‑source Flash‑MoE project (GitHub: https://github.com/Anemll/flash-moe/tree/iOS-App). Flash‑MoE was co‑developed by Daniel Woods and the Claude Code 4.6 team and builds on Apple’s 2023 research paper “LLM in a flash: Efficient Large Language Model Inference with Limited Memory” (arXiv:2312.11514). The core idea follows the Intel Optane concept: when DRAM cannot hold the whole model, use fast SSD storage as an extension.

Key engineering choices that enable the demo are:

Zero Python dependency : the engine is written entirely in Objective‑C and C, avoiding heavyweight frameworks such as PyTorch.

Custom Metal pipeline : developers hand‑crafted Metal shaders and a three‑command‑buffer GPU pipeline, eliminating CPU‑GPU synchronization stalls.

GCD parallel pread() : Apple’s Grand Central Dispatch launches many concurrent pread() calls, saturating the SSD’s sequential read bandwidth (≈ 17.5 GB/s on an M3 Max).

The model’s raw size is 209 GB; after 2‑bit expert quantization it shrinks to 120 GB. Flash‑MoE streams the weights from NVMe, keeping at most 5.5 GB in memory at any moment. This streaming approach follows the paper “Flash‑MoE: Streaming a 397B Parameter Mixture‑of‑Experts Model from NVMe at 5.7 Tokens/Second on Consumer Hardware”.

Three major innovations are highlighted:

Three‑command‑buffer GPU pipeline : removes CPU‑GPU sync overhead.

BLAS‑accelerated linear attention : applied to the Gated‑DeltaNet layer.

Counter‑intuitive cache strategy : disables application‑level caches, delegating all expert data to macOS page cache, which reduces memory‑compressor thrashing and yields a 38 % speed gain.

Performance on an Apple M3 Max chip reaches a sustained 5.74 tokens / s and peaks above 7 tokens / s, proving that a model whose parameter footprint exceeds DRAM capacity by more than four times can still run interactively on consumer hardware.

The author notes that the original developers did not anticipate a 400 B model fitting on an iPhone, yet the demonstration pushes the frontier of “local large‑model” deployment, despite the slow and occasionally choppy user experience.

Overall, the work showcases how careful low‑level engineering—zero‑Python code, Metal‑GPU pipelines, aggressive SSD streaming, and expert‑aware quantization—can bridge the gap between massive LLMs and the limited memory of mobile devices.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

LLMMixture of ExpertsiPhoneMetalFlash-MoEModel Streaming
Machine Heart
Written by

Machine Heart

Professional AI media and industry service platform

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.