How a 400B MoE Model Runs on iPhone 17 Pro with Flash‑MoE

The article details how the open‑source Flash‑MoE engine enables the 400B‑parameter Qwen3.5‑397B‑A17B mixture‑of‑experts model to run on an iPhone 17 Pro, achieving about 0.6 tokens per second through a custom Metal pipeline, GCD‑driven SSD streaming, and aggressive caching strategies.

Data Party THU

The demo shows a 400 billion‑parameter mixture‑of‑experts (MoE) model, Qwen3.5‑397B‑A17B, executing on an iPhone 17 Pro’s A19 Pro chip at roughly 0.6 tokens per second, a result the author describes as “incredible.”

The demonstration originates from the open‑source Flash‑MoE project (https://github.com/Anemll/flash-moe/tree/iOS-App), which implements the model inference engine without any Python dependencies, using only Objective‑C and C.

Flash‑MoE replaces modern AI frameworks with a handcrafted Metal shader pipeline. It introduces a three‑command‑buffer GPU pipeline that eliminates CPU‑GPU synchronization overhead.

To feed the model, the engine leverages Apple’s Grand Central Dispatch (GCD) to launch many concurrent pread() calls, streaming data from the SSD at about 17.5 GB/s on an M3 Max device.
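The concurrent-read pattern can be sketched in Python as a rough analogue. This is not the project's code: a thread pool stands in for GCD, and the file layout (fixed-size expert blobs stored at `index * EXPERT_BYTES`), the helper names, and the sizes are all hypothetical. The point it illustrates is that `pread` takes an explicit offset, so many workers can share one file descriptor without seeking or locking.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

EXPERT_BYTES = 4096  # hypothetical size of one expert blob


def read_expert(fd: int, index: int) -> bytes:
    """Read one expert blob at its fixed offset; pread leaves the file position untouched."""
    return os.pread(fd, EXPERT_BYTES, index * EXPERT_BYTES)


def read_experts(path: str, indices: list[int], workers: int = 8) -> list[bytes]:
    """Issue one positional read per requested expert, concurrently, in request order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda i: read_expert(fd, i), indices))
    finally:
        os.close(fd)


def make_demo_file(num_experts: int) -> str:
    """Write a toy weight file in which expert i is filled with the byte value i."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(num_experts):
            f.write(bytes([i]) * EXPERT_BYTES)
        return f.name


if __name__ == "__main__":
    demo_path = make_demo_file(16)
    demo_blobs = read_experts(demo_path, [3, 7, 11])
    print([b[0] for b in demo_blobs])  # first byte of each blob identifies the expert
    os.remove(demo_path)
```

Because each read is independent, the storage device can service several requests at once, which is what lets throughput approach the SSD's sequential bandwidth even though the individual reads are scattered.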

The full model occupies 209 GB (120 GB after 2‑bit expert quantization). By streaming from flash storage, only 5.5 GB of weights reside in RAM at any moment.
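A quick back-of-the-envelope check of these figures, using only the sizes quoted above and the 397B parameter count from the model name. It assumes 1 GB = 10^9 bytes and spreads each file size over all parameters, so the results are averages and not the exact quantization formats.

```python
PARAMS = 397e9       # from the model name Qwen3.5-397B-A17B
FULL_GB = 209.0      # full model size quoted in the article
QUANT_GB = 120.0     # size after 2-bit expert quantization
RESIDENT_GB = 5.5    # weights held in RAM at any moment


def avg_bits_per_param(size_gb: float, params: float = PARAMS) -> float:
    """Average storage cost per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / params


def resident_fraction(resident_gb: float, total_gb: float) -> float:
    """Share of the on-disk model that sits in RAM at once."""
    return resident_gb / total_gb


print(f"full model:    {avg_bits_per_param(FULL_GB):.2f} bits/param")
print(f"2-bit experts: {avg_bits_per_param(QUANT_GB):.2f} bits/param")
print(f"resident:      {resident_fraction(RESIDENT_GB, FULL_GB):.1%} of the full file")
```

Under those assumptions the full model averages about 4.2 bits per parameter, the variant with 2-bit experts about 2.4, and the resident weights are under 3% of the full file.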

Key innovations include:

Three‑command‑buffer GPU pipeline that removes CPU‑GPU sync costs.

BLAS‑accelerated linear attention for the Gated‑DeltaNet layer.

A counter‑intuitive caching strategy: application‑level caches are disabled and the macOS page cache alone manages expert data, which reduces memory thrashing and yields a 38% speedup.

On an Apple M3 Max chip the system sustains 5.74 tokens/s and peaks above 7 tokens/s, marking the first proof that a model exceeding DRAM capacity by more than four times can run at interactive speeds on consumer hardware.

The work builds on Apple’s 2023 research paper “LLM in a flash: Efficient Large Language Model Inference with Limited Memory” (arXiv 2312.11514) and the Flash‑MoE paper “Streaming a 397B Parameter Mixture‑of‑Experts Model from NVMe at 5.7 Tokens/Second on Consumer Hardware.” Both describe storing model parameters on flash and paging them in as needed, exploiting active expert sparsity.
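Expert sparsity is what makes paging practical: for each token a router picks a few experts, and only those need to be read from flash. A minimal, framework-free sketch of top-k selection follows. The expert count, `k`, and the router scores are made-up illustration values, not the model's actual configuration, and `top_k_experts` is a hypothetical helper.

```python
import heapq
import math


def top_k_experts(scores: list[float], k: int) -> list[tuple[int, float]]:
    """Return (expert_index, weight) for the k highest router scores.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    chosen = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    peak = max(scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router output for one token over 8 experts.
scores = [0.1, 2.3, -1.0, 0.7, 3.1, 0.0, -0.5, 1.9]
selected = top_k_experts(scores, k=2)
needed = sorted(i for i, _ in selected)  # only these experts must be paged in
print(needed, f"{len(needed) / len(scores):.0%} of experts touched")
```

By Qwen's naming convention, the A17B suffix indicates about 17B parameters active per token out of 397B, roughly 4% of the model, which is why the working set can be so much smaller than the file on disk.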

Although the iPhone 17 Pro has only 12 GB of RAM, the active portion of the model fits in memory, so inference works, albeit slowly and with occasional stutter. The result advances the goal of running very large LLMs locally.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by Data Party THU, the official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.