Running a 400B Mixture‑of‑Experts LLM on iPhone 17 Pro: Inside Flash‑MoE

This article describes how the open‑source Flash‑MoE engine runs a 400‑billion‑parameter Mixture‑of‑Experts language model on an iPhone 17 Pro. It does so by eliminating Python dependencies, using a custom Metal pipeline, and streaming weights directly from SSD; on an Apple M3 Max the same engine reaches interactive token throughput.

Machine Learning Algorithms & Natural Language Processing

Flash‑MoE Engine

Flash‑MoE is an open‑source inference engine written entirely in Objective‑C and C, eliminating Python runtimes and heavyweight frameworks such as PyTorch. The source tree for the iOS demo is at https://github.com/Anemll/flash-moe/tree/iOS-App; the main repository is https://github.com/danveloper/flash-moe?tab=readme-ov-file.

Target Model and Hardware

The engine runs Qwen3.5‑397B‑A17B, a Mixture‑of‑Experts model with roughly 400 billion parameters (397 B in total; the A17B suffix indicates about 17 B active per token), on the A19 Pro chip of the iPhone 17 Pro. The same code also runs on Apple Silicon Macs (e.g., M3 Max), where SSD read bandwidth reaches approximately 17.5 GB/s.

Memory Management

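The model occupies 209 GB on disk, a figure that implies already‑quantised weights, since 397 B parameters at 16 bits each would need roughly 794 GB. With 2‑bit expert quantisation it shrinks to 120 GB. Flash‑MoE streams parameters from NVMe storage using massively parallel pread() calls, keeping only 5.5 GB of weights resident in RAM at any moment.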
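The streaming idea can be illustrated with a minimal sketch in Python rather than the engine's Objective‑C and C. It is not Flash‑MoE's code: the function names, the thread‑pool size, and the toy file layout of four fixed‑size "experts" are all assumptions made for illustration. The only real API relied on is os.pread, which reads at an explicit offset without moving the file position, so several threads can share one file descriptor.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_chunks_parallel(path, chunks, max_workers=8):
    """Read (offset, length) byte ranges from `path` concurrently.

    os.pread takes an explicit offset and leaves the file position alone,
    so the worker threads can safely share a single file descriptor.
    A production reader would also loop on short reads.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map returns results in the same order as `chunks`.
            return list(pool.map(lambda c: os.pread(fd, c[1], c[0]), chunks))
    finally:
        os.close(fd)


def demo():
    # Hypothetical layout: four fixed-size "experts" stored back to back.
    expert_size = 1024
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(4):
            f.write(bytes([i]) * expert_size)
        path = f.name
    try:
        wanted = [3, 1]  # only the experts needed for the current token
        chunks = [(i * expert_size, expert_size) for i in wanted]
        return read_chunks_parallel(path, chunks)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    blobs = demo()
    print([len(b) for b in blobs])
```

Nothing is kept in an application‑level cache here: every call goes back to the file, and any reuse of recently read data is left to the operating system.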

Key Engineering Innovations

Three‑command‑buffer GPU pipeline: hand‑written Metal shaders are dispatched through three command buffers, removing CPU‑GPU synchronization overhead.

BLAS‑accelerated linear attention: the linear attention of the Gated‑DeltaNet layers is computed with BLAS routines, improving compute efficiency.

Page‑cache‑only caching strategy: all model‑data caching is delegated to the macOS page cache. Dropping application‑level caches reduces memory‑compressor thrashing and yields a 38% speed increase.
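For readers unfamiliar with the linear attention mentioned above, the following sketch shows one recurrent step of the gated delta rule as it is commonly written in the Gated DeltaNet literature: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, with output o_t = S_t q_t. This is an assumption about the general technique, not a transcription of Flash‑MoE's kernels; the function name is hypothetical, and the plain Python loops stand in for the matrix‑vector products and rank‑1 updates that a BLAS library would perform.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta rule (illustrative sketch).

    S is the d_v x d_k state matrix, stored as a list of rows.
    S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T
    out   = S_new q
    """
    d_v, d_k = len(S), len(k)
    # Matrix-vector product S k (a GEMV in BLAS terms).
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Decay the state, erase the old value stored under k, write the new one
    # (rank-1 updates in BLAS terms).
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query (another GEMV).
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


if __name__ == "__main__":
    S = [[0.0, 0.0], [0.0, 0.0]]
    key = [1.0, 0.0]
    S, out = gated_delta_step(S, key, key, [2.0, 3.0], alpha=1.0, beta=1.0)
    print(out)
    S, out = gated_delta_step(S, key, key, [5.0, 7.0], alpha=1.0, beta=1.0)
    print(out)
```

With alpha = beta = 1 and a unit‑norm key, writing a second value under the same key replaces the first, which is the "delta" behaviour that distinguishes this update from a purely additive linear attention state.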

Performance Results

On an Apple M3 Max the engine sustains 5.74 tokens / second and peaks above 7 tokens / second. This demonstrates that a model whose parameter footprint exceeds DRAM capacity by more than four times can run at interactive speed on consumer hardware.
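A short back‑of‑envelope calculation helps relate the figures quoted in this article. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and takes the parameter count from the model name (397 B); both are assumptions, and the helper names are hypothetical. It derives the average bits stored per parameter for the 209 GB and 120 GB footprints, and the most data the SSD could deliver per token if reads ran at the full 17.5 GB/s for the whole 5.74 tokens/second run.

```python
GB = 1e9          # assumption: decimal gigabytes
PARAMS = 397e9    # assumption: total parameters, taken from the model name


def bits_per_param(size_gb, params=PARAMS):
    """Average bits stored per parameter for a given on-disk size."""
    return size_gb * GB * 8 / params


def ssd_gb_per_token_ceiling(read_gb_per_s, tokens_per_s):
    """Upper bound on SSD traffic per token if reads saturate the drive."""
    return read_gb_per_s / tokens_per_s


if __name__ == "__main__":
    print(round(bits_per_param(209), 2))                   # 209 GB footprint
    print(round(bits_per_param(120), 2))                   # 120 GB footprint
    print(round(ssd_gb_per_token_ceiling(17.5, 5.74), 2))  # GB per token
```

The results come out at a little over 4 bits and a little under 2.5 bits per parameter, and about 3 GB of SSD reads per token as a ceiling. The real per‑token traffic may be lower, since part of the data can be served from RAM.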

Underlying Research

The approach builds on the arXiv paper “LLM in a flash: Efficient Large Language Model Inference with Limited Memory” (2023) (https://arxiv.org/abs/2312.11514), which proposes storing model parameters on flash storage and pulling them into DRAM on demand. Because an MoE model activates only a few experts per token, only those experts' weights have to be read at each step, which is what lets the workload fit into the iPhone 17 Pro’s 12 GB of RAM.
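The link between expert routing and on‑demand loading can be sketched as follows. This is a generic top‑k routing illustration, not Flash‑MoE's router: the scores, the value of top_k, the fixed expert size, and the contiguous file layout are all hypothetical. The point is only that the router's choice translates directly into a small list of byte ranges to fetch from flash.

```python
import heapq


def select_active_experts(router_scores, top_k):
    """Return the indices of the top_k highest-scoring experts, best first."""
    return heapq.nlargest(
        top_k, range(len(router_scores)), key=router_scores.__getitem__
    )


def expert_byte_ranges(active, expert_bytes, base_offset=0):
    """Map expert indices to (offset, length) ranges in a weight file.

    Assumes a hypothetical layout with equally sized experts stored
    contiguously starting at `base_offset`.
    """
    return [(base_offset + i * expert_bytes, expert_bytes) for i in active]


if __name__ == "__main__":
    scores = [0.10, 0.70, 0.05, 0.15]   # made-up router outputs for one token
    active = select_active_experts(scores, top_k=2)
    print(active)
    print(expert_byte_ranges(active, expert_bytes=100))
    print(len(active) / len(scores))    # fraction of experts touched
```

In this toy case half of the experts are touched; in a large MoE model the active fraction per token is far smaller, which is what keeps the resident set small relative to the full model.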

Observations

Inference remains slow and occasionally jittery, but the successful execution of a 400 B model on a mobile device marks a concrete step toward locally‑run large language models.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
