Tagged articles

Flash-MoE

3 articles · Page 1 of 1

Jun 6, 2026 · Artificial Intelligence

How a 400B MoE Model Runs on iPhone 17 Pro with Flash‑MoE

The article details how the open‑source Flash‑MoE engine enables the 400B‑parameter Qwen3.5‑397B‑A17B mixture‑of‑experts model to run on an iPhone 17 Pro, achieving about 0.6 tokens per second through a custom Metal pipeline, GCD‑driven SSD streaming, and aggressive caching strategies.

400B · Flash-MoE · LLM Inference

Machine Learning Algorithms & Natural Language Processing

May 3, 2026 · Artificial Intelligence

Running a 400B Mixture‑of‑Experts LLM on iPhone 17 Pro: Inside Flash‑MoE

The article details how the open‑source Flash‑MoE engine streams a 400‑billion‑parameter Mixture‑of‑Experts language model on an iPhone 17 Pro, achieving interactive‑level token throughput by eliminating Python dependencies, crafting a custom Metal pipeline, and streaming weights directly from SSD.

Apple Silicon · Flash-MoE · GCD

Machine Heart

May 1, 2026 · Artificial Intelligence

How a 400B Mixture‑of‑Experts Model Runs on the iPhone 17 Pro

The article details the Flash‑MoE project, which streams the 400‑billion‑parameter Qwen3.5‑397B‑A17B mixture‑of‑experts model on an iPhone 17 Pro, achieving up to 0.6 tokens per second with a custom Metal GPU pipeline, a Python‑free codebase, and SSD‑backed weight streaming that keeps only 5.5 GB in RAM.

Flash-MoE · LLM · Metal

Flash-MoE
