Running DeepSeek V4 on M5 Max: 5 tps Speedup Without Large Memory

Developer Anemll demonstrates that the DS4 IQ2_Q2 version of DeepSeek V4 on an Apple M5 Max gains a 5‑tps throughput boost, using SSD‑streamed MoE sidecar loading to run large models without requiring high memory, and provides full build and execution instructions.

AI Engineering

Introduction

Developer Anemll reported that DeepSeek DSpark, running on an Apple M5 Max with the DS4 IQ2_Q2 variant, increased inference speed by 5 tokens per second (tps) compared with conventional decoding. The main bottleneck is batch attention in the verifier, while acceptance rates remain satisfactory. The code is in the dspark-attn branch of the ds4-ssd repository on GitHub; it uses the original FP8/FP4 model format and requires MPP 4.1 and macOS 27. The sidecar version runs faster than GGUF mode.

What is ds4-ssd?

ds4-ssd is an alpha branch of antirez’s DwarfStar 4 (ds4) DeepSeek V4 Flash inference engine. It keeps ds4’s lightweight, self‑contained runtime and adds an SSD‑streamed Mixture‑of‑Experts (MoE) sidecar path for Apple Silicon, allowing only the needed expert parameters to be loaded from storage instead of keeping the entire model in memory.

Core features

SSD streaming load: Dense tensors are stored in standard GGUF files; expert routing files reside in a sidecar directory and are paged in via a slot-bank cache. High-memory devices can use a fully resident GGUF mode. Apple-specific optimizations include Metal-based matmul2d paths (NAX) and ANE-accelerated MLP pre-fill routing.

Trimmed branch scope: Retains the runtime, Metal shaders, GGUF tools, correctness tests, sidecar tests, and core documentation; removes performance-analysis scripts, hand-off notes, session export tools, and ANE probes used only for benchmarking.

Build method

On macOS, run make to produce five executables:

./ds4 – CLI runner
./ds4-server – Local server compatible with OpenAI/Anthropic APIs
./ds4-bench – Throughput benchmark tool
./ds4-eval – Evaluation helper
./ds4-agent – Local coding-assistant front-end

The metal directory is required and must not be removed. CUDA sources are inherited from upstream DS4; the alpha version focuses on validating Apple Silicon SSD streaming.
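As a quick sanity check after the build, the sketch below reports which of the five executables listed above are missing from a directory. The function name check_ds4_binaries is hypothetical and not part of the repository; only the binary names come from the article.

```shell
# Report which of the five expected ds4 executables are missing from a directory.
# check_ds4_binaries is a hypothetical helper, not something shipped with ds4-ssd.
check_ds4_binaries() {
  dir="${1:-.}"
  missing=""
  for bin in ds4 ds4-server ds4-bench ds4-eval ds4-agent; do
    # Require a regular file that also has execute permission.
    if [ ! -f "$dir/$bin" ] || [ ! -x "$dir/$bin" ]; then
      missing="$missing $bin"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}

# Typical use from the repository root (commented out so the sketch has no side effects):
# make && check_ds4_binaries .
```

A non-zero exit status makes the helper easy to chain after make in a script.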

Running modes

1. SSD sidecar mode (low‑memory devices)

This mode lets typical Apple Silicon machines run the model without a large amount of memory.

Download the pre-built sidecar package with ./download_model.sh sidecar, or download the ~156 GB native MXFP4 package.

Set the environment variable pointing to the sidecar root containing manifest.json and dense/model-dense.gguf:

export DS4_SIDECAR_DIR="$PWD/models/dsv4-iq2xxs-expert-major"
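Before launching, it can help to confirm that the sidecar root really contains the two files named above. The following is a minimal sketch; check_sidecar_dir is a hypothetical helper, and it checks only for manifest.json and dense/model-dense.gguf, not for the expert files themselves.

```shell
# Verify that a sidecar root contains the two files the article names.
# check_sidecar_dir is a hypothetical helper; it does not inspect file contents.
check_sidecar_dir() {
  root="$1"
  if [ -z "$root" ]; then
    echo "sidecar directory not given"
    return 1
  fi
  status=0
  for rel in manifest.json dense/model-dense.gguf; do
    if [ ! -f "$root/$rel" ]; then
      echo "missing: $rel"
      status=1
    fi
  done
  if [ "$status" -eq 0 ]; then
    echo "sidecar layout looks complete"
  fi
  return "$status"
}

# Example (commented out): check_sidecar_dir "$DS4_SIDECAR_DIR"
```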

Launch the model:

./ds4 \
  -m "$DS4_SIDECAR_DIR" \
  --moe-slot-bank 8 \
  --ctx 8192 \
  -p "Hello"

When SSD streaming is active, the logs report that the sidecar tuning profile was applied, the Flash-MoE sidecar was loaded, and the Flash-MoE slot banks were allocated. Starting with --moe-slot-bank 8 is recommended; larger values increase memory usage but reduce SSD reads. The --ssd-cache auto option lets the system adjust the cache size automatically.
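The log check described above can be scripted. This sketch assumes the run's output was saved to a file and that the three messages appear with the wording quoted in the article ("applied sidecar tuning profile", "Flash-MoE sidecar loaded", "Flash-MoE slot banks allocated"); the exact log format may differ, and check_ssd_streaming_log is a hypothetical helper.

```shell
# Look for the three sidecar messages in a saved log file.
# The marker strings are taken from the article; the real log wording is an assumption.
check_ssd_streaming_log() {
  log="$1"
  status=0
  for marker in \
    "applied sidecar tuning profile" \
    "Flash-MoE sidecar loaded" \
    "Flash-MoE slot banks allocated"
  do
    # -F: fixed string, -i: ignore case, -q: no output, only an exit status.
    if grep -qiF -- "$marker" "$log"; then
      echo "found: $marker"
    else
      echo "not found: $marker"
      status=1
    fi
  done
  return "$status"
}

# Example (commented out); assumes the messages go to stdout or stderr:
# ./ds4 -m "$DS4_SIDECAR_DIR" --moe-slot-bank 8 --ctx 8192 -p "Hello" 2>&1 | tee ds4.log
# check_ssd_streaming_log ds4.log
```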

2. Resident sidecar mode (high‑memory Apple Silicon)

Add the --resident flag to load the entire sidecar into RAM, eliminating SSD read latency. The local server exposes the model name deepseek-v4-flash via the GET /v1/models endpoint.
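To confirm that a running server exposes the model, the response from GET /v1/models can be checked for the model name. The sketch below assumes an OpenAI-style response in which each model appears as an object with an "id" field; the server address and the DS4_SERVER_URL variable are assumptions, since the article does not state a host or port.

```shell
# Succeed if the JSON read from stdin contains a model entry whose id equals $1.
# This is a pattern match that tolerates whitespace around the colon, not a full JSON parser.
has_model_id() {
  grep -Eq "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Example (commented out). DS4_SERVER_URL and the fallback address are assumptions:
# curl -s "${DS4_SERVER_URL:-http://127.0.0.1:8000}/v1/models" \
#   | has_model_id deepseek-v4-flash && echo "deepseek-v4-flash is served"
```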

3. DSpark mode (speed‑up specialization)

DSpark is DeepSeek’s speculative decoding technique that markedly raises inference speed.

Download the Flash DSpark draft package:

DS4_DSPARK_DRAFT_DIR="$DS4_DSPARK_DRAFT" ./download_model.sh dspark

Run a baseline test in resident sidecar mode, then add --draft dspark and related flags:

DS4_AGENT_ALLOW_BACKEND_STATS=1 DS4_DSPARK_PERF=1 ./ds4 \
  -m "$DS4_SIDECAR_DIR" \
  --resident \
  --draft dspark \
  --draft-path "$DS4_DSPARK_DRAFT" \
  --draft-verify 4 \
  --draft-scheduler static \
  --temp 0 \
  --nothink \
  -n 1000 \
  -c 4096 \
  -p "Make a game of Space Invader in Pygame"

DSpark currently supports only greedy decoding, so --temp 0 must be set: a non-zero temperature disables the draft verifier. Setting --draft-verify 4 avoids the slower fifth draft position. On the M5 Max, the DS4 IQ2_Q2 version in DSpark mode outperforms conventional decoding by 5 tps.
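When comparing the baseline run with the DSpark run, throughput is simply generated tokens divided by elapsed seconds. The helpers below are a minimal sketch with hypothetical names; the numbers in the comments are illustrative only and are not measurements from the article, which reports just the 5 tps difference.

```shell
# Tokens per second from a token count and an elapsed time in seconds.
tps() {
  awk -v n="$1" -v s="$2" 'BEGIN { if (s <= 0) exit 1; printf "%.2f\n", n / s }'
}

# Absolute and relative gain of a DSpark run over a baseline run (both in tps).
tps_gain() {
  awk -v b="$1" -v d="$2" 'BEGIN {
    if (b <= 0) exit 1
    printf "%.2f tps (%.1f%%)\n", d - b, (d - b) / b * 100
  }'
}

# Illustrative numbers only (not from the article):
# tps 1000 50          # prints 20.00
# tps 1000 40          # prints 25.00
# tps_gain 20.00 25.00 # prints 5.00 tps (25.0%)
```

Keeping -n, -c, and the prompt identical between the two runs makes the comparison meaningful, since both affect the measured rate.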

4. Resident GGUF mode

High-memory devices can download the full GGUF model (e.g., the IQ2_XXS quantized version from Huihui) and run it directly. A 96 GB M3 Ultra leaves little headroom, so other applications should be closed first.
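A rough way to judge headroom before choosing resident GGUF mode is to compare the model file size with physical memory. The sketch below is only a lower bound, since context buffers and other runtime allocations come on top of the weights; fits_in_ram is a hypothetical helper, and the model path in the example is a placeholder.

```shell
# Compare a model file's size with a total memory figure given in bytes.
# Succeeds only if the file is smaller than the memory; runtime overhead is not counted.
fits_in_ram() {
  model="$1"
  total_bytes="$2"
  model_bytes=$(wc -c < "$model" | tr -d ' ')
  awk -v m="$model_bytes" -v t="$total_bytes" 'BEGIN {
    gib = 2 ^ 30
    printf "model %.1f GiB, RAM %.1f GiB, headroom %.1f GiB\n", m / gib, t / gib, (t - m) / gib
    exit ((m < t) ? 0 : 1)
  }'
}

# Example on macOS (commented out); hw.memsize reports physical memory in bytes:
# fits_in_ram path/to/model.gguf "$(sysctl -n hw.memsize)"
```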

Project attribution

ds4-ssd builds on antirez’s DwarfStar 4, incorporates ideas from llama.cpp and GGML quantization, and leverages Apple’s “LLM in a flash” paper and the flash‑moe project. Apple Neural Engine paths reference Liu Liu’s GPU int8 de‑quantization work and maderix’s public ANE API documentation. Redistribution must retain the repository license and author attributions.

Project repository: https://github.com/Anemll/ds4-ssd/tree/dspark-attn

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

AI Engineering

Focused on cutting‑edge product and technology information and practical experience sharing in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).
