How Redis Creator Built a Metal‑Only Engine to Run DeepSeek V4 Flash at Full Speed on Mac

The ds4.c project, authored by Redis founder Salvatore Sanfilippo, is a Metal‑only C inference engine that uses asymmetric 2‑bit quantization, disk‑based KV caching, and OpenAI/Anthropic‑compatible APIs to achieve usable performance for DeepSeek V4 Flash on high‑end Apple Silicon Macs.

Java Architect Essentials

DeepSeek V4 Flash on Apple Silicon

DeepSeek released the V4 series on 24 April. The Flash variant has 284 B total parameters, of which 13 B are active per token, and a 1 M‑token context window. A model of this size previously required cloud deployment.

ds4 – a purpose‑built inference engine

Salvatore Sanfilippo (antirez) created ds4.c, a local inference engine written from scratch in C + Metal that runs exclusively on Apple Silicon GPUs.

Implementation

The source consists of three languages: C (55.4 %), Objective‑C (30.2 %), Metal (13.8 %).

Metal‑only design means there is no general‑purpose runtime, framework, or backend abstraction layer between the model code and the GPU; as a consequence, the engine runs only on Apple GPUs.

Key technical choices

Asymmetric quantization: Only the routed MoE experts are quantized to 2‑bit (IQ2_XXS for up/gate, Q2_K for down). Shared expert, projection, and routing layers stay at Q8 precision.

KV cache on disk: The KV state after prefill is written to disk, keyed by the SHA‑1 hash of the token‑ID sequence. Subsequent requests with matching prefixes load the cache directly, skipping the prefill step.

Dual API compatibility layer: Provides OpenAI‑style /v1/chat/completions and Anthropic‑style /v1/messages endpoints with tool‑calling support, allowing existing agent clients to interact with ds4 without modification.

Performance benchmarks

MacBook Pro M3 Max (128 GB RAM) – 2‑bit quantization, 32 K context: 58.52 tokens/s prefill, 26.68 tokens/s generation.

Mac Studio M3 Ultra (512 GB RAM) – long prompt (11,709 tokens): 468.03 tokens/s prefill, 27.39 tokens/s generation.

These speeds make a 284 B MoE model usable on a personal machine.

Motivation and design philosophy

Generic inference engines abstract away model specifics, which can lead to performance compromises. ds4 focuses narrowly on a single model, validates against official logits, and includes extensive agent‑integration tests to ensure real‑world usability.

The README defines three requirements for a practical local inference stack: an HTTP API, a model‑specific GGUF, and thorough agent‑integration tests – a “full‑stack local inference” approach.

Community reaction and future outlook

Early adopters have run ds4 on 128 GB Macs. Discussions on Hacker News consider the possibility of a dedicated, highly optimized engine per model, trading generality for performance.

The current implementation is Metal‑only; the author mentions potential future CUDA support but emphasizes keeping the project small, fast, and focused.

Known limitations

A macOS bug causes the CPU inference path to crash, so ds4 relies solely on the GPU path.

Project URL: https://github.com/antirez/ds4

Reference: http://invece.org/

Hacker News discussion: https://news.ycombinator.com/item?id=48050751
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
