Redis Founder Crafts DeepSeek V4 AI Inference Engine, Node.js Star Applauds
Redis creator Salvatore Sanfilippo (antirez) released DS4, a Metal‑only C inference engine tailored for DeepSeek V4 Flash on high‑end Macs, featuring narrow model focus, 2‑bit quantization, disk‑based KV cache, benchmark speeds around 26 tokens/s, and a dual OpenAI/Anthropic compatible server.
On May 7, antirez (Salvatore Sanfilippo, the original author of Redis) announced DS4, a hand‑written AI inference engine for DeepSeek V4 Flash, with the source code at github.com/antirez/ds4. The project is written in pure C, targets Metal on high‑end Macs, and is deliberately narrow in scope.
It’s not another llama.cpp
DS4’s README states that the engine does one thing only: run DeepSeek V4 Flash’s Metal computation graph, handling model loading, prompt templating, KV state, and the server API. Antirez argues that the local inference ecosystem is fragmented, with many models released rapidly and few receiving thorough attention; DS4 therefore bets on a single model, validating its logits, testing long contexts, and doing enough agent-integration checks to be dependable.
Why DeepSeek V4 Flash?
- Fewer active parameters → faster: V4 Flash is a Mixture‑of‑Experts model with far fewer active parameters than total parameters.
- Adaptive chain‑of‑thought length: the "thinking" mode scales the reasoning segment length with problem difficulty, often using only a fifth of the tokens required by other models.
- A 1 million‑token context window.
- 284 B parameters, providing noticeably broader knowledge than 27 B or 35 B models.
- Better English and Italian generation, approaching frontier models.
- A very small KV cache, small enough that it can even be stored on disk.
- 2‑bit quantization without collapse, using a highly asymmetric scheme (IQ2_XXS for up/gate, Q2_K for down) while keeping other components at full precision.
The 2‑bit quantization enables the 284 B model to fit into a MacBook Pro with 128 GB RAM, which is the hardware baseline for DS4.
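A back-of-the-envelope estimate shows why 2-bit quantization is what makes the 128 GB baseline workable. The split below is an assumption for illustration, not a figure from the README; the real layout mixes IQ2_XXS, Q2_K, and full-precision tensors, and quantization scales plus the KV cache add overhead on top:

```python
# Rough weight-memory estimate for a 284 B-parameter model.
# Assumed split (hypothetical): ~95% of weights in the 2-bit expert
# matrices, the remainder kept at 16-bit precision.
TOTAL_PARAMS = 284e9
two_bit_fraction = 0.95

bytes_2bit = TOTAL_PARAMS * two_bit_fraction * 2 / 8    # 2 bits per weight
bytes_fp16 = TOTAL_PARAMS * (1 - two_bit_fraction) * 2  # 2 bytes per weight

total_gb = (bytes_2bit + bytes_fp16) / 1e9
print(f"~{total_gb:.0f} GB of weights")  # → ~96 GB of weights
```

At 16-bit precision the same model would need roughly 568 GB, far beyond any single Mac, which is why the asymmetric 2-bit scheme is the enabling trick rather than an optimization.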
KV cache as a first‑class citizen on disk
DS4 treats the KV cache as a persistent on‑disk state. When the server starts with --kv-disk-dir and --kv-disk-space-mb, it writes checkpoints at four moments: cold start (after stable prompt prefix), periodic saves during prefill or generation, before eviction (saving the old session), and clean shutdown.
Cache keys are SHA‑1 hashes of token‑id sequences, stored as <sha1>.kv files using ordinary read/write I/O rather than mmap, avoiding extra virtual‑memory mappings.
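The keying scheme can be illustrated with a short sketch. Only the SHA-1-of-token-sequence idea and the `<sha1>.kv` naming come from the project; the exact serialization of token ids that DS4 hashes is a hypothetical choice here:

```python
import hashlib
import struct

def kv_cache_filename(token_ids: list[int]) -> str:
    """Derive an on-disk KV cache filename from a token-id sequence.

    Hypothetical serialization: each token id is packed as a
    little-endian 32-bit integer before hashing.
    """
    blob = b"".join(struct.pack("<I", t) for t in token_ids)
    return hashlib.sha1(blob).hexdigest() + ".kv"

name = kv_cache_filename([1, 15043, 29991])
print(name)  # 40 hex chars + ".kv"
```

Because the key is a pure function of the token prefix, any client that replays the same stable prompt prefix deterministically maps to the same checkpoint file.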
This design benefits agent‑style clients such as Claude Code, whose prompts can exceed 20 k tokens; after the first run, the on‑disk KV cache can be reused, turning a "can run" engine into a "can use" one.
Performance numbers
Benchmarks run by antirez on two machines with --ctx 32768, thinking disabled, greedy decoding, and generating 256 tokens show:
On a 128 GB MacBook Pro with q2 quantization: ~26 tokens/s overall, ~21 tokens/s with long prompts.
On a Mac Studio, q4 quantization yields slightly more stable speeds.
Robust server implementation
DS4 includes a server compatible with both OpenAI and Anthropic APIs. It can be started with:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

Supported endpoints are:

GET /v1/models
POST /v1/chat/completions
POST /v1/completions
POST /v1/messages

The server translates OpenAI fields (messages, temperature, top_p, tools, tool_choice, stream) and converts tool schemas to DeepSeek’s DSML format, while the Anthropic endpoint handles Claude‑style messages.
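A minimal client sketch against the OpenAI-compatible endpoint might look like the following. The port and the exact model identifier are assumptions; the request shape follows the standard chat-completions format rather than anything DS4-specific:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed port; use whatever ds4-server binds

# Standard OpenAI-style chat-completions payload.
payload = {
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Summarize the KV cache design."}],
    "temperature": 0.2,
    "stream": False,
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Sending requires a running ds4-server instance:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same payload, pointed at /v1/messages with Anthropic-style fields, would exercise the Claude-compatible side of the server.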
Configuration examples for Opencode, Pi, and Claude Code are provided; for Claude Code, setting ANTHROPIC_BASE_URL to the local port and mapping model aliases to deepseek‑v4‑flash enables local code‑writing and tool‑use.
Antirez notes that the server processes requests serially on a single Metal worker; concurrent batch inference is not supported, so multi‑user scenarios require caution.
AI‑assisted development disclaimer
“This software was heavily assisted by GPT 5.5; humans led the ideas, testing, and debugging. If you dislike AI‑written code, this software is not for you.”
The README openly acknowledges the substantial role of AI in the project, positioning DS4 as a case study of a solo developer leveraging AI tools to deliver a production‑grade inference engine.
Discussion on “shadow contributors”
In the comment thread, lifcc highlighted GGML’s pervasive influence on new inference engines, noting many “shadow contributors.” Antirez agreed, mentioning similar attribution challenges in Redis, and pointed out that DS4’s LICENSE retains GGML’s copyright and includes an acknowledgments section.
What the project really means
DS4 is not meant to replace llama.cpp or dominate benchmarks; it serves as an experiment in focusing on a single model and hardware stack to achieve a polished, reproducible system: explicit boundaries, disk‑based KV cache, dual‑protocol server, and transparent AI‑assisted development.
For users with a 128 GB Mac interested in DeepSeek V4 Flash, DS4 offers a concrete way to run a 284 B MoE model locally, integrate with Claude Code, and avoid network calls or data sharing.
The README hints that future models may be added as the ecosystem evolves, leaving the community curious about the next target.
