Two LLM Inference Acceleration Projects: A Mac‑Local Engine vs a Data‑Center Engine

This article compares two recent GitHub LLM inference engines—ds4.c, a Metal‑optimized engine for DeepSeek V4 Flash on Apple Silicon Macs, and TokenSpeed, a Python/C++‑based, data‑center‑grade engine for GPU clusters—detailing their design choices, performance numbers, usage instructions, and suitable scenarios.


ds4.c – Mac‑local engine for DeepSeek V4 Flash

ds4.c is a single‑file inference engine built on llama.cpp and GGML, customized exclusively for DeepSeek V4 Flash. It is not a generic GGUF loader or framework; its sole purpose is to extract the highest performance from this one model on Apple Silicon Macs.

Motivation

Fast: DeepSeek V4 Flash has fewer parameters than comparable models yet higher capability, so inference is faster.

Practical thinking mode: reasoning output is often only about one‑fifth the length of other models', and it scales with problem complexity.

Long context: a context length of roughly 1 million tokens, among the longest currently supported.

Efficient KV cache: high KV‑cache compression with disk persistence, enabling long‑context use on a MacBook.

Low memory footprint: 2‑bit quantization lets the model run on a MacBook with 128 GB of memory (a rough sizing sketch follows this list).
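
To make the memory claim concrete, here is a rough back‑of‑envelope calculation in Python. The parameter count is a hypothetical placeholder, not a published figure for DeepSeek V4 Flash, and the sketch ignores KV cache and runtime overhead; only the bits‑to‑bytes arithmetic is general.

# Back-of-envelope weight-memory estimate for a quantized model.
# NOTE: the parameter count below is a hypothetical placeholder, not a
# published figure for DeepSeek V4 Flash; KV cache and runtime overhead
# are ignored.

def approx_weight_gb(param_count: float, bits_per_weight: float) -> float:
    """Weights alone: params * bits / 8 bytes, reported in GB."""
    return param_count * bits_per_weight / 8 / 1e9

params = 400e9  # hypothetical
print(f"q2: ~{approx_weight_gb(params, 2):.0f} GB")  # ~100 GB, fits in 128 GB RAM
print(f"q4: ~{approx_weight_gb(params, 4):.0f} GB")  # ~200 GB, needs 256 GB+ RAM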

Technical features

Metal‑only computation: all arithmetic runs on the Apple GPU, with no CPU fallback.

Asymmetric quantization: only the MoE expert weights use 2‑bit quantization; the rest of the model stays high‑precision to preserve quality (a minimal sketch of the idea follows this list).

Disk KV cache: KV state can be persisted to SSD, so previous context is reused without recomputation.

Coding agent support: verified with Claude Code, opencode, Pi, and other local AI coding tools.
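
The asymmetric scheme can be pictured with a minimal Python sketch. The tensor‑name pattern and the 16‑bit fallback are illustrative assumptions; ds4.c implements this in C over GGML tensors, and its actual naming and precision choices may differ.

# Illustrative per-tensor quantization policy: MoE expert weights get
# aggressive 2-bit quantization, everything else keeps high precision.
# Tensor-name patterns here are hypothetical.

def bits_for_tensor(name: str) -> int:
    """Pick a bit-width based on a tensor's role in the model."""
    if ".experts." in name:  # MoE expert weights tolerate heavy quantization
        return 2
    return 16                # attention, embeddings, router: keep precision

for t in ["blk.3.experts.7.ffn_up", "blk.3.attn_q", "token_embd"]:
    print(f"{t}: {bits_for_tensor(t)}-bit")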

Performance data

MacBook Pro M3 Max 128 GB, q2 quantization: short‑prompt prefill 58.52 t/s, long‑prompt prefill 250.11 t/s, generation 26.68 t/s.

Mac Studio M3 Ultra 512 GB, q2 quantization: short‑prompt prefill 84.43 t/s, long‑prompt prefill 468.03 t/s, generation 36.86 t/s.

Mac Studio M3 Ultra 512 GB, q4 quantization: short‑prompt prefill 78.95 t/s, long‑prompt prefill 448.82 t/s, generation 35.50 t/s.

Usage

# Download 2‑bit quantized model (recommended for 128 GB RAM)
./download_model.sh q2
# Download 4‑bit quantized model (recommended for 256 GB+ RAM)
./download_model.sh q4
# Build
make
# Single‑turn conversation
./ds4 -p "Explain the principle of Redis Stream"
# Interactive mode
./ds4
# Start API server (OpenAI‑compatible)
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv

Supported OpenAI‑compatible endpoints: /v1/models, /v1/chat/completions, /v1/completions, /v1/messages.
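
As a quick smoke test of the OpenAI‑compatible server, a minimal Python client could look like the sketch below. The port (8080) and the model name are assumptions, since the launch command above shows neither; adjust them to whatever ds4-server actually reports.

import requests  # pip install requests

# Assumes ds4-server is listening on localhost:8080 (hypothetical default).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "deepseek-v4-flash",  # model name is an assumption
        "messages": [
            {"role": "user", "content": "Explain the principle of Redis Stream"}
        ],
        "max_tokens": 512,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

Because the server persists KV state under --kv-disk-dir, repeating a long prompt prefix should reuse the cached context rather than re‑running prefill.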

TokenSpeed – Data‑center‑grade inference engine

TokenSpeed targets batch inference in GPU‑rich data centers, aiming to pair TensorRT‑LLM‑level performance with vLLM‑level ease of use. As of May 6 it is a preview release.

Core architecture

Modeling Layer: Local‑SPMD design with a static compiler; placement annotations for collective communication are generated automatically at module boundaries, so users never write parallel logic by hand.

Scheduler: a C++ control plane paired with a Python execution plane; request lifecycle, KV‑cache ownership, and overlapping schedules are encoded as finite‑state machines (a conceptual sketch follows this list), and a compile‑time type system guarantees safe KV reuse.

Kernels: a pluggable, layered kernel system with a public API and a registration table (illustrated after the scheduler sketch below); it includes one of the fastest MLA (Multi‑head Latent Attention) implementations on Blackwell GPUs.
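
The request‑lifecycle FSM can be sketched conceptually in Python. The states and transitions below are illustrative assumptions, not TokenSpeed's actual API; its real control plane is C++, and its safety guarantees come from a compile‑time type system rather than runtime checks.

from enum import Enum, auto

class ReqState(Enum):
    """Hypothetical request lifecycle states."""
    QUEUED = auto()
    PREFILL = auto()
    DECODE = auto()
    FINISHED = auto()

# Legal transitions; TokenSpeed's real FSMs also track KV-cache ownership
# and overlapped schedules.
TRANSITIONS = {
    ReqState.QUEUED: {ReqState.PREFILL},
    ReqState.PREFILL: {ReqState.DECODE},
    ReqState.DECODE: {ReqState.DECODE, ReqState.FINISHED},
    ReqState.FINISHED: set(),
}

def step(state: ReqState, nxt: ReqState) -> ReqState:
    """Reject illegal transitions instead of silently corrupting KV state."""
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {nxt}")
    return nxt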
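
The pluggable kernel layer, likewise, can be pictured as a registration table keyed by operation and platform. This is a hedged illustration of the pattern only; the decorator API and all names below are invented for the example.

from typing import Callable

# Hypothetical registry: kernel implementations register under an
# (op, platform) key; the engine resolves the key at dispatch time.
KERNELS: dict[tuple[str, str], Callable] = {}

def register_kernel(op: str, platform: str):
    def wrap(fn: Callable) -> Callable:
        KERNELS[(op, platform)] = fn
        return fn
    return wrap

@register_kernel("mla_attention", "blackwell")
def mla_attention_blackwell(q, kv_latent):
    """Placeholder for a fast MLA path on Blackwell GPUs."""
    raise NotImplementedError

def dispatch(op: str, platform: str) -> Callable:
    return KERNELS[(op, platform)]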

Performance comparison

A benchmark on an NVIDIA B200 GPU running Kimi K2.5 traces a Pareto curve showing higher throughput than TensorRT‑LLM on agentic workloads.

[Figure: TokenSpeed performance comparison chart]

Current status

Preview version; not recommended for production.

Planned model support: Qwen 3.6, DeepSeek V4, MiniMax M2.7.

Runtime features under development: PD (prefill/decode) separation, EPLB (expert‑parallel load balancing), KV storage, Mamba cache, VLM support, and metrics.

Platform optimizations for Hopper and MI350.

Documentation and getting started

Quick‑start guide.

Server launch instructions.

Model recipes.

Parallel configuration details.

Project links

https://github.com/antirez/ds4 (⭐ 2.7K)

https://github.com/lightseekorg/tokenspeed (⭐ 808)
