Nvidia Endorses TokenSpeed: A Light‑Speed Agent Inference Engine Built in Two Months
TokenSpeed, an open‑source LLM inference engine designed for agent workloads, delivers TensorRT‑LLM‑level performance with vLLM‑level ease of use, outperforms TensorRT‑LLM by up to 11% in throughput, roughly halves latency in speculative‑decoding workloads, and has earned Nvidia's public recommendation.
Background
Coding agents such as Claude Code, Codex, and Cursor generate sessions that easily exceed 50K tokens, creating unprecedented compute demands for these continuous, software‑collaborator‑style workloads.
TokenSpeed Engine Overview
TokenSpeed is an open‑source inference engine designed from first principles for agentic workloads. It combines TensorRT‑LLM‑level performance with vLLM‑level usability and provides the fastest Multi‑head Latent Attention (MLA) kernel on NVIDIA Blackwell.
Modeling Layer
The modeling layer adopts a local SPMD design. Developers annotate I/O placement at module boundaries; a lightweight static compiler then automatically generates the required collective operations, eliminating manual communication code.
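As a rough picture of what such placement annotation can look like, the sketch below is purely illustrative: the names (annotate_io, Shard, Replicate, RowParallelLinear) are hypothetical and are not TokenSpeed's actual API. It only shows the idea that modules declare input/output placement and a static compiler inserts the needed collectives.

# Hypothetical illustration of local-SPMD placement annotation (not TokenSpeed's real API).
# Modules declare I/O placement; a static compiler would insert collectives (e.g. all-reduce).
from dataclasses import dataclass

@dataclass
class Shard:          # tensor is split along one axis across ranks
    axis: int

@dataclass
class Replicate:      # tensor is identical on every rank
    pass

def annotate_io(inputs, outputs):
    """Record placement at a module boundary for a static compiler to consume."""
    def wrap(module_cls):
        module_cls._in_placement = inputs
        module_cls._out_placement = outputs
        return module_cls
    return wrap

@annotate_io(inputs={"x": Shard(axis=-1)}, outputs={"y": Replicate()})
class RowParallelLinear:
    def __init__(self, weight_shard):
        self.weight_shard = weight_shard  # this rank's slice of the full weight

    def forward(self, x):
        # Each rank computes a partial product; because the output is declared
        # Replicate, the compiler would insert the all-reduce automatically,
        # so no explicit communication calls appear in the modeling code.
        return x @ self.weight_shard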
Scheduler
The scheduler separates the control plane from the execution plane. The control plane is implemented in C++ as a finite‑state machine that, together with the type system, enforces safe KV‑cache state transitions at compile time. The execution plane is written in Python to enable rapid iteration.
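To make the state-machine idea concrete, here is a minimal sketch in Python with invented state names; in TokenSpeed the control plane is C++, where the type system can reject invalid transitions at compile time rather than at runtime as this stand-in does.

# Illustrative sketch only: hypothetical KV-cache states and transitions.
from enum import Enum, auto

class KVState(Enum):
    ALLOCATED = auto()   # blocks reserved, nothing written yet
    PREFILLING = auto()  # prefill kernel is writing the prompt's KV
    DECODING = auto()    # decode steps append one token's KV at a time
    FREED = auto()       # blocks returned to the pool

# Allowed transitions of the finite-state machine (hypothetical).
_ALLOWED = {
    KVState.ALLOCATED: {KVState.PREFILLING, KVState.FREED},
    KVState.PREFILLING: {KVState.DECODING, KVState.FREED},
    KVState.DECODING: {KVState.DECODING, KVState.FREED},
    KVState.FREED: set(),
}

def transition(current: KVState, target: KVState) -> KVState:
    """Runtime stand-in for what a typed C++ FSM would enforce statically."""
    if target not in _ALLOWED[current]:
        raise ValueError(f"illegal KV-cache transition {current} -> {target}")
    return target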
Kernel Layer
Kernels are decoupled from the core engine and treated as first‑class modules. The layer offers a portable API, centralized kernel registration, a clearly organized implementation layout, and an extensible plug‑in mechanism for heterogeneous accelerators.
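A centralized registry with a plug-in hook typically looks something like the following; the names here (register_kernel, get_kernel) are illustrative assumptions, not TokenSpeed's actual interfaces.

# Hypothetical sketch of centralized kernel registration with a plug-in hook.
from typing import Callable, Dict, Tuple

_KERNELS: Dict[Tuple[str, str], Callable] = {}  # (op_name, backend) -> implementation

def register_kernel(op: str, backend: str):
    """Decorator used by built-in kernels and third-party plug-ins alike."""
    def wrap(fn: Callable) -> Callable:
        _KERNELS[(op, backend)] = fn
        return fn
    return wrap

def get_kernel(op: str, backend: str) -> Callable:
    """Portable lookup: the engine asks for an op and backend, not a vendor library."""
    return _KERNELS[(op, backend)]

# A plug-in for a different accelerator only needs to register its own kernels;
# the core engine stays untouched.
@register_kernel("mla_decode", backend="cuda")
def mla_decode_cuda(q, kv_cache):
    ...  # would dispatch into the actual device kernel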
Blackwell Optimizations
On NVIDIA Blackwell, TokenSpeed includes a custom MLA kernel that groups q_seqlen and num_heads to improve Tensor Core utilization, and a finely tuned softmax implementation in the binary prefill kernel. This MLA kernel has been adopted by vLLM.
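The "grouping q_seqlen and num_heads" idea can be pictured as folding the head axis into the query-row axis, so even a single-token decode step presents a full tile of work to the Tensor Cores. The PyTorch reshape below only illustrates that grouping with made-up shapes; the actual optimization lives inside the custom CUDA MLA kernel.

# Sketch of folding query heads into the sequence axis (illustrative shapes only).
import torch

# Decode step: 1 query token, many heads, one latent KV cache shared across heads (MLA).
q_seqlen, num_heads, head_dim, kv_len = 1, 128, 576, 8192
q = torch.randn(q_seqlen, num_heads, head_dim)
latent_kv = torch.randn(kv_len, head_dim)

# One (128 x 576) @ (576 x 8192) GEMM keeps MMA tiles full, instead of 128
# separate 1-row matmuls that leave most of each Tensor Core tile idle.
q_folded = q.reshape(q_seqlen * num_heads, head_dim)
scores = q_folded @ latent_kv.T    # shape: (q_seqlen * num_heads, kv_len)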
Performance Evaluation
Evaluation used SWE‑smith traces that reflect real‑world coding‑agent traffic. The goal was to maximize per‑GPU TPM (tokens per minute) while maintaining a minimum per‑user TPS (tokens per second); the floor is typically 70 TPS, and in some scenarios 200 TPS or higher.
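A small illustration of that objective, with invented per-user decode speeds (not TokenSpeed measurements): pick the largest batch size that still meets the TPS floor, then report the resulting per-GPU TPM.

# Illustrative calculation of the serving objective: keep per-user TPS above a
# floor while maximizing per-GPU TPM. The batch -> TPS numbers are made up.
tps_floor = 70.0  # minimum tokens/sec each user must see

per_user_tps = {1: 180.0, 4: 150.0, 8: 110.0, 16: 75.0, 32: 45.0}

best = None
for batch, tps in per_user_tps.items():
    if tps >= tps_floor:
        tpm = batch * tps * 60  # per-GPU tokens per minute at this batch size
        if best is None or tpm > best[1]:
            best = (batch, tpm)

batch, tpm = best
print(f"largest compliant batch = {batch}, per-GPU TPM = {tpm:,.0f}")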
Against TensorRT‑LLM on NVIDIA Blackwell, TokenSpeed achieved approximately 9% lower latency at batch size 1 and about 11% higher throughput near 100 TPS per user. In speculative‑decoding workloads (batch sizes 4, 8, and 16 with a long‑prefix KV cache), latency was reduced by nearly 50% compared to TensorRT‑LLM.
Comparisons of MLA kernels show that TokenSpeed’s prefill kernel outperforms TensorRT‑LLM’s MLA across five typical prefill workloads, and the decode kernel’s query‑axis folding improves Tensor Core utilization.
Project Status
Development began in March 2026. While performance is strong, low‑level components such as PD (prefill‑decode) separation and KV storage are still being merged.
Resources
Blog: https://lightseek.org/blog/lightseek-tokenspeed.html
GitHub: https://github.com/lightseekorg/tokenspeed
