Nvidia Endorses TokenSpeed: A Light‑Speed Agent Inference Engine Built in Two Months

TokenSpeed, an open‑source LLM inference engine designed for agent workloads, delivers TensorRT‑LLM‑level performance with vLLM‑level ease of use, outperforms TensorRT‑LLM by up to 11% in throughput, halves latency on speculative decoding, and has earned Nvidia's public recommendation.

Machine Heart

Background

Coding agents such as Claude Code, Codex, and Cursor routinely generate sessions exceeding 50K tokens, creating unprecedented compute demands for continuous software‑collaborator workloads.

TokenSpeed Engine Overview

TokenSpeed is an open‑source inference engine designed from first principles for agentic workloads. It combines TensorRT‑LLM‑level performance with vLLM‑level usability and provides the fastest Multi‑head Latent Attention (MLA) kernel on NVIDIA Blackwell.

Modeling Layer

The modeling layer adopts a local SPMD design. Developers annotate I/O placement at module boundaries; a lightweight static compiler then automatically generates the required collective operations, eliminating manual communication code.
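The idea can be sketched roughly as follows. This is a minimal illustration of placement annotation and collective inference, not TokenSpeed's actual API; the names `Placement`, `shard`, `replicate`, and `infer_collective` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    """How a tensor is laid out across devices at a module boundary."""
    kind: str           # "shard" or "replicate"
    axis: int = -1      # sharded tensor axis, if any

def shard(axis: int) -> Placement:
    return Placement("shard", axis)

def replicate() -> Placement:
    return Placement("replicate")

def infer_collective(out: Placement, next_in: Placement) -> str:
    """Pick the collective needed between a producer's output placement
    and the next module's declared input placement."""
    if out == next_in:
        return "none"           # placements already agree
    if out.kind == "shard" and next_in.kind == "replicate":
        return "all_gather"     # gather shards onto every device
    if out.kind == "replicate" and next_in.kind == "shard":
        return "slice"          # each device keeps its local slice
    return "all_to_all"         # reshard across a different axis

# A column-parallel layer feeding a layer that accepts the same sharding
# needs no communication; feeding a replicated consumer needs a gather.
assert infer_collective(shard(-1), shard(-1)) == "none"
assert infer_collective(shard(-1), replicate()) == "all_gather"
```

A compiler pass walking module boundaries with a rule table like this is what lets developers skip hand-written communication code.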

Scheduler

The scheduler separates the control plane from the execution plane. The control plane is implemented in C++ as a finite‑state machine that, together with the type system, enforces safe KV‑cache state transitions at compile time. The execution plane is written in Python to enable rapid iteration.
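The safe-transition idea can be sketched as a small state machine. TokenSpeed enforces this at compile time via the C++ type system; the Python sketch below checks the same transition table at run time, and the state names are illustrative rather than taken from the codebase.

```python
from enum import Enum, auto

class KVState(Enum):
    """Lifecycle of one request's KV-cache blocks (illustrative states)."""
    ALLOCATED = auto()
    PREFILLING = auto()
    DECODING = auto()
    FREED = auto()

# Legal transitions; anything not listed is rejected.
_LEGAL = {
    KVState.ALLOCATED:  {KVState.PREFILLING, KVState.FREED},
    KVState.PREFILLING: {KVState.DECODING, KVState.FREED},
    KVState.DECODING:   {KVState.DECODING, KVState.FREED},
    KVState.FREED:      set(),  # freed blocks must never be touched again
}

class KVCacheEntry:
    def __init__(self) -> None:
        self.state = KVState.ALLOCATED

    def transition(self, new: KVState) -> None:
        if new not in _LEGAL[self.state]:
            raise ValueError(f"illegal KV transition {self.state} -> {new}")
        self.state = new
```

In C++, encoding each state as a distinct type turns the `ValueError` here into a compile error, which is the safety property the article describes.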

Kernel Layer

Kernels are decoupled from the core engine and treated as first‑class modules. The layer offers a portable API, centralized registration, organized implementation and an extensible plug‑in mechanism for heterogeneous accelerators.
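Centralized registration with a plug-in fallback might look like the sketch below. The decorator names and the `(op, backend)` keying are assumptions for illustration, not TokenSpeed's real interface.

```python
from typing import Callable, Dict, Tuple

# Central registry: kernels keyed by (operation, backend), resolved at
# build time with a fallback to a portable reference implementation.
_KERNELS: Dict[Tuple[str, str], Callable] = {}

def register_kernel(op: str, backend: str):
    """Decorator recording a kernel implementation in the registry."""
    def deco(fn: Callable) -> Callable:
        _KERNELS[(op, backend)] = fn
        return fn
    return deco

def get_kernel(op: str, backend: str, fallback: str = "reference") -> Callable:
    """Resolve a tuned kernel for the target backend, else fall back."""
    return _KERNELS.get((op, backend), _KERNELS[(op, fallback)])

@register_kernel("mla_decode", "reference")
def mla_decode_reference(q, kv):
    return "reference"          # portable baseline

@register_kernel("mla_decode", "blackwell")
def mla_decode_blackwell(q, kv):
    return "blackwell"          # tuned plug-in for one accelerator
```

A new accelerator vendor only registers its own `(op, backend)` entries; the engine core never changes, which is what "kernels as first-class modules" buys.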

Blackwell Optimizations

On NVIDIA Blackwell, TokenSpeed includes a custom MLA kernel that groups q_seqlen and num_heads to improve Tensor Core utilization, and a finely tuned softmax implementation in the binary prefill kernel. This MLA kernel has been adopted by vLLM.
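The grouping trick can be illustrated numerically: in MLA decode, `q_seqlen` is tiny, so a per-head GEMM leaves Tensor Core tiles mostly empty; folding `q_seqlen` and `num_heads` into one GEMM dimension produces the same result with a single, better-shaped matmul. The NumPy sketch below shows only the algebraic equivalence, not the actual CUDA kernel; the dimensions are arbitrary.

```python
import numpy as np

# q: [q_seqlen, num_heads, head_dim]; MLA's shared latent kv: [kv_len, head_dim]
q_seqlen, num_heads, head_dim, kv_len = 2, 16, 64, 128
rng = np.random.default_rng(0)
q = rng.standard_normal((q_seqlen, num_heads, head_dim))
kv = rng.standard_normal((kv_len, head_dim))

# Unfolded: one small [q_seqlen, head_dim] x [head_dim, kv_len] GEMM per head.
per_head = np.stack([q[:, h, :] @ kv.T for h in range(num_heads)], axis=1)

# Folded: merge (q_seqlen, num_heads) into a single M dimension so one
# [q_seqlen * num_heads, head_dim] GEMM fills the hardware tile.
folded = (q.reshape(q_seqlen * num_heads, head_dim) @ kv.T)
folded = folded.reshape(q_seqlen, num_heads, kv_len)

assert np.allclose(per_head, folded)   # identical attention scores
```

The folding is legal precisely because MLA's heads attend against the same shared latent KV, so all heads multiply by the same `kv` matrix.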

Performance Evaluation

Evaluation used SWE‑smith traces that reflect real‑world coding‑agent traffic. The goal was to maintain a minimum per‑user TPS (tokens per second), typically 70 TPS and in some scenarios 200 TPS or higher, while maximizing per‑GPU TPM (tokens per minute).
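The relationship between the two metrics is simple arithmetic, sketched below; the session count is illustrative, not a figure from the article.

```python
def per_gpu_tpm(per_user_tps: float, concurrent_users: int) -> float:
    """Aggregate tokens per minute one GPU serves at a given per-user rate."""
    return per_user_tps * concurrent_users * 60.0

# Holding the 70 TPS per-user floor with, say, 32 concurrent sessions:
assert per_gpu_tpm(70, 32) == 134_400
```

Maximizing TPM at a fixed TPS floor therefore means packing as many concurrent sessions per GPU as the latency target allows.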

Against TensorRT‑LLM on NVIDIA Blackwell, TokenSpeed achieved approximately 9% lower latency at batch size 1 and about 11% higher throughput near 100 TPS per user. In speculative decoding workloads (batch sizes 4, 8, and 16 with long‑prefix KV cache), latency was reduced by nearly 50% compared to TensorRT‑LLM.

Comparisons of MLA kernels show that TokenSpeed’s prefill kernel outperforms TensorRT‑LLM’s MLA across five typical prefill workloads, and the decode kernel’s query‑axis folding improves Tensor Core utilization.

Project Status

Development began in March 2026. While performance is strong, low‑level components such as prefill–decode (PD) disaggregation and KV‑cache storage are still being merged.

Resources

Blog: https://lightseek.org/blog/lightseek-tokenspeed.html

GitHub: https://github.com/lightseekorg/tokenspeed

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Performance Optimization, vLLM, LLM Inference, TensorRT-LLM, NVIDIA Blackwell, Agent workloads, TokenSpeed
Written by

Machine Heart

Professional AI media and industry service platform