How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

SpecExit combines early exit with speculative decoding so that large reasoning models can detect when they have almost finished thinking. By trimming redundant chain-of-thought steps, it reduces over-thinking by 72% and achieves up to 2.5× faster end-to-end inference without noticeable accuracy loss.

Tencent Tech

Large Reasoning Models (LRMs) such as DeepSeek‑R1 generate long chain‑of‑thought (CoT) sequences that improve reasoning ability but also increase inference cost and latency. Excessively long CoT leads to semantic redundancy—"thinking too much"—which does not improve accuracy and becomes a major bottleneck.

To address this, Tencent engineers introduced SpecExit, an inference-acceleration technique that seamlessly fuses early exit with speculative decoding. The method extracts low-cost signals embedded in a lightweight draft model's hidden states, namely confidence, progress, and remaining length, and uses them to decide when to stop generation early.

SpecExit extends the draft model’s multi‑token prediction (MTP) module with a lightweight multi‑task head that jointly predicts future tokens and the three signals:

Confidence: reliability of the current prediction.

Progress: how much of the reasoning chain has been completed.

Remain: estimated number of tokens left until completion.
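The extended MTP head can be pictured as a small multi-task module on top of the draft model's hidden states. The sketch below is a minimal illustration, not the paper's implementation: the class name, head names, sigmoid-bounded confidence/progress, and the non-negative remaining-length regressor are all assumptions.

```python
import torch
import torch.nn as nn

class SpecExitHead(nn.Module):
    """Hypothetical multi-task head over draft-model hidden states.

    Jointly predicts next-token logits (the MTP task) plus the three
    early-exit signals. Names and parameterization are illustrative.
    """

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size)   # next-token logits
        self.confidence_head = nn.Linear(hidden_size, 1)    # prediction reliability
        self.progress_head = nn.Linear(hidden_size, 1)      # fraction of CoT completed
        self.remain_head = nn.Linear(hidden_size, 1)        # estimated tokens left

    def forward(self, hidden: torch.Tensor) -> dict:
        return {
            "logits": self.lm_head(hidden),
            # Bound confidence and progress to [0, 1] with a sigmoid.
            "confidence": torch.sigmoid(self.confidence_head(hidden)).squeeze(-1),
            "progress": torch.sigmoid(self.progress_head(hidden)).squeeze(-1),
            # Remaining length is non-negative; clamp with ReLU.
            "remain": torch.relu(self.remain_head(hidden)).squeeze(-1),
        }
```

Because the extra heads are linear layers sharing the draft model's hidden states, the signal extraction adds negligible cost on top of drafting itself.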

Two additional mechanisms ensure stable and natural early stopping:

Signal smoothing via exponential weighted moving average (EWMA) to dampen fluctuations.

Semantic boundary control that allows exit only at natural sentence or paragraph boundaries (e.g., ".\n\n" or logical connectors like "But", "So").
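The two mechanisms above can be sketched as a small controller that smooths the raw signals with an EWMA and gates the exit on a boundary check. This is a minimal sketch under assumed values: the class name, thresholds, smoothing factor, and marker set are illustrative, not the paper's.

```python
class EarlyExitController:
    """Illustrative sketch of SpecExit's stopping mechanisms:
    EWMA signal smoothing plus a semantic-boundary gate.
    All hyperparameters here are assumptions for demonstration.
    """

    BOUNDARY_MARKERS = (".\n\n", "But", "So")  # exit only at natural breaks

    def __init__(self, alpha=0.3, progress_threshold=0.95,
                 confidence_threshold=0.9):
        self.alpha = alpha
        self.progress_threshold = progress_threshold
        self.confidence_threshold = confidence_threshold
        self.smoothed_progress = None
        self.smoothed_confidence = None

    def _ewma(self, prev, new):
        # Exponential weighted moving average dampens per-step fluctuations.
        return new if prev is None else self.alpha * new + (1 - self.alpha) * prev

    def should_exit(self, progress, confidence, recent_text):
        self.smoothed_progress = self._ewma(self.smoothed_progress, progress)
        self.smoothed_confidence = self._ewma(self.smoothed_confidence, confidence)
        signals_ready = (self.smoothed_progress >= self.progress_threshold
                         and self.smoothed_confidence >= self.confidence_threshold)
        # Gate: only stop at a sentence/paragraph boundary or connector.
        at_boundary = any(recent_text.endswith(m) for m in self.BOUNDARY_MARKERS)
        return signals_ready and at_boundary
```

Separating the "signals say we are done" check from the "this is a natural place to stop" check is what keeps early-terminated outputs reading like complete reasoning rather than truncated text.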

Experiments on DeepSeek‑R1‑Distill‑Llama‑8B show a 66% reduction in generated token length and up to 2.5× end‑to‑end speedup, while preserving accuracy (only ~0.1% drop). Compared with other early‑exit methods, SpecExit not only shortens the CoT but also improves overall latency.

SpecExit thus demonstrates that hidden states of draft models contain rich inference information, offering a new direction for optimizing large‑model reasoning workloads.

Paper: https://arxiv.org/abs/2509.24248

Code: https://github.com/Tencent/AngelSlim

Tags: AI, Speculative Decoding, Inference Acceleration, Early Exit, Large Reasoning Models