How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

SpecExit combines early exit with speculative decoding so that large reasoning models can detect when they have almost finished thinking. By trimming redundant chain-of-thought steps, it reduces over-thinking by 72% and achieves up to 2.5× faster end-to-end inference without noticeable accuracy loss.

Tencent Tech

Large Reasoning Models (LRMs) such as DeepSeek‑R1 generate long chain‑of‑thought (CoT) sequences that improve reasoning ability but also increase inference cost and latency. Excessively long CoT leads to semantic redundancy—"thinking too much"—which does not improve accuracy and becomes a major bottleneck.

To address this, Tencent engineers introduced SpecExit, an inference-acceleration technique that seamlessly fuses early exit with speculative decoding. The method extracts low-cost signals embedded in a lightweight draft model's hidden states, namely confidence, progress, and remaining length, and uses them to decide when to stop generation early.

SpecExit extends the draft model’s multi‑token prediction (MTP) module with a lightweight multi‑task head that jointly predicts future tokens and the three signals:

Confidence: reliability of the current prediction.

Progress: how much of the reasoning chain has been completed.

Remain: estimated number of tokens left until completion.
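The extended MTP head can be pictured as a small multi-task module on top of the draft model's hidden states. The sketch below is a minimal illustration, not the paper's implementation: the class name, head names, sigmoid-bounded confidence/progress, and the non-negative remaining-length regressor are all assumptions.

```python
import torch
import torch.nn as nn

class SpecExitHead(nn.Module):
    """Hypothetical multi-task head over draft-model hidden states.

    Jointly predicts next-token logits (the MTP task) plus the three
    early-exit signals. Names and parameterization are illustrative.
    """

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size)   # next-token logits
        self.confidence_head = nn.Linear(hidden_size, 1)    # prediction reliability
        self.progress_head = nn.Linear(hidden_size, 1)      # fraction of CoT completed
        self.remain_head = nn.Linear(hidden_size, 1)        # estimated tokens left

    def forward(self, hidden: torch.Tensor) -> dict:
        return {
            "logits": self.lm_head(hidden),
            # Bound confidence and progress to [0, 1] with a sigmoid.
            "confidence": torch.sigmoid(self.confidence_head(hidden)).squeeze(-1),
            "progress": torch.sigmoid(self.progress_head(hidden)).squeeze(-1),
            # Remaining length is non-negative; clamp with ReLU.
            "remain": torch.relu(self.remain_head(hidden)).squeeze(-1),
        }
```

Because the extra heads are linear layers sharing the draft model's hidden states, the signal extraction adds negligible cost on top of drafting itself.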

Two additional mechanisms ensure stable and natural early stopping:

Signal smoothing via exponential weighted moving average (EWMA) to dampen fluctuations.

Semantic boundary control that allows exit only at natural sentence or paragraph boundaries (e.g., ".\n\n" or logical connectors like "But", "So").
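The two mechanisms above can be sketched as a small controller that smooths the raw signals with an EWMA and gates the exit on a boundary check. This is a minimal sketch under assumed values: the class name, thresholds, smoothing factor, and marker set are illustrative, not the paper's.

```python
class EarlyExitController:
    """Illustrative sketch of SpecExit's stopping mechanisms:
    EWMA signal smoothing plus a semantic-boundary gate.
    All hyperparameters here are assumptions for demonstration.
    """

    BOUNDARY_MARKERS = (".\n\n", "But", "So")  # exit only at natural breaks

    def __init__(self, alpha=0.3, progress_threshold=0.95,
                 confidence_threshold=0.9):
        self.alpha = alpha
        self.progress_threshold = progress_threshold
        self.confidence_threshold = confidence_threshold
        self.smoothed_progress = None
        self.smoothed_confidence = None

    def _ewma(self, prev, new):
        # Exponential weighted moving average dampens per-step fluctuations.
        return new if prev is None else self.alpha * new + (1 - self.alpha) * prev

    def should_exit(self, progress, confidence, recent_text):
        self.smoothed_progress = self._ewma(self.smoothed_progress, progress)
        self.smoothed_confidence = self._ewma(self.smoothed_confidence, confidence)
        signals_ready = (self.smoothed_progress >= self.progress_threshold
                         and self.smoothed_confidence >= self.confidence_threshold)
        # Gate: only stop at a sentence/paragraph boundary or connector.
        at_boundary = any(recent_text.endswith(m) for m in self.BOUNDARY_MARKERS)
        return signals_ready and at_boundary
```

Separating the "signals say we are done" check from the "this is a natural place to stop" check is what keeps early-terminated outputs reading like complete reasoning rather than truncated text.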

Experiments on DeepSeek‑R1‑Distill‑Llama‑8B show a 66% reduction in generated token length and up to 2.5× end‑to‑end speedup, while preserving accuracy (only ~0.1% drop). Compared with other early‑exit methods, SpecExit not only shortens the CoT but also improves overall latency.

SpecExit thus demonstrates that hidden states of draft models contain rich inference information, offering a new direction for optimizing large‑model reasoning workloads.

Paper: https://arxiv.org/abs/2509.24248

Code: https://github.com/Tencent/AngelSlim

Tags: AI, Speculative Decoding, Inference Acceleration, Early Exit, Large Reasoning Models