Boost LLM Inference 1.9× with AngelSlim’s Speculative Decoding (Eagle3)

AngelSlim introduces a system‑wide speculative decoding framework called Eagle3 that combines lightweight draft models with parallel verification by large models, delivering up to 1.9× faster inference across LLM, vision‑language, and speech tasks while remaining open‑source and deployment‑ready.

Tencent Technical Engineering

Overview

As large‑scale models move from research prototypes to production, inference cost, latency, and stability become the main bottlenecks, especially in long‑context, high‑concurrency, and multimodal scenarios. Traditional optimization via model compression or hardware scaling is reaching diminishing returns, prompting a re‑examination of the decoding process itself.

AngelSlim + Speculative Decoding

Speculative decoding accelerates inference by letting a small “draft” model generate several candidate tokens while the target large model validates them in parallel. Tencent’s AngelSlim integrates this idea into a train‑and‑deploy pipeline named Eagle3, supporting LLM, vision‑language, and speech models and achieving up to 1.9× speed‑up in real deployments.
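The draft-then-verify loop can be sketched in a few lines. This is a toy stand-in, not AngelSlim's actual API: `draft_next` and `target_next` are dummy next-token functions over a tiny deterministic "language" so the accept/reject logic is easy to follow.

```python
def target_next(context):
    # Stand-in "target" model: always continues with len(context) mod 10.
    return len(context) % 10

def draft_next(context):
    # Cheap "draft" model: agrees with the target most of the time.
    return target_next(context) if len(context) % 4 != 3 else -1

def speculative_step(context, k=4):
    """Draft proposes k tokens; the target verifies them in one pass.

    Returns the tokens accepted this step. The target's own prediction is
    appended after the longest agreeing prefix, so at least one token is
    produced per verification step.
    """
    # 1) Draft autoregressively proposes k candidate tokens.
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) Target verifies all k positions "in parallel" (simulated serially here).
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break

    # 3) At the first mismatch (or after full acceptance), emit the target's token.
    accepted.append(target_next(ctx))
    return accepted

print(speculative_step([0], k=4))  # several tokens per target pass
```

When the draft agrees with the target, multiple tokens are committed per expensive target-model pass; a mismatch still yields one correct token, so output quality is unchanged.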

Key Highlights

Full‑modal speculative‑decoding training covering text‑to‑text, vision‑language, and speech.

Designed for deployment: models trained with AngelSlim can be used directly in vLLM, SGLang, and similar serving frameworks.

A training‑time‑test mechanism lets the Eagle3 draft model learn from its own predictions during training.

Core Training Components

1. Data Processing Module

Provides a stable, reusable data foundation for multimodal speculative‑decoding training.

Data resampling: Re‑sample out‑of‑distribution datasets to create in‑distribution training data.

Data preprocessing: Standardize text, image, and audio inputs into token IDs and loss masks; map draft‑model vocabularies.

Hidden‑feature extraction: Generate hidden representations from the processed token IDs.
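The token-ID and loss-mask step above can be illustrated with a minimal sketch. A toy whitespace "tokenizer" stands in for the real one, and the field names are illustrative rather than AngelSlim's actual schema; the key point is that only response tokens contribute to the loss.

```python
def build_ids_and_mask(prompt, response, vocab):
    """Map words to IDs and mask prompt tokens out of the loss."""
    def encode(text):
        # Toy tokenizer: assign each new word the next free ID.
        return [vocab.setdefault(w, len(vocab)) for w in text.split()]

    prompt_ids = encode(prompt)
    response_ids = encode(response)
    token_ids = prompt_ids + response_ids
    # Loss is computed only on the response: the draft model learns to
    # continue the target model's outputs, not to echo the prompt.
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return token_ids, loss_mask

vocab = {}
ids, mask = build_ids_and_mask("what is 2 + 2", "it is 4", vocab)
print(ids, mask)
```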

2. Model Module

Offers a unified TargetModel interface that abstracts model loading, weight management, forward passes, and hidden‑state extraction, enabling low‑cost extension to new model back‑ends without altering the trainer or core algorithms.
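In spirit, such an interface might look like the following hypothetical sketch (method names and signatures are illustrative, not AngelSlim's actual code): the trainer programs against the abstract base, so a new back-end only needs to implement three methods.

```python
from abc import ABC, abstractmethod

class TargetModel(ABC):
    """Hypothetical unified target-model interface."""

    @abstractmethod
    def load_weights(self, path: str) -> None:
        """Load back-end-specific weights."""

    @abstractmethod
    def forward(self, token_ids: list) -> list:
        """Run a forward pass and return next-token logits."""

    @abstractmethod
    def hidden_states(self, token_ids: list) -> list:
        """Return per-token hidden states for draft-model training."""

class DummyTarget(TargetModel):
    # Trivial back-end showing that callers only see the interface.
    def load_weights(self, path):
        pass

    def forward(self, token_ids):
        return [float(t) for t in token_ids]

    def hidden_states(self, token_ids):
        return [[float(t)] * 2 for t in token_ids]

print(DummyTarget().hidden_states([1, 2]))
```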

3. Trainer Module

Training modes: Online mode (generates hidden states on the fly, suitable for small models) and offline mode (pre‑computes hidden states, suited to large models with limited GPU memory).

Key logic: Implements training‑time‑test, where the Eagle3 model experiences its own multi‑step generation during training.

Checkpoint support: Saves and restores draft‑model parameters, optimizer/LR‑scheduler state, and training progress.
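The training-time-test idea can be sketched conceptually: instead of teacher forcing, the draft model is unrolled for several steps on its own outputs during training, so it sees the same input distribution at train time as at inference. This is a pure-Python stand-in with a toy scalar "model"; a real implementation would propagate gradients through the unroll.

```python
def draft_step(hidden):
    # Toy draft model: next "hidden state" is a fixed linear update.
    return hidden * 0.5 + 1.0

def unroll_loss(h0, targets):
    """Multi-step unroll where each step consumes the model's OWN output."""
    h, loss = h0, 0.0
    for t in targets:
        h = draft_step(h)      # feed back own prediction, not ground truth
        loss += (h - t) ** 2   # per-step squared error vs. target features
    return loss / len(targets)

print(unroll_loss(2.0, [2.0, 1.5, 1.75]))
```

Training on the unrolled trajectory teaches the draft model to stay accurate over multiple speculative steps, not just the first one.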

Practice and Deployment

Quick Start

After installing AngelSlim, run the following commands from the repository root to start an Eagle3 training session:

# Start vLLM service
bash scripts/speculative/run_vllm_server.sh
# Generate training data (optional if you already have SFT data)
bash scripts/speculative/generate_data_for_target_model.sh
# Begin online training
bash scripts/speculative/train_eagle3_online.sh

The first two commands prepare data; they can be skipped if suitable data already exists. The final command launches Eagle3 training. Detailed multimodal training and deployment guides are available for LLM, VLM, and audio (ASR/TTS) at the following URLs:

LLM: https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html

VLM: https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html

Audio (ASR): https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_asr_eagle.html

Audio (TTS): https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html

Performance

Evaluations on vLLM across code, math, instruction‑following, text generation, and multimodal understanding tasks show that with num_speculative_tokens set to 2 or 4, the Eagle3 model can handle input lengths 1.8–3.5× longer and achieves a 1.4–1.9× speed‑up.
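A standard back-of-the-envelope model (not from the AngelSlim docs) helps interpret these numbers: with per-token acceptance rate a and k speculative tokens, the expected number of tokens committed per target-model pass is a geometric sum, (1 − a^(k+1)) / (1 − a). Real speed-up is lower because the draft model and verification are not free, but the reported 1.4–1.9× is consistent with moderately high acceptance rates at k = 2 or 4.

```python
def expected_tokens_per_pass(a, k):
    """Geometric-series estimate: sum of a**i for i in 0..k.

    a: probability each drafted token is accepted (assumed i.i.d.)
    k: number of speculative tokens per step (num_speculative_tokens)
    """
    return (1 - a ** (k + 1)) / (1 - a)

for k in (2, 4):
    for a in (0.6, 0.8):
        print(f"k={k} a={a}: {expected_tokens_per_pass(a, k):.2f} tokens/pass")
```

Note the diminishing returns: raising k from 2 to 4 helps much more when the acceptance rate is high, which is why a small, well-trained draft model matters.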

Code and Model Links

AngelSlim open‑source repository: https://github.com/Tencent/AngelSlim

Hugging‑Face collection of Eagle3 models and weights: https://huggingface.co/collections/AngelSlim/eagle3

Future Plans

Road‑map items include offline hidden‑state generation for vLLM to further cut data‑building and training costs, systematic training‑speed optimizations, and algorithmic research on deeper fusion of multimodal (text, vision, speech) inputs within Eagle3 to broaden speculative decoding’s applicability.
