Boost LLM Inference 1.9× with AngelSlim’s Speculative Decoding (Eagle3)
AngelSlim ships a full-pipeline speculative-decoding framework built around Eagle3, which pairs a lightweight draft model with parallel verification by the target large model. It delivers up to 1.9× faster inference across LLM, vision-language, and speech tasks, and is open-source and deployment-ready.
Overview
As large‑scale models move from research prototypes to production, inference cost, latency, and stability become the main bottlenecks, especially in long‑context, high‑concurrency, and multimodal scenarios. Traditional optimization via model compression or hardware scaling is reaching diminishing returns, prompting a re‑examination of the decoding process itself.
AngelSlim + Speculative Decoding
Speculative decoding accelerates inference by letting a small “draft” model generate several candidate tokens while the target large model validates them in parallel. Tencent’s AngelSlim integrates this idea into a train‑and‑deploy pipeline named Eagle3, supporting LLM, vision‑language, and speech models and achieving up to 1.9× speed‑up in real deployments.
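The draft-and-verify loop can be sketched with toy stand-in models. This is a minimal illustration of the greedy variant, not AngelSlim's implementation; both model functions below are hypothetical placeholders for a real draft and target model.

```python
# Minimal sketch of greedy speculative decoding with toy stand-in models.
# Neither function is AngelSlim's API; both are illustrative placeholders.

def draft_propose(prefix, k):
    """Toy draft model: proposes k candidate next tokens."""
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_next_token(prefix):
    """Toy target model: the 'ground truth' next token for a prefix."""
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, k=4):
    """One draft-and-verify round: accept the longest run of draft tokens
    the target model agrees with, then append one token from the target
    model itself (so every round emits at least one token)."""
    candidates = draft_propose(prefix, k)
    accepted = 0
    cur = list(prefix)
    for tok in candidates:
        # In a real system this check happens in a single parallel
        # forward pass of the target model; here it is sequential.
        if target_next_token(cur) == tok:
            accepted += 1
            cur.append(tok)
        else:
            break
    cur.append(target_next_token(cur))
    return cur, accepted

prefix, n_accepted = speculative_step([5], k=4)
```

Because the verification pass is a single parallel forward through the target model, each accepted draft token amortizes the target model's per-step cost.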
Key Highlights
Full‑modal speculative‑decoding training covering text‑to‑text, vision‑language, and speech.
Designed for deployment: models trained with AngelSlim can be used directly in vLLM, SGLang, and similar serving frameworks.
A training-time-test mechanism exposes the Eagle3 draft model to its own multi-step predictions during training, matching the conditions it will face at inference.
Core Training Components
1. Data Processing Module
Provides a stable, reusable data foundation for multimodal speculative‑decoding training.
Data resampling: re-sample out-of-distribution datasets to produce in-distribution training data.
Data preprocessing: standardize text, image, and audio inputs into token IDs and loss masks; map draft-model vocabularies.
Hidden-feature extraction: generate hidden representations from the processed token IDs.
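The preprocessing step above can be sketched as follows. The word-level "tokenizer" and function name are illustrative assumptions, not AngelSlim's actual code; a real pipeline would use the target model's tokenizer and chat template.

```python
# Hypothetical sketch of preprocessing: turn a prompt/response pair into
# token IDs plus a loss mask that trains only on the response tokens.
# The toy word-level "tokenizer" stands in for the model's real tokenizer.

def preprocess(prompt, response, vocab):
    def encode(text):
        # Assign each unseen word the next free ID in the shared vocab.
        return [vocab.setdefault(w, len(vocab)) for w in text.split()]
    prompt_ids = encode(prompt)
    response_ids = encode(response)
    input_ids = prompt_ids + response_ids
    # Mask is 0 over the prompt (no loss) and 1 over the response,
    # so the draft model is only supervised on assistant tokens.
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return input_ids, loss_mask

vocab = {}
ids, mask = preprocess("translate this", "done now", vocab)
```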
2. Model Module
Offers a unified TargetModel interface that abstracts model loading, weight management, forward passes, and hidden‑state extraction, enabling low‑cost extension to new model back‑ends without altering the trainer or core algorithms.
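An interface of this kind might look like the sketch below. The method names and the trivial backend are assumptions for illustration, not AngelSlim's actual API; the point is that the trainer only ever talks to the abstract base class.

```python
# Illustrative sketch of a unified target-model interface (names are
# hypothetical, not AngelSlim's actual API).
from abc import ABC, abstractmethod

class TargetModel(ABC):
    @abstractmethod
    def load_weights(self, path: str) -> None: ...

    @abstractmethod
    def forward(self, input_ids: list) -> list: ...

    @abstractmethod
    def hidden_states(self, input_ids: list) -> list: ...

class ToyTargetModel(TargetModel):
    """Trivial backend returning zeros of the right shape: swapping in a
    new backend never requires touching the trainer or core algorithms."""
    def load_weights(self, path: str) -> None:
        self.path = path

    def forward(self, input_ids):
        return [0.0] * len(input_ids)

    def hidden_states(self, input_ids):
        # One fixed-width hidden vector per input token.
        return [[0.0, 0.0] for _ in input_ids]

model = ToyTargetModel()
model.load_weights("dummy.bin")
```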
3. Trainer Module
Training modes: online mode generates hidden states on the fly (suited to small models); offline mode pre-computes hidden states (suited to large models or limited GPU memory).
Key logic: implements training-time test, in which the Eagle3 model experiences its own multi-step generation during training.
Checkpoint support: saves and restores draft-model parameters, optimizer/LR-scheduler state, and training progress.
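The training-time-test idea above can be sketched as a multi-step unroll in which the draft model consumes its own outputs rather than teacher-forced ground truth. All names here are illustrative assumptions, not AngelSlim's trainer code.

```python
# Hedged sketch of "training-time test": during training, the draft model
# is unrolled for several steps on its own predictions, so the training
# distribution matches multi-step generation at inference time.
# draft_step is a toy stand-in for a learned draft model.

def draft_step(token, weight):
    """Toy draft model: next token is a function of the current one."""
    return (token * weight) % 97

def unroll(start_token, weight, steps):
    """Feed the model its own outputs for `steps` steps; a real trainer
    would compute a loss against target-model tokens at each step."""
    seq = [start_token]
    for _ in range(steps):
        seq.append(draft_step(seq[-1], weight))
    return seq

trajectory = unroll(3, weight=2, steps=3)
```

Teacher forcing alone never shows the draft model its own errors; unrolling like this is what lets it stay accurate over the multi-token proposals used at serving time.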
Practice and Deployment
Quick Start
After installing AngelSlim, run the following commands from the repository root to start an Eagle3 training session:
# Start vLLM service
bash scripts/speculative/run_vllm_server.sh
# Generate training data (optional if you already have SFT data)
bash scripts/speculative/generate_data_for_target_model.sh
# Begin online training
bash scripts/speculative/train_eagle3_online.sh

The first two commands prepare data and can be skipped if suitable data already exists; the final command launches Eagle3 training. Detailed training and deployment guides are available for LLM, VLM, and audio (ASR/TTS) models at the following URLs:
https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/eagle.html
https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/vlm_eagle.html
https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_asr_eagle.html
https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle/audio_tts_eagle.html
Performance
Evaluations on vLLM across code, math, instruction-following, text-generation, and multimodal-understanding tasks show that with num_speculative_tokens set to 2 or 4, Eagle3 reaches average acceptance lengths of 1.8–3.5 tokens and a 1.4–1.9× end-to-end speed-up.
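As a rough sanity check on how acceptance length relates to speed-up, a common back-of-the-envelope model (an assumption for illustration, not AngelSlim's measurement methodology) treats each round as emitting the accepted tokens plus one target-model token, at the cost of one target forward pass plus k cheaper draft passes:

```python
# Back-of-the-envelope speed-up model for speculative decoding.
# a: average accepted draft tokens per round
# k: draft tokens proposed per round (num_speculative_tokens)
# c: draft-model cost per token as a fraction of the target model's cost

def estimated_speedup(a, k, c):
    """Tokens emitted per round (a + 1) divided by the round's relative
    cost (one target pass plus k draft passes at fractional cost c)."""
    return (a + 1) / (1 + k * c)

s = estimated_speedup(a=2.0, k=4, c=0.25)
```

This simple model ignores verification overhead and batching effects, but it shows why higher acceptance lengths and cheaper draft models both push the speed-up toward the reported range.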
Code and Model Links
AngelSlim open‑source repository: https://github.com/Tencent/AngelSlim
Hugging Face collection of Eagle3 models and weights: https://huggingface.co/collections/AngelSlim/eagle3
Future Plans
Road‑map items include offline hidden‑state generation for vLLM to further cut data‑building and training costs, systematic training‑speed optimizations, and algorithmic research on deeper fusion of multimodal (text, vision, speech) inputs within Eagle3 to broaden speculative decoding’s applicability.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.