SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×
SpecForge is an open-source training framework built around Eagle3 that enables end-to-end speculative-sampling training for ultra-large language models. It integrates tightly with the SGLang inference engine, offers both online and offline training modes, supports advanced parallelism strategies, and delivers up to a 2.18× inference speedup in benchmarks. All code and pretrained draft models are available on GitHub and Hugging Face.
Why a New Speculative Training Framework?
Speculative decoding is the de facto method for accelerating LLM inference, but the ecosystem has lacked an end-to-end training framework that scales to trillion-parameter models and integrates tightly with the SGLang inference engine. Existing solutions either require joint pre-training (e.g., MTP) or are difficult to integrate.
Core Features
Native support for the latest open‑source architectures, including complex MoE layers and Transformer variants.
Scalable distributed training with Fully Sharded Data Parallel (FSDP) and Tensor Parallelism (TP) for efficient GPU-cluster utilization (a minimal FSDP sketch follows this list).
Memory‑efficient optimizations that reduce overhead even for models with trillions of parameters.
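To make the FSDP point concrete, here is a minimal sketch of sharding a draft model with PyTorch's FSDP API; the module used is a stand-in, not SpecForge's actual draft-model class or wrapping policy.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

def build_sharded_draft(model: torch.nn.Module) -> FSDP:
    # Each rank keeps only a shard of parameters, gradients, and optimizer state,
    # which is what keeps drafts for very large targets trainable on a GPU cluster.
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=torch.cuda.current_device(),
    )

if __name__ == "__main__":
    dist.init_process_group("nccl")  # launched via torchrun
    draft = build_sharded_draft(torch.nn.Linear(4096, 4096).cuda())  # stand-in module
```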
Eagle3 Integration
Eagle3 is a state‑of‑the‑art speculative sampling method that trains a lightweight draft model to predict the token distribution of a larger target model, achieving high acceptance rates and significant performance gains.
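To make the draft/target interaction concrete, here is a greatly simplified verification loop (greedy acceptance, batch size 1, Hugging-Face-style models that return .logits). It illustrates only the generic draft-then-verify idea; Eagle3 additionally reuses the target's hidden features and verifies a token tree rather than a single chain.

```python
import torch

def speculative_step(draft_model, target_model, prefix_ids: torch.Tensor, k: int = 4):
    """One simplified draft-then-verify step with greedy acceptance (batch size 1)."""
    # 1) Draft k tokens autoregressively with the small draft model.
    draft_ids = prefix_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2) Verify all k drafted positions with a single target forward pass.
    target_logits = target_model(draft_ids).logits
    n_prefix = prefix_ids.shape[-1]
    accepted = prefix_ids
    for i in range(k):
        target_tok = target_logits[:, n_prefix - 1 + i].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, target_tok], dim=-1)
        if not torch.equal(target_tok, draft_ids[:, n_prefix + i : n_prefix + i + 1]):
            break  # first mismatch: keep the target's token and stop accepting drafts
    return accepted
```

The higher the draft model's acceptance rate, the more of the k drafted tokens survive verification per target forward pass, which is where the speedup comes from.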
Training‑Time Test (TTT) Support
SpecForge encapsulates the complex Training‑Time Test (TTT) mechanism, which simulates multiple generation steps to strengthen the draft model. It provides verified implementations of the specialized attention masks and recursive data loops required by TTT, ensuring correctness and performance without burdening users.
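The flavor of TTT can be conveyed with a heavily simplified sketch: unroll the draft head for several simulated steps and accumulate a loss at each step. The function signature, tensor names, and loss below are hypothetical; the per-step label shifting and the specialized attention masks mentioned above are exactly the parts SpecForge implements and verifies for you.

```python
import torch
import torch.nn.functional as F

def ttt_style_loss(draft_head, target_hidden, target_next_ids, num_steps: int = 3):
    """Hypothetical sketch of a multi-step (TTT-style) training objective.

    target_hidden:   hidden states from the frozen target model  [batch, seq, dim]
    target_next_ids: tokens the draft should learn to predict    [batch, seq]
    """
    loss = 0.0
    hidden = target_hidden
    for _ in range(num_steps):
        # Each simulated step re-runs the draft head on its own previous output,
        # mimicking how the draft is queried for several tokens at inference time.
        logits, hidden = draft_head(hidden)  # hypothetical (logits, hidden) signature
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_next_ids.reshape(-1),
        )
        # (Per-step label shifting and the TTT attention mask are omitted for brevity.)
    return loss / num_steps
```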
Dual Training Modes: Online & Offline
SpecForge offers two modes for collecting hidden states from the base model (a simplified sketch of both paths follows this list):
Online mode: Generates hidden states on the fly, delivering maximum speed and flexibility; ideal for rapid experiments with limited storage.
Offline mode: Pre-computes and stores hidden states, guaranteeing reproducibility and high efficiency when ample storage is available.
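A rough illustration of the two paths, assuming a Hugging-Face-style target model and hypothetical helper and argument names (the real SpecForge data pipeline differs in detail):

```python
import torch

@torch.no_grad()
def get_hidden_states(target_model, batch_ids, mode: str, cache_path=None):
    """Hypothetical helper contrasting online vs. offline hidden-state collection."""
    if mode == "online":
        # Online: run the frozen target model during draft training; nothing is stored.
        out = target_model(batch_ids, output_hidden_states=True)
        return out.hidden_states
    if mode == "offline":
        # Offline: load hidden states that were pre-computed once and written to disk.
        return torch.load(cache_path)
    raise ValueError(f"unknown mode: {mode}")
```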
Extensibility
The framework uses modular interfaces to register new draft and base models easily. It also implements multiple parallel strategies (FSDP, TP) to ensure efficient training of massive models.
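One common way to expose that kind of modularity is a small registry; the decorator and class below are hypothetical and only illustrate the pattern, not SpecForge's actual interface.

```python
import torch.nn as nn

DRAFT_MODELS = {}  # hypothetical registry: name -> draft-model class

def register_draft_model(name: str):
    def decorator(cls):
        DRAFT_MODELS[name] = cls
        return cls
    return decorator

@register_draft_model("my-eagle3-draft")
class MyEagle3Draft(nn.Module):
    """Placeholder draft head; a real one would mirror the target model's config."""
    def __init__(self, hidden_size: int = 4096, vocab_size: int = 128_256):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states):
        return self.proj(hidden_states)

# Training code can then construct draft models by name:
draft = DRAFT_MODELS["my-eagle3-draft"]()
```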
Experiments
To validate SpecForge, the team trained Scout and Maverick draft models for LLaMA-4 on the ShareGPT and UltraChat datasets (320K samples). The models achieved strong results on industry benchmarks such as MT-Bench, with the Maverick draft delivering a 2.18× inference speedup.
Performance plots sweep speculative-num-steps on the x-axis, with SGLang's speculative-eagle-topk=8 and speculative-num-draft-tokens=10, and show the optimal configurations identified via the bench_speculative script in the SGLang codebase.
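For reference, these settings map onto SGLang's speculative-decoding server arguments. The sketch below assumes SGLang's offline Engine API forwards the corresponding keyword arguments to its server args; the target model path is a placeholder, and the draft path is the Scout checkpoint listed under Resources.

```python
import sglang as sgl

# Assumption: sgl.Engine accepts the speculative-decoding server arguments as kwargs.
llm = sgl.Engine(
    model_path="<path-to-Llama-4-Scout-target-model>",  # placeholder target model
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="lmsys/sglang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-v1",
    speculative_num_steps=3,        # the value swept on the plots' x-axis
    speculative_eagle_topk=8,
    speculative_num_draft_tokens=10,
)
print(llm.generate("Speculative decoding speeds up inference because")["text"])
```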
Resources
All source code, including TTT and data processing, is available on GitHub:
https://github.com/sgl-project/SpecForge
Pre‑trained draft models can be downloaded from Hugging Face:
LLaMA‑4 Scout: https://huggingface.co/lmsys/sglang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-v1
LLaMA‑4 Maverick: https://huggingface.co/lmsys/sglang-EAGLE3-Llama-4-Maverick-17B-128E-Instruct-v1
Roadmap
Support more model architectures, including Kimi K2 and Qwen‑3 MoE.
Integrate vision‑language models (VLM) into SpecForge.
Further improve training efficiency with better parallel strategies and kernel optimizations.