SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×

SpecForge is an open‑source training framework built around Eagle3 that brings end‑to‑end speculative sampling to ultra‑large language models. It integrates tightly with the SGLang inference engine, offers both online and offline training modes, and supports advanced parallelism strategies, delivering up to a 2.18× inference speedup on benchmark tests. All code and pretrained draft models are available on GitHub and Hugging Face.


Why a New Speculative Training Framework?

Speculative decoding is the de facto method for accelerating LLM inference, but the ecosystem has lacked an end‑to‑end training framework that scales to trillion‑parameter models and integrates tightly with the SGLang inference engine. Existing solutions either require joint pre‑training with the base model (e.g., MTP) or are difficult to integrate.

Core Features

Native support for the latest open‑source architectures, including complex MoE layers and Transformer variants.

Scalable distributed training with Fully Sharded Data Parallel (FSDP) and Tensor Parallelism (TP) for efficient GPU‑cluster utilization.

Memory‑efficient optimizations that reduce overhead even for models with trillions of parameters.

Eagle3 Integration

Eagle3 is a state‑of‑the‑art speculative sampling method that trains a lightweight draft model to predict the token distribution of a larger target model, achieving high acceptance rates and significant performance gains.
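The core accept/verify loop behind speculative sampling can be sketched in a few lines. This is an illustrative greedy‑verification toy, not Eagle3's actual sampling‑based acceptance rule; `draft_next` and `target_next` are hypothetical single‑token predictors standing in for the draft and target models.

```python
def draft_propose(prefix, k, draft_next):
    """Draft model autoregressively proposes k candidate tokens."""
    tokens = []
    for _ in range(k):
        tokens.append(draft_next(prefix + tokens))
    return tokens

def speculative_step(prefix, k, draft_next, target_next):
    """One speculative decoding step (toy greedy-verification variant):
    the target checks all k draft tokens and keeps the longest agreeing
    prefix; a mismatch is replaced by the target's own token."""
    proposal = draft_propose(prefix, k, draft_next)
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft matched the target
        else:
            accepted.append(expected)  # correct with the target's token, stop
            break
    else:
        # all k drafts accepted: the target's pass yields one bonus token
        accepted.append(target_next(prefix + accepted))
    return accepted
```

The higher the draft model's acceptance rate, the more of the k proposed tokens survive verification per target forward pass, which is exactly what Eagle3's training objective optimizes.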

SpecForge training flow diagram

Training‑Time Test (TTT) Support

SpecForge encapsulates the complex Training‑Time Test (TTT) mechanism, which simulates multiple generation steps to strengthen the draft model. It provides verified implementations of the specialized attention masks and recursive data loops required by TTT, ensuring correctness and performance without burdening users.
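The recursive loop at the heart of TTT can be sketched as follows. This is a simplified toy under assumed interfaces, not SpecForge's implementation: `draft_step` is a hypothetical function mapping hidden states to (next hidden states, logits), and the shifting scheme merely illustrates how each simulated step is supervised further ahead.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean token-level cross-entropy (toy, numerically naive)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def ttt_unroll(hidden, targets, draft_step, ttt_steps):
    """Unroll the draft model for several simulated generation steps.
    At step s the draft consumes its own previous hidden output and is
    supervised against targets shifted s+1 positions ahead -- the
    recursion that TTT trains through."""
    total_loss = 0.0
    h = hidden
    for s in range(ttt_steps):
        h, logits = draft_step(h)
        shifted = targets[s + 1:]
        total_loss += cross_entropy(logits[:len(shifted)], shifted)
    return total_loss / ttt_steps
```

In a real implementation each unroll step also needs the specialized attention mask the article mentions, so that simulated future positions cannot attend to tokens they would not yet have seen; that mask construction is what SpecForge verifies and hides from the user.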

Dual Training Modes: Online & Offline

SpecForge offers two modes for collecting hidden states from the base model:

Online mode: Generates data on‑the‑fly, delivering maximum speed and flexibility, ideal for rapid experiments with limited storage.

Offline mode: Pre‑computes and stores hidden states, guaranteeing reproducibility and high efficiency when ample storage is available.

Online vs Offline training comparison

Extensibility

The framework uses modular interfaces to register new draft and base models easily. It also implements multiple parallel strategies (FSDP, TP) to ensure efficient training of massive models.
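A registration interface of this kind is commonly built as a decorator-based registry. The sketch below is illustrative only; the names (`register_draft_model`, `build_draft_model`) are hypothetical and not SpecForge's actual API.

```python
# Hypothetical registry sketch -- names are illustrative, not SpecForge's API.
_DRAFT_MODELS = {}

def register_draft_model(name):
    """Decorator that makes a draft-model class discoverable by name."""
    def wrap(cls):
        _DRAFT_MODELS[name] = cls
        return cls
    return wrap

def build_draft_model(name, **kwargs):
    """Instantiate a registered draft model from a config string."""
    if name not in _DRAFT_MODELS:
        raise KeyError(f"unknown draft model: {name!r}")
    return _DRAFT_MODELS[name](**kwargs)

@register_draft_model("eagle3-llama")
class Eagle3LlamaDraft:
    def __init__(self, hidden_size=4096):
        self.hidden_size = hidden_size
```

Adding a new architecture then means writing one class and one decorator line, with no changes to the training loop, which is the extensibility property the framework advertises.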

Experiments

To validate SpecForge, the team trained Eagle3 draft models for LLaMA‑4 Scout and Maverick on the ShareGPT and UltraChat datasets (320 K samples). The models achieved strong results on industry benchmarks such as MT‑Bench, with the Maverick draft delivering a 2.18× inference speedup.

Performance plots (sweeping speculative-num-steps on the x‑axis, with SGLang's speculative-eagle-topk=8 and speculative-num-draft-tokens=10) show the optimal configurations identified via the bench_speculative script in the SGLang codebase.

Performance benchmark chart 1
Performance benchmark chart 2

Resources

All source code, including TTT and data processing, is available on GitHub:

https://github.com/sgl-project/SpecForge

Pre‑trained draft models can be downloaded from Hugging Face:

LLaMA‑4 Scout: https://huggingface.co/lmsys/sglang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-v1

LLaMA‑4 Maverick: https://huggingface.co/lmsys/sglang-EAGLE3-Llama-4-Maverick-17B-128E-Instruct-v1

Roadmap

Support more model architectures, including Kimi K2 and Qwen‑3 MoE.

Integrate vision‑language models (VLM) into SpecForge.

Further improve training efficiency with better parallel strategies and kernel optimizations.

Tags: Inference Acceleration, Open Source, Speculative Sampling, Training Framework, AI Performance
Written by AI Frontier Lectures