SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×
SpecForge is an open-source training framework built around Eagle3 that enables end-to-end speculative-sampling training for ultra-large language models. It integrates tightly with the SGLang inference engine, offers both online and offline training modes, supports advanced parallelism strategies, and delivers up to a 2.18× inference speedup in benchmarks. All code and pretrained draft models are available on GitHub and Hugging Face.
Why a New Speculative Training Framework?
Speculative decoding is the de facto method for accelerating LLM inference, but the ecosystem has lacked an end-to-end training framework that scales to trillion-parameter models and integrates tightly with the SGLang inference engine. Existing solutions either require joint pre-training (e.g., MTP) or are difficult to integrate.
Core Features
Native support for the latest open‑source architectures, including complex MoE layers and Transformer variants.
Scalable distributed training with Fully Sharded Data Parallel (FSDP) and Tensor Parallelism (TP) for efficient GPU-cluster utilization (a minimal FSDP sketch follows this list).
Memory‑efficient optimizations that reduce overhead even for models with trillions of parameters.
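To make the FSDP point concrete, here is a minimal sketch of sharding a draft model with PyTorch's FSDP API; the module used is a stand-in, not SpecForge's actual draft-model class or wrapping policy.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

def build_sharded_draft(model: torch.nn.Module) -> FSDP:
    # Each rank keeps only a shard of parameters, gradients, and optimizer state,
    # which is what keeps drafts for very large targets trainable on a GPU cluster.
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=torch.cuda.current_device(),
    )

if __name__ == "__main__":
    dist.init_process_group("nccl")  # launched via torchrun
    draft = build_sharded_draft(torch.nn.Linear(4096, 4096).cuda())  # stand-in module
```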
Eagle3 Integration
Eagle3 is a state‑of‑the‑art speculative sampling method that trains a lightweight draft model to predict the token distribution of a larger target model, achieving high acceptance rates and significant performance gains.
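To make the draft/target interaction concrete, here is a greatly simplified verification loop (greedy acceptance, batch size 1, Hugging-Face-style models that return .logits). It illustrates only the generic draft-then-verify idea; Eagle3 additionally reuses the target's hidden features and verifies a token tree rather than a single chain.

```python
import torch

def speculative_step(draft_model, target_model, prefix_ids: torch.Tensor, k: int = 4):
    """One simplified draft-then-verify step with greedy acceptance (batch size 1)."""
    # 1) Draft k tokens autoregressively with the small draft model.
    draft_ids = prefix_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2) Verify all k drafted positions with a single target forward pass.
    target_logits = target_model(draft_ids).logits
    n_prefix = prefix_ids.shape[-1]
    accepted = prefix_ids
    for i in range(k):
        target_tok = target_logits[:, n_prefix - 1 + i].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, target_tok], dim=-1)
        if not torch.equal(target_tok, draft_ids[:, n_prefix + i : n_prefix + i + 1]):
            break  # first mismatch: keep the target's token and stop accepting drafts
    return accepted
```

The higher the draft model's acceptance rate, the more of the k drafted tokens survive verification per target forward pass, which is where the speedup comes from.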
Training‑Time Test (TTT) Support
SpecForge encapsulates the complex Training‑Time Test (TTT) mechanism, which simulates multiple generation steps to strengthen the draft model. It provides verified implementations of the specialized attention masks and recursive data loops required by TTT, ensuring correctness and performance without burdening users.
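The flavor of TTT can be conveyed with a heavily simplified sketch: unroll the draft head for several simulated steps and accumulate a loss at each step. The function signature, tensor names, and loss below are hypothetical; the per-step label shifting and the specialized attention masks mentioned above are exactly the parts SpecForge implements and verifies for you.

```python
import torch
import torch.nn.functional as F

def ttt_style_loss(draft_head, target_hidden, target_next_ids, num_steps: int = 3):
    """Hypothetical sketch of a multi-step (TTT-style) training objective.

    target_hidden:   hidden states from the frozen target model  [batch, seq, dim]
    target_next_ids: tokens the draft should learn to predict    [batch, seq]
    """
    loss = 0.0
    hidden = target_hidden
    for _ in range(num_steps):
        # Each simulated step re-runs the draft head on its own previous output,
        # mimicking how the draft is queried for several tokens at inference time.
        logits, hidden = draft_head(hidden)  # hypothetical (logits, hidden) signature
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target_next_ids.reshape(-1),
        )
        # (Per-step label shifting and the TTT attention mask are omitted for brevity.)
    return loss / num_steps
```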
Dual Training Modes: Online & Offline
SpecForge offers two modes for collecting hidden states from the base model (a simplified sketch of both paths follows this list):
Online mode: Generates hidden states on the fly, delivering maximum speed and flexibility; ideal for rapid experiments with limited storage.
Offline mode: Pre-computes and stores hidden states, guaranteeing reproducibility and high efficiency when ample storage is available.
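A rough illustration of the two paths, assuming a Hugging-Face-style target model and hypothetical helper and argument names (the real SpecForge data pipeline differs in detail):

```python
import torch

@torch.no_grad()
def get_hidden_states(target_model, batch_ids, mode: str, cache_path=None):
    """Hypothetical helper contrasting online vs. offline hidden-state collection."""
    if mode == "online":
        # Online: run the frozen target model during draft training; nothing is stored.
        out = target_model(batch_ids, output_hidden_states=True)
        return out.hidden_states
    if mode == "offline":
        # Offline: load hidden states that were pre-computed once and written to disk.
        return torch.load(cache_path)
    raise ValueError(f"unknown mode: {mode}")
```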
Extensibility
The framework uses modular interfaces to register new draft and base models easily. It also implements multiple parallel strategies (FSDP, TP) to ensure efficient training of massive models.
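One common way to expose that kind of modularity is a small registry; the decorator and class below are hypothetical and only illustrate the pattern, not SpecForge's actual interface.

```python
import torch.nn as nn

DRAFT_MODELS = {}  # hypothetical registry: name -> draft-model class

def register_draft_model(name: str):
    def decorator(cls):
        DRAFT_MODELS[name] = cls
        return cls
    return decorator

@register_draft_model("my-eagle3-draft")
class MyEagle3Draft(nn.Module):
    """Placeholder draft head; a real one would mirror the target model's config."""
    def __init__(self, hidden_size: int = 4096, vocab_size: int = 128_256):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states):
        return self.proj(hidden_states)

# Training code can then construct draft models by name:
draft = DRAFT_MODELS["my-eagle3-draft"]()
```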
Experiments
To validate SpecForge, the team trained Scout and Maverick draft models for LLaMA-4 on the ShareGPT and UltraChat datasets (320K samples). The models achieved strong results on industry benchmarks such as MT-Bench, with the Maverick draft delivering a 2.18× inference speedup.
Performance plots sweep speculative-num-steps on the x-axis, with SGLang's speculative-eagle-topk=8 and speculative-num-draft-tokens=10, and show the optimal configurations identified via the bench_speculative script in the SGLang codebase.
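For reference, these settings map onto SGLang's speculative-decoding server arguments. The sketch below assumes SGLang's offline Engine API forwards the corresponding keyword arguments to its server args; the target model path is a placeholder, and the draft path is the Scout checkpoint listed under Resources.

```python
import sglang as sgl

# Assumption: sgl.Engine accepts the speculative-decoding server arguments as kwargs.
llm = sgl.Engine(
    model_path="<path-to-Llama-4-Scout-target-model>",  # placeholder target model
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="lmsys/sglang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-v1",
    speculative_num_steps=3,        # the value swept on the plots' x-axis
    speculative_eagle_topk=8,
    speculative_num_draft_tokens=10,
)
print(llm.generate("Speculative decoding speeds up inference because")["text"])
```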
Resources
All source code, including TTT and data processing, is available on GitHub:
https://github.com/sgl-project/SpecForge
Pre‑trained draft models can be downloaded from Hugging Face:
LLaMA‑4 Scout: https://huggingface.co/lmsys/sglang-EAGLE3-Llama-4-Scout-17B-16E-Instruct-v1
LLaMA‑4 Maverick: https://huggingface.co/lmsys/sglang-EAGLE3-Llama-4-Maverick-17B-128E-Instruct-v1
Roadmap
Support more model architectures, including Kimi K2 and Qwen‑3 MoE.
Integrate vision‑language models (VLM) into SpecForge.
Further improve training efficiency with better parallel strategies and kernel optimizations.