Token Superposition Training: 2.5× Faster LLM Pre‑training Without Model Changes

The article presents Token Superposition Training (TST), which temporarily averages embeddings of non‑overlapping token bags and predicts groups of tokens in a first phase before reverting to standard token‑wise prediction, achieving up to 2.5× pre‑training speedup on 10B‑1B MoE models without altering model architecture or inference.

Data Party THU

Overview

The paper introduces Token Superposition Training (TST), a method that speeds up large‑language‑model (LLM) pre‑training without modifying the model’s architecture, parallelism strategy, optimizer, tokenizer, or data.

Training Process

TST splits each input sequence of length L into non‑overlapping token bags of size s. During the Superposition Phase, the s token embeddings in each bag are averaged into a single latent representation, which is fed to the model. The model then predicts the next bag of s tokens as a group.

After a preset fraction r of the training steps (typically 0.2–0.4), training switches to the Recovery Phase, where standard per‑token prediction resumes. The loss rises briefly at the switch but quickly stabilises.
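The two‑phase schedule can be sketched as a simple step‑based switch. This is a minimal illustration, not the authors' code: the function name `tst_phase`, the rounding of the switch step, and the example values (10,000 steps, r = 0.3) are assumptions made here.

```python
def tst_phase(step, total_steps, r):
    """Return the TST phase for a given training step.

    r is the fraction of total steps spent in the Superposition Phase
    (the article cites 0.2-0.4 as typical). Names are hypothetical.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be in [0, 1]")
    # Step index at which training switches to per-token prediction.
    switch_step = round(r * total_steps)
    return "superposition" if step < switch_step else "recovery"


# Illustrative values only: 10,000 steps with r = 0.3.
print(tst_phase(step=2_999, total_steps=10_000, r=0.3))  # superposition
print(tst_phase(step=3_000, total_steps=10_000, r=0.3))  # recovery
```

Because the model, optimizer, and data pipeline are unchanged, a switch like this only needs to select how inputs are embedded and which loss is applied.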

Input‑side Tensor Folding

Averaging every s embeddings shrinks the latent sequence length to L/s. To keep per‑step FLOPs constant, the original input length is expanded s‑fold, so the model processes more raw tokens per unit of compute.

Figure: Tensor folding illustration
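As a rough sketch of the folding step, the function below averages each non‑overlapping bag of s embedding vectors. It uses plain Python lists to stay dependency‑free; a real implementation would operate on batched tensors. The function name `fold_embeddings` and the toy values are assumptions for illustration.

```python
def fold_embeddings(embeddings, s):
    """Average each non-overlapping bag of s token embeddings.

    embeddings: list of L vectors (lists of floats), L divisible by s.
    Returns L // s averaged vectors (the latent sequence).
    """
    if s <= 0 or len(embeddings) % s != 0:
        raise ValueError("sequence length must be a positive multiple of s")
    folded = []
    for start in range(0, len(embeddings), s):
        bag = embeddings[start:start + s]
        dim = len(bag[0])
        # Element-wise mean over the s vectors in this bag.
        folded.append([sum(vec[d] for vec in bag) / s for d in range(dim)])
    return folded


# Toy example: L = 4 tokens, embedding dimension 2, bag size s = 2.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
latents = fold_embeddings(tokens, s=2)
print(latents)  # [[2.0, 1.0], [1.0, 2.0]]
```

The latent sequence has L/s entries, which is why the raw input can be made s times longer without increasing the sequence length the model actually processes.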

Output‑side Multi‑Token Prediction

TST replaces the usual single next‑token cross‑entropy with a Multi‑hot Cross‑Entropy (MCE) loss that distributes the target probability uniformly across the s tokens in the bag. In its simplified form, the loss is the average of standard cross‑entropy over the tokens in the bag, which allows existing fused CE kernels to be reused without extra CUDA code.

Figure: MCE loss formulation
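The equivalence between the multi‑hot form and the averaged form can be checked numerically. The sketch below assumes the target places mass 1/s on each token in the bag (with repeated tokens accumulating mass), which matches the description above; the paper's exact formulation is not reproduced here, and the function names and toy logits are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_hot_ce(logits, bag):
    """Cross-entropy against a target with mass 1/s on each bag token."""
    logp = log_softmax(logits)
    s = len(bag)
    target = [0.0] * len(logits)
    for tok in bag:
        target[tok] += 1.0 / s  # repeated tokens accumulate mass
    return -sum(t * lp for t, lp in zip(target, logp))


def mean_ce(logits, bag):
    """Average of ordinary single-label cross-entropy over the bag."""
    logp = log_softmax(logits)
    return sum(-logp[tok] for tok in bag) / len(bag)


logits = [2.0, 0.5, -1.0, 0.0, 1.5]  # toy vocabulary of 5 tokens
bag = [0, 4, 4]                      # s = 3 target token ids
assert math.isclose(multi_hot_ce(logits, bag), mean_ce(logits, bag))
```

Since the two quantities agree, the averaged form can be computed by calling an ordinary cross‑entropy routine once per target position in the bag and taking the mean.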

Speedup Results

Experiments on 10B‑1B Mixture‑of‑Experts (MoE) models show that TST reaches baseline loss in less than 40% of the training time, corresponding to an approximate 2.5× speedup under equal loss conditions.

Figure: Speedup across model scales
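The 2.5× figure follows directly from the time fraction: reaching the same loss in 40% of the baseline time corresponds to a speedup of 1/0.4. A tiny sketch of that arithmetic, with a hypothetical helper name:

```python
def speedup_at_equal_loss(time_fraction):
    """Speedup implied by reaching baseline loss in a fraction of baseline time."""
    if not 0.0 < time_fraction <= 1.0:
        raise ValueError("time_fraction must be in (0, 1]")
    return 1.0 / time_fraction


print(f"{speedup_at_equal_loss(0.40):.2f}x")  # 2.50x
```

Because the reported time is "less than 40%", 2.5× is the approximate value at that threshold.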

Ablation Studies

Under equal compute, TST achieves lower final training loss.

Under equal loss, TST requires shorter training time.

Under equal data, TST’s per‑token compute is lower, leading to slightly weaker performance, indicating the benefit stems from increased token exposure rather than algorithmic superiority.

Varying the bag size s produces a U‑shaped loss curve, and the optimal s grows with model size (e.g., s=16 for the 10B‑1B MoE model, which reduces loss from 2.252 to 2.236).

Alternative losses such as BCE or hinge perform worse than MCE, and attempts to restore intra‑bag token order provide no stable gains.

Representation Alignment

Both input and output superposition mechanisms contribute independently; combining them yields additive gains without interference, confirming orthogonal benefits.

Resetting the embedding layer or LM head at the start of the Recovery Phase erases all accumulated gains, highlighting the necessity of maintaining representation alignment across phases.

Conclusions

TST trades additional data consumption for lower training loss at the same compute budget, making it attractive when compute is scarce but data is abundant. While the method reduces latent sequence length, it does not yet demonstrate improved long‑context capability, leaving that as future work.
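To give a sense of the data cost, the sketch below estimates how many raw tokens TST consumes relative to a baseline run with the same number of equal‑FLOP steps. It assumes each Superposition Phase step reads s times as many raw tokens as a baseline step and each Recovery Phase step reads the same amount; this accounting is inferred from the description of tensor folding, not quoted from the paper. The value r = 0.3 is an illustrative choice within the stated 0.2–0.4 range.

```python
def data_multiplier(r, s):
    """Raw tokens consumed by TST relative to a baseline with the same steps.

    Assumes (not stated by the source) that superposition steps read s times
    the baseline tokens per step and recovery steps read the baseline amount.
    """
    if not 0.0 <= r <= 1.0 or s < 1:
        raise ValueError("need 0 <= r <= 1 and s >= 1")
    return r * s + (1.0 - r)


# Illustrative: r = 0.3 with the s = 16 setting mentioned for the 10B-1B model.
print(f"{data_multiplier(r=0.3, s=16):.1f}x")  # 5.5x
```

Under these assumptions, a compute‑matched TST run would need several times more training data than the baseline, which is the trade‑off the conclusion describes.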
