
Token Superposition Training: 2.5× Faster LLM Pre‑training Without Model Changes

The article presents Token Superposition Training (TST), which temporarily averages embeddings of non‑overlapping token bags and predicts groups of tokens in a first phase before reverting to standard token‑wise prediction, achieving up to 2.5× pre‑training speedup on 10B‑1B MoE models without altering model architecture or inference.

Data Party THU

May 29, 2026


Overview

The paper introduces Token Superposition Training (TST), a method that speeds up large‑language‑model (LLM) pre‑training without changing the model architecture, parallelism strategy, optimizer, tokenizer, or training data.

Training Process

TST splits each input sequence of length L into non‑overlapping token bags of size s. During the Superposition Phase, the s token embeddings in each bag are averaged into a single latent representation, which is fed to the model in place of the individual tokens. At each latent position, the model then predicts the next s tokens as a group.
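The bagging and averaging step can be sketched in a few lines of dependency‑free Python. This is only an illustration of the description above, not the authors' implementation: the function names `superpose` and `bag_targets` are hypothetical, and requiring the sequence length to be divisible by s is an assumption, since the summary does not say how a partial final bag is handled.

```python
from typing import List

Vector = List[float]


def superpose(embeddings: List[Vector], s: int) -> List[Vector]:
    """Average each non-overlapping bag of s token embeddings into one latent vector."""
    if s <= 0:
        raise ValueError("bag size s must be positive")
    # Assumption: the sequence length is a multiple of s (no partial bags).
    if len(embeddings) % s != 0:
        raise ValueError("sequence length must be divisible by the bag size s")
    latents = []
    for start in range(0, len(embeddings), s):
        bag = embeddings[start:start + s]
        dim = len(bag[0])
        # Element-wise mean over the s vectors in the bag.
        latents.append([sum(vec[d] for vec in bag) / s for d in range(dim)])
    return latents


def bag_targets(token_ids: List[int], s: int) -> List[List[int]]:
    """Target for latent position i is the next bag of s token ids (bag i + 1)."""
    bags = [token_ids[i:i + s] for i in range(0, len(token_ids), s)]
    return bags[1:]


# Four tokens with 2-dimensional embeddings, bag size s = 2.
emb = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
latents = superpose(emb, s=2)          # two latent vectors instead of four tokens
targets = bag_targets([5, 6, 7, 8], 2)  # the first latent predicts the bag [7, 8]
```

In a real training loop the same operation would be a reshape followed by a mean over the bag axis of the embedding tensor.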

After a preset fraction r of the total training steps (typically 0.2–0.4), training switches to the Recovery Phase, where standard per‑token prediction resumes. The loss rises briefly at the switch but quickly stabilises.
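A minimal sketch of the two‑phase schedule, assuming the switch happens at a single fixed step computed from r. The helper name `tst_phase` and the rounding rule are assumptions made for illustration; the summary does not specify them.

```python
def tst_phase(step: int, total_steps: int, r: float) -> str:
    """Return which TST phase a 0-indexed training step falls into."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a fraction between 0 and 1")
    # Assumption: the switch point is r * total_steps, rounded to the nearest step.
    switch_step = round(r * total_steps)
    return "superposition" if step < switch_step else "recovery"


# With r = 0.25 and 1000 steps, steps 0..249 use bag-level training
# and steps 250..999 use standard per-token prediction.
phases = [tst_phase(step, 1000, 0.25) for step in (0, 249, 250, 999)]
```

Because the model weights carry over unchanged at the switch, only the data pipeline and the loss need to consult this schedule.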

Input‑side Tensor Folding

Averaging every s embeddings shortens the latent sequence from L to L/s. To keep per‑step FLOPs constant, the raw input length is expanded s‑fold, so the model still processes a latent sequence of length L while consuming s times as many raw tokens per unit of compute.
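The bookkeeping behind the folding can be written out as a small helper. The function `folded_shapes` and the example numbers (a latent length of 4096 with s = 16) are hypothetical and chosen only to make the arithmetic concrete.

```python
def folded_shapes(latent_len: int, s: int) -> dict:
    """Relate raw and latent sequence lengths when the input is expanded s-fold."""
    if latent_len <= 0 or s <= 0:
        raise ValueError("latent_len and s must be positive")
    raw_len = latent_len * s  # raw tokens read per sequence after the s-fold expansion
    return {
        "raw_tokens_per_sequence": raw_len,
        "latent_positions": raw_len // s,  # what the transformer actually processes
        "raw_tokens_per_latent": s,
    }


# Illustrative numbers only: the transformer still sees 4096 positions,
# but each training sequence now covers 65536 raw tokens.
shapes = folded_shapes(latent_len=4096, s=16)
```

The number of positions the transformer processes is unchanged, which is why the per‑step cost stays roughly the same while token throughput grows by a factor of s.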

Output‑side Multi‑Token Prediction

In place of the usual single next‑token cross‑entropy, TST uses a Multi‑hot Cross‑Entropy (MCE) loss whose target distributes probability uniformly across the s tokens in the bag. In its simplified form, the loss is the average of the standard cross‑entropy over the tokens in the bag, so existing fused CE kernels can be reused without extra CUDA code.
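The claim that the multi‑hot form reduces to an average of ordinary cross‑entropies can be checked numerically. The sketch below is written against plain Python lists rather than a real training framework; the function names are hypothetical, and letting a token that appears twice in a bag receive twice the target mass is an assumption, since the summary does not say how repeated tokens are treated.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def multi_hot_ce(logits: List[float], bag: List[int]) -> float:
    """Cross-entropy against a target that puts mass 1/s on each token in the bag."""
    s = len(bag)
    logp = log_softmax(logits)
    target = [0.0] * len(logits)
    for tok in bag:
        target[tok] += 1.0 / s  # assumption: repeated tokens accumulate mass
    return -sum(t * lp for t, lp in zip(target, logp) if t > 0.0)


def mean_ce(logits: List[float], bag: List[int]) -> float:
    """Average of the standard single-token cross-entropy over the bag."""
    logp = log_softmax(logits)
    return -sum(logp[tok] for tok in bag) / len(bag)


# Toy vocabulary of 4 tokens and a bag of s = 3 targets (token 3 appears twice).
logits = [2.0, 0.5, -1.0, 0.0]
bag = [0, 3, 3]
gap = abs(multi_hot_ce(logits, bag) - mean_ce(logits, bag))  # zero up to rounding
```

Since the two forms agree, a training loop can apply its existing cross‑entropy routine once per target token in the bag and average the results, which is the kernel reuse the article refers to.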

Speedup Results

Experiments on 10B‑1B Mixture‑of‑Experts (MoE) models show that TST reaches the baseline's loss in less than 40% of the baseline's training time, which corresponds to a speedup of roughly 2.5× (1/0.4) at equal loss.

Ablation Studies

Under equal compute, TST achieves lower final training loss.

Under equal loss, TST requires shorter training time.

Under equal data, TST’s per‑token compute is lower, leading to slightly weaker performance, indicating the benefit stems from increased token exposure rather than algorithmic superiority.

Varying the bag size s yields a U‑shaped performance curve, and the optimal s grows with model size (e.g., s=16 for the 10B‑1B MoE, reducing loss from 2.252 to 2.236).

Alternative losses such as BCE or hinge perform worse than MCE, and attempts to restore intra‑bag token order provide no stable gains.

Representation Alignment

Both input and output superposition mechanisms contribute independently; combining them yields additive gains without interference, confirming orthogonal benefits.

Resetting the embedding layer or LM head at the start of the Recovery Phase erases all accumulated gains, highlighting the necessity of maintaining representation alignment across phases.

Conclusions

TST trades additional data consumption for lower training loss at the same compute budget, making it attractive when compute is scarce but data is abundant. While the method reduces latent sequence length, it does not yet demonstrate improved long‑context capability, leaving that as future work.




Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
