Token Superposition Training: 2.5× Faster LLM Pre‑training Without Model Changes
This article summarizes Token Superposition Training (TST). In a first phase, TST averages the embeddings of non-overlapping token bags and predicts tokens in groups; it then reverts to standard token-wise prediction. The reported result is up to a 2.5× pre-training speedup on 10B‑1B MoE models, with no change to model architecture or inference.
Overview
The paper introduces Token Superposition Training (TST), a method that speeds up large-language-model (LLM) pre-training without modifying the model's architecture, parallelism strategy, optimizer, tokenizer, or data.
Training Process
TST splits each input sequence of length L into non-overlapping token bags of size s. During the Superposition Phase, the s token embeddings in each bag are averaged to form a single latent representation, which is fed to the model. The model then predicts the next s tokens as a group.
Once a preset fraction r of the training steps has elapsed (typically r = 0.2–0.4), training switches to the Recovery Phase, where standard per-token prediction resumes. The loss rises briefly at the switch but quickly stabilises.
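The bag averaging described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation: the function name `fold_into_bags` is hypothetical, and it assumes the sequence length is a multiple of s, since the summary does not say how a ragged tail is handled.

```python
from typing import List

Vector = List[float]


def fold_into_bags(embeddings: List[Vector], s: int) -> List[Vector]:
    """Average each run of s consecutive embeddings into one latent vector."""
    if s <= 0:
        raise ValueError("bag size s must be positive")
    if len(embeddings) % s != 0:
        # Assumption: the length is a multiple of s (tail handling is not
        # specified in the summary).
        raise ValueError("sequence length must be divisible by s")
    bags = []
    for start in range(0, len(embeddings), s):
        bag = embeddings[start:start + s]
        dim = len(bag[0])
        # Element-wise mean over the s vectors in this bag.
        bags.append([sum(vec[d] for vec in bag) / s for d in range(dim)])
    return bags


# Four toy 2-d token embeddings with bag size s = 2 give two latent vectors.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
latents = fold_into_bags(tokens, s=2)
print(latents)  # [[2.0, 1.0], [1.0, 2.0]]
```

Because the mean ignores order, the latent vector carries no information about token order within a bag, which is what the ablation on restoring intra-bag order probes.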
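The two-phase schedule amounts to a single switch point. The sketch below assumes the switch happens at a fixed step index computed from r and the total step count; the function name and the rounding rule are illustrative choices, not details taken from the paper.

```python
def phase_for_step(step: int, total_steps: int, r: float) -> str:
    """Return which phase a 0-indexed training step falls in."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a fraction between 0 and 1")
    # Assumption: the switch step is r * total_steps, rounded to an integer.
    switch_step = round(r * total_steps)
    return "superposition" if step < switch_step else "recovery"


# With r = 0.25 and 1000 steps, steps 0..249 use superposition.
print(phase_for_step(249, 1000, 0.25))  # superposition
print(phase_for_step(250, 1000, 0.25))  # recovery
```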
Input‑side Tensor Folding
Averaging s embeddings shrinks the latent sequence length to L/s. To keep per-step FLOPs constant, the raw input length is expanded s-fold, so the model processes s times as many raw tokens for the same compute.
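A little arithmetic makes the data trade-off concrete. The sketch below assumes that every step processes the same latent sequence length in both phases and that the batch size is unchanged; the latent length of 4096 is a made-up value, and both helper names are hypothetical.

```python
def raw_tokens_per_step(latent_len: int, s: int) -> int:
    """Raw tokens consumed per sequence when s tokens fold into one latent."""
    return latent_len * s


def data_multiplier(r: float, s: int) -> float:
    """Raw-token consumption relative to standard training at equal steps.

    A fraction r of the steps consumes s times the tokens; the remaining
    (1 - r) consume the usual amount.
    """
    return r * s + (1.0 - r)


print(raw_tokens_per_step(latent_len=4096, s=16))  # 65536
print(data_multiplier(r=0.25, s=16))               # 4.75
```

Under these assumptions, a run with r = 0.25 and s = 16 reads 4.75 times as many raw tokens as a standard run with the same number of steps.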
Output‑side Multi‑Token Prediction
TST replaces the single next-token cross-entropy with a Multi-hot Cross-Entropy (MCE) loss that distributes the target probability uniformly across the s tokens in the bag. In its simplified form, the loss is the average of the standard cross-entropy over the tokens in the bag, so existing fused CE kernels can be reused without extra CUDA code.
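The equivalence between the multi-hot form and the averaged form can be checked numerically. The sketch below assumes one logit vector per latent position and gives each occurrence of a token in the bag a target mass of 1/s, so a repeated token receives proportionally more; the summary does not specify how duplicates are treated, and the function names are hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a single logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def multi_hot_ce(logits: List[float], bag: List[int]) -> float:
    """Cross-entropy against a target spreading mass 1/s per bag token."""
    logp = log_softmax(logits)
    target = [0.0] * len(logits)
    for tok in bag:
        target[tok] += 1.0 / len(bag)  # assumption: duplicates accumulate
    return -sum(q * lp for q, lp in zip(target, logp))


def averaged_ce(logits: List[float], bag: List[int]) -> float:
    """Mean of the ordinary single-target cross-entropy over the bag."""
    logp = log_softmax(logits)
    return sum(-logp[tok] for tok in bag) / len(bag)


logits = [2.0, 0.5, -1.0, 0.0, 1.5]   # toy vocabulary of 5 tokens
bag = [0, 4, 4, 1]                     # s = 4 target token ids
mce = multi_hot_ce(logits, bag)
avg = averaged_ce(logits, bag)
print(math.isclose(mce, avg, abs_tol=1e-9))  # True
```

Since the two values agree, a training loop could call an ordinary cross-entropy routine once per bag token and average the results, which is the reuse of existing kernels that the text refers to.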
Speedup Results
Experiments on 10B‑1B Mixture‑of‑Experts (MoE) models show that TST reaches baseline loss in less than 40% of the training time, corresponding to an approximate 2.5× speedup under equal loss conditions.
Ablation Studies
Under equal compute, TST achieves lower final training loss.
Under equal loss, TST requires shorter training time.
Under equal data, TST spends less compute per token and performs slightly worse, which indicates that its benefit comes from seeing more tokens per unit of compute rather than from a better learning algorithm.
Varying the bag size s shows a U‑shaped performance curve; optimal s shifts larger for bigger models (e.g., s=16 for 10B‑1B MoE, reducing loss from 2.252 to 2.236).
Alternative losses such as BCE or hinge perform worse than MCE, and attempts to restore intra‑bag token order provide no stable gains.
Representation Alignment
The input-side and output-side superposition mechanisms each contribute on their own, and combining them yields additive gains without interference.
Resetting the embedding layer or LM head at the start of the Recovery Phase erases all accumulated gains, highlighting the necessity of maintaining representation alignment across phases.
Conclusions
TST trades additional data consumption for lower training loss at the same compute budget, making it attractive when compute is scarce but data is abundant. While the method reduces latent sequence length, it does not yet demonstrate improved long‑context capability, leaving that as future work.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
