Token Superposition Training Accelerates LLM Pre‑training 2.5× Without Changing Architecture

Token Superposition Training (TST) speeds up large‑language‑model pre‑training by up to 2.5× without altering the model architecture or the compute budget. It adds a superposition phase that averages token embeddings into bags and predicts groups of tokens, followed by a standard recovery phase, and is demonstrated on a 10B‑parameter MoE model and on smaller dense models.

Machine Learning Algorithms & Natural Language Processing

Standard LLM pre‑training processes a fixed‑length token sequence per step.

To expose the model to more text under the same compute, common approaches change the tokenizer or the attention mechanism, add MoE layers, or add extra prediction heads. All of these alter the model architecture, which makes it hard to isolate the source of any gain.

Token Superposition Training (TST) keeps the model architecture, parallelism strategy, optimizer, tokenizer and data unchanged, and modifies only the training forward pass during an early portion of pre‑training.

During the Superposition Phase, the input sequence is split into non‑overlapping bags of s contiguous tokens. The s embeddings in each bag are averaged into a single latent token representation, so a raw sequence of L tokens becomes a latent sequence of length L/s. To keep per‑step FLOPs constant, the raw input length is expanded by a factor of s, so the latent sequence is as long as a baseline input: the model does the same amount of computation per step while seeing s times more raw tokens.
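The bag‑averaging step can be illustrated with a small sketch. It is written in plain Python for readability and is not the authors' implementation; the function name `superpose` and the toy sizes are assumptions made for this example, and a real training loop would do the same thing with batched tensor operations.

```python
def superpose(embeddings, s):
    """Average each non-overlapping bag of s consecutive token embeddings.

    embeddings: list of equal-length float vectors, one per raw token.
    Returns one averaged vector per bag, so the sequence shrinks by a factor of s.
    """
    if s < 1 or len(embeddings) % s != 0:
        raise ValueError("sequence length must be a positive multiple of the bag size")
    latent = []
    for start in range(0, len(embeddings), s):
        bag = embeddings[start:start + s]
        # zip(*bag) walks the bag one embedding dimension at a time.
        latent.append([sum(dim) / s for dim in zip(*bag)])
    return latent


# Toy setup (hypothetical sizes): a baseline context of 4 positions and bag size 2,
# so the superposition phase reads 4 * 2 = 8 raw tokens per sequence.
baseline_len, s = 4, 2
raw = [[float(i), float(2 * i)] for i in range(baseline_len * s)]  # 8 tokens, dim 2
latent = superpose(raw, s)
print(len(raw), len(latent))  # raw token count, then latent position count
print(latent[0])              # element-wise mean of raw tokens 0 and 1
```

The latent sequence has the same length as a baseline input, which is why the per‑step cost of the model body stays the same even though s times more raw tokens are consumed.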

In the same phase the model predicts the next group of s tokens instead of a single next token. The standard cross‑entropy loss is replaced by a multi‑hot cross‑entropy (MCE) that distributes the target probability uniformly over the s labels.
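A minimal sketch of such a loss for a single latent position follows, assuming the target puts probability 1/s on each of the s labels in the next bag. The function names are illustrative and not taken from the original work; plain Python is used so the arithmetic is easy to follow.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_hot_cross_entropy(logits, bag_labels):
    """Cross-entropy against a target with mass 1/s on each of the s bag labels.

    A token id that appears more than once in the bag receives
    proportionally more target mass.
    """
    s = len(bag_labels)
    if s == 0:
        raise ValueError("bag_labels must not be empty")
    log_probs = log_softmax(logits)
    return -sum(log_probs[y] for y in bag_labels) / s


vocab_size = 4
uniform_logits = [0.0] * vocab_size
loss = multi_hot_cross_entropy(uniform_logits, [1, 3])
print(loss, math.log(vocab_size))  # the two values agree for uniform logits
```

With a bag of size 1 the expression reduces to the standard next‑token cross‑entropy, which is consistent with the Recovery Phase using the ordinary loss.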

After a preset training‑step ratio r (typically 0.2–0.4), training switches to the Recovery Phase, where the standard token‑by‑token prediction resumes. The loss briefly spikes and then stabilises.
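The phase switch amounts to a simple step‑based schedule. The sketch below shows one way it could be expressed; the function name, the returned fields and the example numbers are assumptions for illustration, not details from the original work.

```python
def tst_step_config(step, total_steps, r, baseline_len, s):
    """Return the phase and input shape for a 0-indexed training step.

    r is the fraction of training steps spent in the Superposition Phase.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    switch_step = round(r * total_steps)
    if step < switch_step:
        # Superposition: s times more raw tokens, averaged in bags of s.
        return {"phase": "superposition", "raw_len": baseline_len * s, "bag_size": s}
    # Recovery: ordinary token-by-token training on baseline-length inputs.
    return {"phase": "recovery", "raw_len": baseline_len, "bag_size": 1}


# Hypothetical run: 1000 steps, r = 0.25, baseline context 2048, bag size 16.
for step in (0, 249, 250, 999):
    print(step, tst_step_config(step, 1000, 0.25, 2048, 16))
```

In both phases the latent sequence length is raw_len / bag_size, which equals the baseline length, so the per‑step cost does not change at the switch.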

Experiments on a 10B‑parameter MoE model (10B‑A1B) show that TST reaches the baseline final loss in less than 40 % of the training time, i.e., a 2.5× speed‑up.

Scaling experiments on 270M, 600M and 3B dense models and on the 10B MoE model (using TorchTitan + FSDP) confirm the trend. For the 3B model, TST achieves a lower final loss at equal compute and reaches the same loss in less time. At equal data volume, however, each raw token receives less compute, so TST performs slightly worse under that constraint.

Hyper‑parameter sweeps reveal that the step‑ratio r is relatively stable between 0.2 and 0.4, while the optimal bag size s follows a U‑shaped curve that shifts right as model size grows; the 10B experiments use s = 16, reducing loss from 2.252 (baseline) to 2.236.

Ablations of alternative bag losses (BCE, hinge) and of designs that try to recover intra‑bag token order show that all are inferior to MCE, indicating that preserving the order of tokens within a bag is not the key to TST’s gains.

Weighted MCE, which applies a distance‑based decay to the multi‑hot targets, further lowers loss on the DCLM dataset where token mutual information decays as a power law with distance.
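The exact decay used for the weights is not given here. The sketch below assumes a power‑law weight (k + 1)^(−alpha) for the k‑th token in the bag, normalised to sum to 1; both this form and the value of alpha are assumptions chosen to mirror the power‑law decay of mutual information mentioned above.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def decay_weights(s, alpha):
    """Assumed power-law weights over bag positions, normalised to sum to 1."""
    raw = [(k + 1) ** -alpha for k in range(s)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mce(logits, bag_labels, alpha):
    """Multi-hot cross-entropy whose target mass decays with distance in the bag."""
    weights = decay_weights(len(bag_labels), alpha)
    log_probs = log_softmax(logits)
    return -sum(w * log_probs[y] for w, y in zip(weights, bag_labels))


print(decay_weights(4, 0.0))  # alpha = 0 gives uniform weights, i.e. plain MCE
print(decay_weights(4, 1.0))  # nearer tokens receive more target mass
```

Setting alpha to 0 recovers the uniform multi‑hot target, so under this assumed form the unweighted loss is a special case of the weighted one.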

Separate ablations of input‑side superposition (latent compression) and output‑side superposition (group prediction) each improve over the baseline, and their combination yields the largest gain, confirming the mechanisms are orthogonal.

Unlike many multi‑stage training pipelines, TST does not introduce adapters or alignment phases; the underlying representation remains identical across phases. Resetting the input embedding or the LM head at the start of the Recovery Phase erases all earlier gains, showing that the gains depend on the representations staying aligned across the two phases.

In conclusion, TST trades extra data consumption for lower training loss at the same compute budget, making it attractive when compute is scarce but data is abundant. The work has not yet evaluated the impact on long‑context capability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
