Boosting LLM Pre‑training 2.5× Without Architecture Changes or Extra Compute

Nous Research introduces Token Superposition Training, which groups tokens into bags, averages their embeddings, and predicts token groups without altering model architecture or adding compute, achieving up to 2.5× faster pre‑training while maintaining standard inference.

Machine Learning Algorithms & Natural Language Processing

In standard LLM pre‑training, each sequence position holds exactly one token. To expose the model to more text without extra compute, common approaches modify the tokenizer or the attention mechanism, add MoE layers, or attach extra prediction heads, all of which entangle architecture changes with throughput gains.

Nous Research proposes Token Superposition Training (TST), which leaves the model architecture, parallelism, optimizer, tokenizer and data untouched. In the first half of training, it groups s consecutive tokens into a non‑overlapping “token bag”, averages their embeddings to obtain a single latent representation, and asks the model to predict the next bag of tokens. In the second half, it switches back to ordinary next‑token prediction.
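The input side of the superposition phase can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: plain Python lists stand in for tensors, and the function names `superpose_embeddings` and `next_bag_targets`, the toy embedding table, and the token IDs are all assumptions made for the example.

```python
def superpose_embeddings(token_ids, embedding_table, s):
    """Average the embeddings of each non-overlapping bag of s tokens."""
    if len(token_ids) % s != 0:
        raise ValueError("sequence length must be a multiple of the bag size s")
    dim = len(embedding_table[0])
    latents = []
    for start in range(0, len(token_ids), s):
        bag = token_ids[start:start + s]
        # Element-wise mean of the s embedding vectors in this bag.
        mean = [sum(embedding_table[t][d] for t in bag) / s for d in range(dim)]
        latents.append(mean)
    return latents


def next_bag_targets(token_ids, s):
    """The target for latent position i is the token bag at position i + 1."""
    bags = [token_ids[i:i + s] for i in range(0, len(token_ids), s)]
    return bags[1:]


# Toy vocabulary of 4 tokens with 2-dimensional embeddings (made-up values).
embedding_table = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]]
tokens = [0, 1, 2, 3, 1, 1]

latents = superpose_embeddings(tokens, embedding_table, s=2)
targets = next_bag_targets(tokens, s=2)
print(latents)  # three latent positions for six raw tokens
print(targets)  # the last latent position has no following bag to predict
```

With s=2, six raw tokens collapse into three latent positions, and each position except the last is paired with the bag that follows it.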

During the superposition phase the latent sequence length becomes 1/s of the original, so the model processes fewer positions per step. To keep per‑step FLOPs constant, the original input length is expanded by a factor of s, allowing the model to consume s times more raw tokens under the same compute budget.
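The length bookkeeping is simple arithmetic, shown here as a small sketch. The helper name `superposition_lengths` and the baseline length of 2048 positions are assumptions for illustration; the sketch only tracks sequence positions and does not model FLOPs.

```python
def superposition_lengths(base_seq_len, s):
    """Return (raw tokens read, latent positions seen by the model) per sequence."""
    # The raw input is stretched by s so that, after bagging,
    # the model still sees base_seq_len positions.
    raw_tokens = base_seq_len * s
    latent_positions = raw_tokens // s
    return raw_tokens, latent_positions


raw, latent = superposition_lengths(base_seq_len=2048, s=16)
print(raw, latent)
```

Under these assumed numbers, each sequence covers 16 times more raw text while the transformer still runs over 2048 positions.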

The output side is adjusted similarly: the model predicts a group of s tokens instead of a single token, and the standard cross‑entropy is replaced by a multi‑hot cross‑entropy (MCE) that distributes the target probability uniformly over the s labels.
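A minimal sketch of such a multi‑hot cross‑entropy is given below, written in plain Python for readability. The function names and the example logits are assumptions. The text does not say how repeated tokens within a bag are treated; this sketch counts each occurrence separately, which is one possible reading.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, label):
    """Standard cross-entropy for a single target label."""
    return -log_softmax(logits)[label]


def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy against a target that puts mass 1/s on each of the s bag labels."""
    logp = log_softmax(logits)
    return -sum(logp[t] for t in bag) / len(bag)


logits = [2.0, 0.5, -1.0, 0.0]  # made-up scores over a 4-token vocabulary
bag = [0, 1, 3]                 # the s = 3 tokens of the next bag

mce = multi_hot_cross_entropy(logits, bag)
avg_ce = sum(cross_entropy(logits, t) for t in bag) / len(bag)
print(mce, avg_ce)  # the two values agree up to floating-point error
```

Because the target mass is uniform, the loss equals the mean of the ordinary per‑label cross‑entropies, and with s=1 it reduces to standard next‑token cross‑entropy.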

Experiments on a 10B‑parameter MoE model show that TST reaches the baseline's final loss in less than 40% of the training time, i.e., roughly a 2.5× speed‑up. Tests across 270M, 600M and 3B dense models and the 10B‑A1B MoE model (using s=16) show consistent gains: lower final loss for the same compute, shorter wall‑clock time to reach the same loss, and a U‑shaped relationship between bag size s and loss, meaning that an intermediate bag size works best.

Alternative designs such as binary cross‑entropy, hinge loss, and attempts to recover the order of tokens within a bag did not match the default MCE, which suggests that intra‑bag order information is not the source of the benefit. Weighted MCE, which accounts for the power‑law decay of token mutual information with distance, further reduces loss on the DCLM dataset.
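One way a weighted MCE could look is sketched below. The text does not give the actual weighting formula, so the form used here, a weight proportional to (k+1)^(-alpha) for the k‑th token of the target bag, is purely an assumption, as are the names `power_law_weights` and `weighted_mce` and the value of `alpha`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def power_law_weights(s, alpha):
    """Hypothetical weights that decay as a power law with position in the bag."""
    raw = [(k + 1) ** -alpha for k in range(s)]
    total = sum(raw)
    return [w / total for w in raw]  # normalised to sum to 1


def weighted_mce(logits, bag, alpha):
    """MCE where nearer tokens in the target bag receive more target mass."""
    logp = log_softmax(logits)
    weights = power_law_weights(len(bag), alpha)
    return -sum(w * logp[t] for w, t in zip(weights, bag))


logits = [2.0, 0.5, -1.0, 0.0]  # made-up scores over a 4-token vocabulary
bag = [0, 1, 3]

weights = power_law_weights(len(bag), alpha=1.0)
loss = weighted_mce(logits, bag, alpha=1.0)
uniform_loss = weighted_mce(logits, bag, alpha=0.0)  # alpha = 0 recovers plain MCE
print(weights, loss, uniform_loss)
```

Setting `alpha` to zero makes every weight 1/s, so the sketch falls back to the uniform MCE; larger values shift more of the target mass toward the first tokens of the bag.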

Ablation studies separating input‑side superposition (latent compression) from output‑side superposition (gradient signal change) show that each alone improves over the baseline and that their combination yields additive gains, confirming that the two mechanisms are orthogonal.

Crucially, TST introduces no adapters or separate alignment phase: the same embedding layer and LM head are used in both phases, so the representations learned during superposition carry over directly. Randomly re‑initialising the embedding layer or LM head at the start of the recovery phase (the switch back to next‑token prediction) erases all earlier gains, showing that the benefit depends on this shared representation.
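The two‑phase schedule can be sketched as a simple switch on the training step. The function names, the dictionary layout, and the step counts are assumptions for illustration; the default fraction of 0.5 follows the "first half" description in the text. Note that nothing about the parameters changes at the switch, only the input grouping and the loss.

```python
def tst_phase(step, total_steps, superposition_fraction=0.5):
    """Return which objective a given training step uses."""
    switch_step = int(total_steps * superposition_fraction)
    return "superposition" if step < switch_step else "next_token"


def describe_step(step, total_steps, s):
    """Summarise the input grouping and loss for a step; weights are shared throughout."""
    phase = tst_phase(step, total_steps)
    if phase == "superposition":
        return {"phase": phase, "tokens_per_position": s, "loss": "multi_hot_cross_entropy"}
    return {"phase": phase, "tokens_per_position": 1, "loss": "cross_entropy"}


early = describe_step(step=100, total_steps=1000, s=16)
late = describe_step(step=900, total_steps=1000, s=16)
print(early)
print(late)
```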

Overall, TST trades extra data consumption for lower training loss at fixed compute, making it attractive when compute is scarce but data is plentiful. Autoregressive inference is unchanged, and no new kernels are required, because MCE can be computed by averaging standard per‑token CE losses over the bag.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Machine Learning Algorithms & Natural Language Processing
