Can LLMs Predict Multiple Tokens at Once? A Deep Dive into Multi‑Token Generation
This article examines whether autoregressive large language models can generate several tokens in a single inference step. It describes a mask‑based multi‑token prediction framework and gated LoRA adaptation, reports experimental results on Tulu‑3‑8B showing up to 5.2× speedup, and discusses implications for future research.
Background and Motivation
Recent advances in large language models (LLMs) are driven by massive text corpora and the effectiveness of autoregressive training, in which the model learns to predict each token from the tokens that precede it. This paradigm trains efficiently because all positions in a sequence are predicted in parallel, but inference remains sequential and computationally expensive: each new token requires a full forward pass through the model. Humans, by contrast, often plan at the sentence level before emitting words.
The authors ask whether LLMs can break this sequential bottleneck and generate multiple tokens in a single inference step, akin to a "time‑jumping" capability.
Proposed Multi‑Token Prediction Framework
Inspired by work from Apple researchers, the paper introduces a framework that enables pretrained autoregressive LLMs to perform multi‑token prediction with minimal changes to the existing training and inference pipeline. The key components are:
Mask Tokens: Special tokens m1, …, mk appended to the input sequence. Their embeddings are randomly initialized and added to the model’s embedding table.
Next Token Prediction (NTP) and Mask Token Prediction (MTP): Standard next‑token prediction (NTP) is retained, while the model is additionally trained to predict the future tokens at the mask positions (MTP).
Gated LoRA Adaptation: During fine‑tuning, only the LoRA parameters and a lightweight sampler head are updated; the original decoder weights stay frozen. A binary mask gates the LoRA path so that it acts on mask‑token (MTP) positions but not on ordinary (NTP) positions, which preserves the original generation quality.
Sampler Head: A two‑layer perceptron that, at each step, conditions on the previously sampled token and the model’s latent representation to predict the next token, so that the tokens generated in parallel at the mask positions stay coherent with one another.
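The gating idea can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the binary mask acts as a per‑position gate on the LoRA branch of a linear layer, and the function names (`matvec`, `gated_lora_linear`) and the toy shapes are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def gated_lora_linear(
    x: List[Vector],  # one hidden vector per sequence position
    w: Matrix,        # frozen base weight, shape (d_out, d_in)
    a: Matrix,        # LoRA down-projection, shape (r, d_in)
    b: Matrix,        # LoRA up-projection, shape (d_out, r)
    gate: List[int],  # assumed gate: 1 at mask-token positions, 0 elsewhere
) -> List[Vector]:
    """Compute y_t = W x_t + gate_t * B (A x_t) for every position t."""
    out = []
    for x_t, g_t in zip(x, gate):
        base = matvec(w, x_t)               # frozen path, always active
        delta = matvec(b, matvec(a, x_t))   # low-rank LoRA update
        out.append([y + g_t * d for y, d in zip(base, delta)])
    return out


# Toy example: position 0 is an ordinary token, position 1 is a mask token.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
b = [[0.5], [0.5]]
x = [[1.0, 2.0], [3.0, 4.0]]
y = gated_lora_linear(x, w, a, b, gate=[0, 1])
```

With the gate at 0, the output at an ordinary position equals the frozen model's output exactly, which is one way to see why next‑token quality would be unaffected under this reading.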
Figure 1 illustrates the overall architecture, showing how the extended sequence Xm = [x1,…,xn,m1,…,mk] is processed.
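As a rough sketch of how the extended sequence Xm might be built, the snippet below appends k mask‑token ids to a tokenized prompt and returns a matching binary gate. The helper name `append_mask_tokens` and the choice to place the new ids directly after the original vocabulary are assumptions made for illustration; the summary does not specify how the ids are assigned.

```python
from typing import List, Tuple


def append_mask_tokens(
    tokens: List[int], k: int, vocab_size: int
) -> Tuple[List[int], List[int]]:
    """Return (extended_ids, gate) for Xm = [x1, ..., xn, m1, ..., mk].

    Assumption: mask token i gets id vocab_size + i, i.e. the embedding
    table is extended by k new rows after the original vocabulary.
    """
    mask_ids = [vocab_size + i for i in range(k)]
    extended = list(tokens) + mask_ids
    # gate is 0 at ordinary (NTP) positions and 1 at mask (MTP) positions
    gate = [0] * len(tokens) + [1] * k
    return extended, gate


extended, gate = append_mask_tokens([5, 9, 2], k=2, vocab_size=100)
```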
Experimental Setup
The authors fine‑tuned the Tulu‑3‑8B model (a Llama‑3‑based LLM) using supervised fine‑tuning (SFT) with the proposed multi‑token method. They evaluated generation quality on the ARC‑Challenge benchmark using the LM Evaluation Harness library and measured inference speed using a self‑speculative decoding algorithm, in which the model verifies its own multi‑token drafts.
Speedup is quantified by the acceptance rate (average number of tokens accepted per decoding step). The theoretical minimum is 1 (standard next‑token prediction) and the maximum is k+1 = 9 when eight mask tokens are used.
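The bounds above follow from how tokens are counted in each step. The sketch below is a simplified illustration that assumes greedy verification by exact match; the function names `count_accepted` and `acceptance_rate` are hypothetical and are not taken from the paper.

```python
from typing import List


def count_accepted(draft: List[int], verified: List[int]) -> int:
    """Tokens gained in one decoding step.

    The ordinary next-token prediction is always kept (hence the floor
    of 1). Draft tokens from the mask positions are then accepted left
    to right until the first one that disagrees with the model's own
    verified prediction.
    """
    accepted = 1
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted += 1
    return accepted


def acceptance_rate(tokens_per_step: List[int]) -> float:
    """Average number of tokens accepted per decoding step."""
    return sum(tokens_per_step) / len(tokens_per_step)


k = 8  # number of mask tokens, as in the experiments described above
worst = count_accepted([1] * k, [2] * k)  # every draft token rejected
best = count_accepted([1] * k, [1] * k)   # every draft token accepted
partial = count_accepted([7, 8, 9], [7, 8, 1])
rate = acceptance_rate([worst, best, partial, partial])
```

Here `worst` is 1 and `best` is k + 1 = 9, matching the theoretical range, and a step whose first two draft tokens are verified yields 3 tokens.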
Results
Generation Quality: Gated LoRA maintains zero‑shot accuracy on ARC‑Challenge, as shown in Figure 2, while standard LoRA causes a gradual increase in NTP loss, indicating quality degradation.
Inference Acceleration: Across five task domains (knowledge QA, mathematics, programming, dialogue, safety), the multi‑token method achieves 1.5× to 5.2× speedup (Table 1). Programming and math tasks benefit the most due to higher predictability of future tokens.
Ablation Study: The best configuration combines three components: (1) the sampler MLP head, (2) a latent consistency matching (LCM) loss during training, and (3) quadratic decoding at generation time. Removing any component reduces speedup, as illustrated in Figure 7.
The impact of LoRA rank on both acceleration and memory consumption is visualized in Figure 8, showing diminishing returns beyond a certain rank.
Conclusion and Future Directions
The study demonstrates that autoregressive LLMs can be adapted to predict multiple tokens per inference step, with up to 5.2× speedup and no loss of generation quality, thanks to mask tokens, gated LoRA, and a lightweight sampler head. Future work may explore integrating this approach during pre‑training or downstream adaptation. Another direction is diffusion‑based generation: multi‑token prediction sits between fully autoregressive and fully diffusion‑based methods, so combining ideas from both is worth investigating.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
