Can LLMs Predict Multiple Tokens at Once? A Deep Dive into Multi‑Token Generation

This article examines whether autoregressive large language models can generate several tokens in a single inference step. It describes a mask‑based multi‑token prediction framework with gated LoRA adaptation, reports experimental results on Tulu‑3‑8B showing up to 5.2× speedup, and discusses implications for future research.

Data Party THU

Background and Motivation

Recent advances in large language models (LLMs) are driven by massive text corpora and the effectiveness of autoregressive training, in which the model learns to predict each token from the tokens that precede it. This paradigm parallelizes well during training, but inference remains sequential and computationally expensive because every generated token requires a full forward pass through the model. Humans, by contrast, often plan at the sentence level before emitting words.

The authors ask whether LLMs can break this sequential bottleneck and generate multiple tokens in a single inference step, akin to a "time‑jumping" capability.

Proposed Multi‑Token Prediction Framework

The paper, from researchers at Apple, introduces a framework that enables pretrained autoregressive LLMs to perform multi‑token prediction with minimal changes to the existing training and inference pipeline. The key components are:

Mask Tokens : Special tokens appended to the input sequence (e.g., m1, …, mk) whose embeddings are randomly initialized and added to the model’s embedding table.

Next Token Prediction (NTP) and Mask Token Prediction (MTP) : Standard next‑token prediction (NTP) is retained at ordinary positions, while at each mask position the model is trained to predict the corresponding future token (MTP).

Gated LoRA Adaptation : During fine‑tuning only the LoRA parameters and a lightweight sampler head are updated; the original decoder weights stay frozen. A binary gate distinguishes NTP positions from MTP positions so that the adapter affects only the mask‑token computation, which preserves the original model's next‑token generation quality.

Sampler Head : A two‑layer perceptron that, at each step, conditions on previously sampled tokens and the model’s latent representation to predict the next token, enabling parallel generation of the masked tokens.
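To make the gating idea concrete, here is a minimal sketch in plain Python of a single linear layer with a gated low‑rank update. It assumes the gate is 0 at ordinary (NTP) positions and 1 at mask‑token (MTP) positions, which is one plausible reading of the description above; the function names, shapes, and the `scale` parameter are illustrative and not taken from the paper.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_lora_linear(W, A, B, x, gate, scale=1.0):
    """Compute W x + gate * scale * B (A x) for one token position.

    W: frozen base weight (d_out x d_in)
    A: LoRA down-projection (r x d_in)
    B: LoRA up-projection (d_out x r)
    gate: assumed 0 at an NTP position, 1 at a mask-token (MTP) position
    """
    base = matvec(W, x)
    if gate == 0:
        return base  # identical to the frozen model's output
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 identity weight, for illustration
A = [[1.0, 1.0]]              # rank-1 adapter, down-projection
B = [[0.5], [-0.5]]           # rank-1 adapter, up-projection
x = [2.0, 3.0]

ntp_out = gated_lora_linear(W, A, B, x, gate=0)  # adapter switched off
mtp_out = gated_lora_linear(W, A, B, x, gate=1)  # adapter switched on
```

With the gate at 0 the layer returns exactly the frozen model's output, which is why, under this assumption, next‑token behavior cannot drift during fine‑tuning.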

Figure 1 illustrates the overall architecture, showing how the extended sequence Xm = [x1,…,xn,m1,…,mk] is processed.
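The construction of the extended sequence can be sketched as follows. The token ids, the `first_mask_id` value, and the helper name are hypothetical; the only point illustrated is that k mask ids are appended after the n prompt ids and that a parallel 0/1 vector records which positions are masks.

```python
def extend_with_masks(token_ids, k, first_mask_id):
    """Return [x1..xn, m1..mk] plus a 0/1 flag per position (1 = mask token).

    first_mask_id is a hypothetical id for m1; m2..mk are assumed to take
    the following consecutive ids in the extended embedding table.
    """
    mask_ids = [first_mask_id + i for i in range(k)]
    extended = list(token_ids) + mask_ids
    is_mask = [0] * len(token_ids) + [1] * k
    return extended, is_mask


prompt = [101, 7592, 2088]  # made-up prompt token ids
extended, is_mask = extend_with_masks(prompt, k=3, first_mask_id=50000)
```

The `is_mask` vector is the kind of per‑position flag a gated adapter could consume to tell NTP positions from MTP positions.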

Figure 1: MTP model architecture

Experimental Setup

The authors fine‑tuned the Tulu‑3‑8B model (a LLaMA‑3‑based LLM) with supervised fine‑tuning (SFT) using the proposed multi‑token method. They evaluated generation quality on the ARC‑Challenge benchmark via the LM Evaluation Harness library and measured inference speed using a self‑speculative decoding algorithm, in which the same model both drafts the future tokens and verifies them.
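The verification step of speculative decoding can be illustrated with a short sketch. It assumes greedy, exact‑match verification: drafted tokens are kept up to the first position where they disagree with what the model itself would have produced. The article does not spell out the paper's exact acceptance rule, so the function below is an illustration rather than the authors' algorithm.

```python
def accept_prefix(draft_tokens, verified_tokens):
    """Count how many leading draft tokens match the verifier's tokens."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            break  # everything after the first mismatch is discarded
        accepted += 1
    return accepted


draft = [5, 9, 2, 7]     # tokens speculated from the mask positions
verified = [5, 9, 4, 7]  # tokens the model produces when checking them

n_accepted = accept_prefix(draft, verified)
tokens_this_step = 1 + n_accepted  # the ordinary next token plus accepted drafts
```

In this example the third draft token is rejected, so the fourth is dropped as well even though it happens to match.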

Speedup is quantified by the acceptance rate, defined as the average number of tokens accepted per decoding step. Its theoretical minimum is 1 (equivalent to standard next‑token prediction, where no speculated token is accepted) and its maximum is k+1, which equals 9 when k = 8 mask tokens are used.
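As a small worked example of this metric, the snippet below averages the number of tokens emitted per decoding step and checks the bounds for k = 8. The per‑step counts are invented for illustration and are not results from the paper.

```python
def acceptance_rate(tokens_per_step):
    """Average number of tokens accepted per decoding step."""
    return sum(tokens_per_step) / len(tokens_per_step)


k = 8                    # number of mask tokens
steps = [1, 3, 9, 2, 5]  # made-up counts: tokens emitted at each step

# Every step emits at least 1 token and at most k + 1 tokens.
within_bounds = all(1 <= t <= k + 1 for t in steps)
rate = acceptance_rate(steps)  # 20 tokens over 5 steps
```

Here 20 tokens are produced in 5 model calls, so the acceptance rate is 4.0, meaning a quarter as many decoding steps as plain next‑token decoding would need for the same output.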

Results

Generation Quality : Gated LoRA maintains zero‑shot accuracy on ARC‑Challenge, as shown in Figure 2, while standard LoRA causes a gradual increase in NTP loss, indicating quality degradation.

Figure 2: Generation quality comparison

Inference Acceleration : Across five task domains (knowledge QA, mathematics, programming, dialogue, safety), the multi‑token method achieves 1.5× to 5.2× speedup (Table 1). Programming and math tasks benefit the most due to higher predictability of future tokens.

Table 1: Acceleration across domains

Ablation Study : The best configuration combines three components: (1) the sampler MLP head, (2) a latent consistency matching (LCM) loss during training, and (3) quadratic decoding at generation time. Removing any one component reduces the speedup, as illustrated in Figure 7.

Figure 7: Ablation results

The impact of LoRA rank on both acceleration and memory consumption is visualized in Figure 8, showing diminishing returns beyond a certain rank.

Figure 8: LoRA rank effects

Conclusion and Future Directions

The study demonstrates that autoregressive LLMs can be adapted to predict multiple tokens per inference step with up to 5.2× speedup while preserving generation quality, thanks to mask tokens, gated LoRA, and a lightweight sampler head. Future work may explore integrating this approach during pre‑training or downstream adaptation. Another direction is the connection to diffusion‑based language generation, since multi‑token prediction sits between fully autoregressive and fully diffusion‑based methods.
