Artificial Intelligence 20 min read

How Multi-Token Prediction Boosts LLM Training and Inference Efficiency

This article reviews the evolution of Multi‑Token Prediction (MTP) techniques—from early blockwise parallel decoding to Meta's and DeepSeek's implementations—explaining their architectures, training and inference workflows, and the speed‑up gains they offer for large language models.

Baobao Algorithm Notes

Jan 15, 2025

How Multi-Token Prediction Boosts LLM Training and Inference Efficiency

0. Introduction

Recently I organized DeepSeek’s technical line and extended my reading on Multi‑Token Prediction (MTP). By examining three key papers, I summarize the origins of MTP, compare industry explorations, and add my own understanding of how MTP improves large language model (LLM) training and inference.

1. Why MTP?

Current dominant LLMs use a decoder‑only architecture that generates tokens one‑by‑one, causing frequent KV‑cache accesses and memory‑bound bottlenecks during both training and inference. MTP aims to replace the token‑by‑token generation with multi‑token generation, thereby increasing sample utilization and accelerating both phases.

2. Explorations of MTP

2.1 Blockwise Parallel Decoding

Google’s 2018 NeurIPS paper introduced a parallel decoding scheme that predicts k tokens simultaneously. The backbone is a pretrained decoder‑only Transformer; each of the k heads predicts a specific future token (next, next‑next, …). Each head contains a shared FFN for widening the logit dimension ( h → 4h) followed by a head‑specific FFN that restores the dimension ( 4h → h) and adds a residual connection to the original logit.

The inference proceeds in three stages:

Predict: All k heads generate k tokens in one forward pass.

Verify: The original sequence is concatenated with the k new tokens to form Pair<sequence_input, label> batches, which are fed to Head_1 for validation.

Accept: The longest prefix of tokens whose Head_1 predictions match the labels is accepted.

Assuming perfect agreement among auxiliary heads, the original m -step generation reduces to 2m/k steps, yielding a 1× speed‑up when k=4. Overlapping the verify stage of step n with the predict stage of step n+1 further reduces steps to 1 + m/k, achieving up to 3× acceleration for k=4.

2.2 Meta’s MTP

Meta’s April 2024 paper extends the idea by adding four parallel prediction heads on top of a shared Transformer backbone. During training, each head predicts tokens t_{i+1}, t_{i+2}, …, t_{i+4}, forcing the model to learn longer dependencies and improving sample efficiency. In inference, the same three‑stage predict‑verify‑accept pipeline can be applied.

Key architectural differences from Blockwise Parallel Decoding:

Each head uses a full Transformer layer (self‑attention + two FFNs) rather than a simple FFN pair.

Training loss is computed for all heads in parallel, accelerating convergence.

3. DeepSeek’s MTP

DeepSeek V3 adopts a similar multi‑head design but introduces a causal chain linking modules and residual connections in the embedding layer.

3.1 Module Details

For each depth D, the model predicts D future tokens. The processing steps for a given head k are:

Normalize the previous layer’s hidden state h_i^{k-1} with RMSNorm.

Normalize the embedding of token t_{i+k}.

Concatenate the two normalized vectors and linearly project with matrix M_k ∈ ℝ^{d×2d} to obtain h_i^{’k}.

Feed h_i^{’k} through a Transformer layer to produce h_i^{k}.

Project h_i^{k} with a shared output matrix OutHead ∈ ℝ^{V×d} and apply softmax to get token probabilities for position i+k.

The training loss is the sum of cross‑entropy losses over all heads, with label indices ranging from 2+k to T+1 (the sequence length plus an EOS token).

3.2 Training

Each head’s loss is computed independently, allowing parallel gradient updates and higher sample utilization, which speeds up convergence.

3.3 Inference

Two inference strategies are described:

Method 1: Remove all MTP heads, reverting to standard next‑token prediction (no speed‑up).

Method 2: Keep the MTP heads and perform self‑speculative decoding using the same three‑stage pipeline as in Blockwise Parallel Decoding, achieving multi‑token generation at inference time.

During inference, free‑running mode replaces teacher forcing: the model feeds its own predicted token back as input for the next step.

4. Summary

The DeepSeek‑V3 MTP design builds on earlier blockwise parallel decoding and Meta’s multi‑head approach, adding causal connections and residual embeddings to improve both training efficiency and inference speed. The article provides detailed architectural diagrams, formula explanations, and practical considerations for implementing MTP in large language models.

5. References

https://arxiv.org/pdf/2412.19437

https://proceedings.neurips.cc/paper_files/paper/2018/file/c4127b9194fe8562c64dc0f5bf2c93bc-Paper.pdf

https://arxiv.org/pdf/2404.19737

https://arxiv.org/pdf/2401.15077

Blockwise Parallel Decoding network diagram

Blockwise Parallel Decoding network diagram (author version)

Meta's MTP network diagram (author version)

MTP multi‑head training sample construction

DeepSeek MTP implementation (author version)

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM Transformer inference acceleration DeepSeek MTP multi-token prediction

Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.