How Token‑Shuffle Enables 2048×2048 Autoregressive Image Generation
The article analyzes the Token‑Shuffle method, which reduces visual token redundancy to allow high‑resolution (2048×2048) autoregressive image generation, detailing its architecture, training pipeline, experimental results, efficiency gains, and comparisons with diffusion and other AR models.
Background
Autoregressive (AR) image generation lags behind diffusion models because the number of image tokens grows quadratically with the image's side length, making training and inference inefficient at high resolution. Existing multimodal large language models (MLLMs) use discrete visual tokens, which further inflate token counts and cap the resolution they can generate.
Token‑Shuffle Concept
Token‑Shuffle addresses the token‑count bottleneck by shuffling local groups of visual tokens along the channel dimension, inspired by pixel‑shuffle in super‑resolution. This operation merges neighboring tokens, dramatically reducing the total number of visual tokens while preserving fine‑grained information.
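The shuffle/unshuffle pair can be sketched as a reshape-and-transpose over the token grid, in the spirit of pixel-shuffle. The sketch below is illustrative, not the paper's code; the row-major token layout is an assumption.

```python
import numpy as np

def token_shuffle(tokens, h, w, s=2):
    """Merge each s x s window of visual tokens along the channel axis.

    tokens: (h*w, d) array of token embeddings, laid out row-major over
    an h x w grid. Returns (h//s * w//s, d*s*s) merged tokens.
    """
    n, d = tokens.shape
    assert n == h * w and h % s == 0 and w % s == 0
    x = tokens.reshape(h // s, s, w // s, s, d)       # split grid into windows
    x = x.transpose(0, 2, 1, 3, 4)                    # (h/s, w/s, s, s, d)
    return x.reshape((h // s) * (w // s), s * s * d)  # fuse each window into channels

def token_unshuffle(merged, h, w, s=2):
    """Inverse operation: restore the original (h*w, d) token layout."""
    m, ds2 = merged.shape
    d = ds2 // (s * s)
    x = merged.reshape(h // s, w // s, s, s, d)
    x = x.transpose(0, 2, 1, 3, 4)                    # (h/s, s, w/s, s, d)
    return x.reshape(h * w, d)

# Round trip preserves every token exactly: 16 tokens -> 4 merged tokens -> 16.
t = np.arange(16 * 4, dtype=np.float32).reshape(16, 4)  # 4x4 grid, d = 4
m = token_shuffle(t, 4, 4)
assert m.shape == (4, 16)
assert np.array_equal(token_unshuffle(m, 4, 4), t)
```

Because the merge is a pure rearrangement, no information is discarded; the reduction in sequence length is traded for a wider per-token channel dimension.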
Architecture Details
Visual tokens are first appended to the LLM vocabulary, creating a multimodal token set. A linear projection reduces the embedding dimension of visual tokens, then a shuffle window (size s = 2) merges tokens, cutting the token count by a factor of s². Token‑Unshuffle restores the original token layout after Transformer processing. Residual MLP blocks are added before and after shuffling to refine features.
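The dimension bookkeeping explains why the projection is needed: compressing each token to d / s² channels before merging means the fused token lands back at the LLM hidden size. A minimal sketch, assuming a 64×64 token grid (the grid size is illustrative, not from the paper):

```python
def shuffled_dims(d=3072, s=2, grid=64):
    """Dimension bookkeeping for Token-Shuffle.

    A linear layer first compresses each token from d to d // s**2 channels,
    then an s x s window of compressed tokens is concatenated back to d, so
    the merged token matches the LLM hidden size while the visual sequence
    shrinks by a factor of s**2.
    """
    d_compressed = d // (s * s)          # per-token dim after the projection
    d_merged = d_compressed * s * s      # dim after concatenating the window
    n_merged = (grid * grid) // (s * s)  # sequence length after shuffling
    return d_compressed, d_merged, n_merged

# With hidden size 3072 and s = 2: 3072 -> 768 per token, 4 tokens fuse
# back to 3072 channels, and a 64x64 grid becomes 1024 merged tokens.
assert shuffled_dims() == (768, 3072, 1024)
```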
Experimental Setup
All experiments use a 2.7 B Llama model (hidden size 3072) with 20 autoregressive Transformer blocks. Training proceeds in three stages: (1) 512×512 images without Token‑Shuffle (≈50 B tokens), (2) 1024×1024 images with Token‑Shuffle (≈2 T tokens), and (3) 2048×2048 images (≈300 B tokens). A z‑loss stabilizes training at the highest resolution.
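The z-loss is a standard auxiliary logit regularizer (popularized by PaLM-style training) that penalizes the squared log-partition of the output distribution so logits stay bounded. A minimal sketch; the coefficient 1e-4 is an assumed, typical value, not taken from the paper:

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    """Auxiliary z-loss: coeff * (log Z)^2, where Z is the softmax partition.

    Keeps the log-partition near zero, which prevents logit drift and
    stabilizes mixed-precision training. `coeff=1e-4` is an assumption.
    """
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())  # numerically stable logsumexp
    return coeff * log_z ** 2
```

It is added to the usual cross-entropy objective with a small weight, so it nudges the model without changing the argmax prediction.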
Results
On the GenAI‑Bench benchmark, Token‑Shuffle achieves a VQAScore of 0.77 on hard prompts, outperforming LlamaGen by 0.18 and diffusion‑based LDM by 0.15. On GenEval, it reaches an overall score of 0.62, indicating strong generation quality for a pure AR model. Human evaluation shows superior text‑image alignment and visual fidelity compared with AR baselines, and competitive performance against diffusion models.
Efficiency Gains
With a shuffle window of s = 2, training FLOPs and visual token count drop by roughly a factor of s² = 4, and inference time scales linearly with token count thanks to the KV‑cache, making high‑resolution AR generation feasible.
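The sequence-length arithmetic can be made concrete. The 16× tokenizer downsampling below is an assumption for illustration (the paper's exact tokenizer stride may differ):

```python
def visual_tokens(res, patch=16, s=2):
    """Visual tokens entering the transformer for a res x res image.

    Assumes a tokenizer that downsamples by `patch` (16x here, an assumed
    value) and a Token-Shuffle window of size s.
    """
    grid = res // patch
    return (grid * grid) // (s * s)

# At 2048x2048, an assumed 16x tokenizer yields a 128x128 grid = 16384
# tokens; a shuffle window of s = 2 cuts that to 4096 before the transformer.
assert visual_tokens(2048, s=1) == 16384
assert visual_tokens(2048) == 4096
```

Since self-attention cost grows quadratically in sequence length, a 4× shorter sequence cuts attention FLOPs by far more than 4×, which is what makes the 2048×2048 setting tractable.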
Visual Comparison
Qualitative results show that Token‑Shuffle produces high‑resolution images that better follow textual prompts than LlamaGen, while remaining competitive with diffusion models such as LDM and Pixart‑LCM.