MIGM-Shortcut: Learning Controlled Latent Dynamics to Speed Up Masked Image Generation
The paper introduces MIGM-Shortcut, a self‑supervised method that learns controlled latent‑state dynamics to bypass redundant bidirectional attention in Masked Image Generation Models, achieving over 4× speed‑up on state‑of‑the‑art multimodal diffusion models like Lumina‑DiMOO while preserving image quality.
Background and Motivation
The Masked Image Generation Model (MIGM) paradigm enables a flexible generation order compared with autoregressive and continuous diffusion models, but its multi‑step bidirectional attention incurs heavy computational redundancy, limiting achievable speed gains. Prior cache‑based accelerations such as ReCAP (NeurIPS 2025) and the native ML‑Cache in Lumina‑DiMOO reach at most roughly 2× speed‑up.
Work on continuous diffusion shows that network feature trajectories are often smooth, so past features can approximate the next step. MIGM, however, starts from a fully masked sequence, and sampling randomness is injected at every step; without knowing which tokens were just sampled, the feature trajectory cannot be predicted accurately.
MIGM‑Shortcut Architecture
The final‑layer feature of the base MIGM is treated as a hidden state. A lightweight shortcut model receives the current step’s feature and the newly sampled token (with positional encoding) and predicts the next step’s feature, bypassing the expensive base model.
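To make the bypass concrete, here is a minimal sketch of how a shortcut model could interleave with the base model during sampling. All functions (`base_model`, `shortcut_model`, `sample_tokens`) and the refresh schedule are illustrative stand‑ins, not the paper's actual API or configuration:

```python
# Hedged sketch: interleaving cheap shortcut steps with occasional full
# base-model passes. Everything here is a hypothetical stand-in.
def base_model(feature):
    # Stub for the full bidirectional-attention pass (expensive in practice).
    return [x * 0.9 for x in feature]

def shortcut_model(feature, new_tokens):
    # Stub for the lightweight predictor of the next-step feature.
    return [x + 0.01 * sum(new_tokens) for x in feature]

def sample_tokens(feature, k):
    # Stub: sample k new token ids from the current feature's logits.
    return list(range(k))

def generate(num_steps=8, refresh_every=4, seq_len=4):
    """Run the base model periodically; use the shortcut in between."""
    feature = [0.0] * seq_len            # fully masked start state
    base_calls = 0
    for step in range(num_steps):
        if step % refresh_every == 0:    # occasional full pass limits drift
            feature = base_model(feature)
            base_calls += 1
        else:                            # cheap shortcut step
            new_tokens = sample_tokens(feature, k=2)
            feature = shortcut_model(feature, new_tokens)
    return feature, base_calls

feature, base_calls = generate()
print(base_calls)  # → 2 of 8 steps invoke the expensive base model
```

The speed‑up then comes from how rarely the expensive pass is invoked; the actual schedule used by MIGM‑Shortcut is not detailed in this summary.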
Design details:
One cross‑attention layer where key/value come from the new token.
One self‑attention layer on the updated feature.
A bottleneck projection reduces the input dimension before projecting back to the original feature space, based on the assumption that updates are low‑rank because they are driven by a small number of new tokens.
Training uses a simple mean‑squared error loss between predicted and ground‑truth features.
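The design details above can be sketched end to end. The dimensions, single‑head attention, and random weights below are illustrative assumptions standing in for the learned parameters; only the structure (bottleneck → cross‑attention on new tokens → self‑attention → project back, trained with MSE) follows the description:

```python
# Minimal NumPy sketch of the shortcut update. Dimensions, initialization,
# and single-head attention are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, D_bot, L, K = 64, 16, 8, 2   # feature dim, bottleneck dim, seq len, new tokens

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention, single head.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

# Random projections stand in for learned parameters.
W_down = rng.normal(0, 0.1, (D, D_bot))    # bottleneck: project down...
W_up   = rng.normal(0, 0.1, (D_bot, D))    # ...then back up (low-rank update)
W_q,  W_k,  W_v  = (rng.normal(0, 0.1, (D_bot, D_bot)) for _ in range(3))
W_qs, W_ks, W_vs = (rng.normal(0, 0.1, (D_bot, D_bot)) for _ in range(3))

def shortcut_step(h, new_tok):
    """Predict the next-step feature from the current feature h (L, D)
    and newly sampled token embeddings new_tok (K, D)."""
    z, t = h @ W_down, new_tok @ W_down            # bottleneck projection
    # Cross-attention: queries from the feature, keys/values from new tokens.
    z = z + attention(z @ W_q, t @ W_k, t @ W_v)
    # Self-attention on the updated feature.
    z = z + attention(z @ W_qs, z @ W_ks, z @ W_vs)
    return h + z @ W_up                            # residual low-rank update

h_t     = rng.normal(size=(L, D))   # current-step final-layer feature
new_tok = rng.normal(size=(K, D))   # embeddings of newly sampled tokens
h_pred  = shortcut_step(h_t, new_tok)

# Training target: the base model's true next-step feature (random here),
# supervised with mean-squared error.
h_true = rng.normal(size=(L, D))
mse = np.mean((h_pred - h_true) ** 2)
print(h_pred.shape, mse)
```

The bottleneck keeps the shortcut cheap: since only a few tokens change per step, the residual update can live in a low‑dimensional subspace without losing much expressiveness.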
Experiments
MaskGIT
Shortcut model size: ~8.6 M parameters (1/20 of the base).
Trained for 5 hours on four NVIDIA H200 GPUs.
Evaluation on ImageNet‑512 (FID) shows that the shortcut not only speeds up inference but, when combined with a larger total number of generation steps, can further improve image quality.
Lumina‑DiMOO (a state‑of‑the‑art multimodal diffusion model)
Shortcut model size: ~220 M parameters (1/37 of the base).
Trained for 12 hours on four NVIDIA H200 GPUs.
Achieves >4× speed‑up on text‑to‑image generation while preserving quality metrics: ImageReward, CLIPScore, UniPercept‑IQA, and human preference studies.
Ablation Studies
Importance of sampling information: Replacing cross‑attention with self‑attention (removing the token input) causes a severe performance drop, producing overly smooth images because the model is forced to learn an expectation over all possible sampling outcomes.
Model complexity trade‑off: Both excessively small and overly large shortcut models degrade the speed‑quality Pareto frontier, confirming that the bottleneck must balance expressive power and computational cost.
Conclusion
Computational redundancy in MIGM remains an under‑explored bottleneck. MIGM‑Shortcut demonstrates that learning controlled latent‑state dynamics provides a viable route to substantial acceleration and connects to emerging concepts such as latent reasoning in large language models. The approach also offers new analytical insights for the broader masked generation community.
Project page: https://kaiwen-zhu.github.io/research/migm-shortcut
Code repository: https://github.com/Kaiwen-Zhu/MIGM-Shortcut
Paper: https://arxiv.org/abs/2602.23996
AIWalker
AIWalker focuses on computer vision, image processing, color science, and AI algorithms, sharing hands‑on engineering practice and in‑depth technical insights.