How VARSR Redefines Image Super‑Resolution with Autoregressive Modeling
The VARSR algorithm introduces visual autoregressive modeling to image super‑resolution. By combining prefix tokens, scale‑aligned rotary positional encodings, quantization‑error correction, and image‑quality‑guided classifier‑free guidance, it achieves faster inference and superior visual fidelity, as demonstrated by the extensive experiments reported in the ICML 2025 paper.
Background
Autoregressive (AR) modeling, which generates a sequence by predicting the next token, has driven breakthroughs in large language models such as GPT and LLaMA. Inspired by this success, the research community has begun applying AR modeling to vision tasks, where it has shown great potential in image generation (e.g., DALL·E, GPT‑4o). Compared with diffusion models, AR modeling captures multimodal information more effectively and avoids the randomness of noise sampling, yielding more stable outputs.
To bring these advantages to image/video super‑resolution (SR), Kuaishou Audio‑Video Technology together with Tsinghua University proposed the VARSR algorithm. The related paper "Visual Autoregressive Modeling for Image Super‑Resolution" has been accepted to ICML 2025.
Method
1. Prefix Tokens – Low‑resolution images are encoded into token maps that are fixed as prefix tokens for all subsequent scale predictions, improving the efficiency and consistency of semantic fusion (see the first sketch after this list).
2. Scale‑Aligned Rotary Positional Encoding – Tokens at each scale are encoded with rotary positional encodings aligned to their original 2‑D image coordinates, preserving spatial structure across scales (see the second sketch after this list).
3. Quantization‑Error Corrector – Discretizing images into tokens causes detail loss; a lightweight diffusion model predicts the quantization error and adds the residual to the final scale prediction, enhancing texture fidelity (see the third sketch after this list).
4. Image‑Based Classifier‑Free Guidance (CFG) – During training, images are split into high‑quality and low‑quality groups, each assigned a positive or negative embedding. At inference, a guidance scale balances realism and fidelity, enabling the model to generate higher‑quality outputs (see the fourth sketch after this list).
5. Large‑Scale High‑Quality Dataset – Over 4 million high‑quality images were collected and filtered to pre‑train a class‑to‑image base model (VAVQE), which was then fine‑tuned into the VARSR super‑resolution model.
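To make the prefix-token idea in item 1 concrete, here is a minimal sketch of a block‑causal attention mask in which the low‑resolution token map acts as a fixed prefix visible to every scale. The scale sizes, prefix length, and function name are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: the LR prefix is one block that every scale can attend to,
# and each scale additionally sees all earlier scales (block-causal attention).
import torch

def prefix_scale_attention_mask(prefix_len, scale_sizes):
    """Return a boolean mask (True = may attend) over the sequence
    [prefix | scale_1 | ... | scale_K]."""
    lengths = [prefix_len] + [h * w for h, w in scale_sizes]
    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in lengths:
        end = start + length
        # each block (prefix or scale) attends to itself and to all earlier blocks
        mask[start:end, :end] = True
        start = end
    return mask

# Example: 256 LR prefix tokens followed by 1x1, 2x2, and 4x4 scale maps.
mask = prefix_scale_attention_mask(256, [(1, 1), (2, 2), (4, 4)])
print(mask.shape)   # torch.Size([277, 277])
```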
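For item 2, the sketch below illustrates one way to align rotary positional encodings across scales: each token at a coarse scale is assigned the 2‑D coordinate of its center on the highest‑resolution grid, so the same image location receives the same rotary phase at every scale. The frequency layout, head dimension, and scale sizes are assumptions for illustration, not the paper's exact design.

```python
import torch

def scale_aligned_coords(scale_hw, target_hw):
    """Map every token of an (h, w) scale map to its center coordinate on the
    target (highest-resolution) grid, so positions stay aligned across scales."""
    h, w = scale_hw
    H, W = target_hw
    ys = (torch.arange(h) + 0.5) * (H / h)   # token-center rows on the target grid
    xs = (torch.arange(w) + 0.5) * (W / w)   # token-center cols on the target grid
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([yy.flatten(), xx.flatten()], dim=-1)  # (h*w, 2)

def rope_2d(coords, head_dim, base=10000.0):
    """Build per-token cos/sin tables: half the head dim rotates with the
    y coordinate, the other half with the x coordinate."""
    quarter = head_dim // 4
    freqs = base ** (-torch.arange(quarter) / quarter)                            # (quarter,)
    angles = torch.cat([coords[:, :1] * freqs, coords[:, 1:] * freqs], dim=-1)    # (N, head_dim//2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    """Rotate consecutive channel pairs of x (N, head_dim) by the per-token angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: queries from a 4x4 scale and keys from an 8x8 scale share the same
# positional frame because both are expressed in 16x16 target coordinates.
head_dim, target = 64, (16, 16)
q = torch.randn(4 * 4, head_dim)
k = torch.randn(8 * 8, head_dim)
q = apply_rope(q, *rope_2d(scale_aligned_coords((4, 4), target), head_dim))
k = apply_rope(k, *rope_2d(scale_aligned_coords((8, 8), target), head_dim))
```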
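For item 3, this is a simplified sketch of the quantization‑error‑corrector idea: a small denoising network predicts the continuous residual lost by vector quantization, and the residual is added back to the quantized latent before decoding. The network architecture, step count, and toy noise schedule are assumptions; the paper uses a proper lightweight diffusion model.

```python
import torch
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    """Tiny stand-in for the lightweight diffusion model: given a noisy residual,
    the quantized latent, and a timestep, predict the clean residual."""
    def __init__(self, channels=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, noisy_residual, quant_latent, t):
        t_map = torch.full_like(noisy_residual[:, :1], t)   # broadcast timestep as an extra channel
        return self.net(torch.cat([noisy_residual, quant_latent, t_map], dim=1))

@torch.no_grad()
def correct_quantization(quant_latent, denoiser, steps=4):
    """Run a few denoising steps starting from noise, then add the predicted
    residual to the quantized latent (simplified sampler, not the paper's exact one)."""
    residual = torch.randn_like(quant_latent)
    for i in reversed(range(steps)):
        t = (i + 1) / steps
        pred = denoiser(residual, quant_latent, t)
        residual = pred + torch.randn_like(pred) * (i / steps) * 0.1  # shrink injected noise each step
    return quant_latent + residual   # corrected latent fed to the VAE decoder

# Toy usage on an 8-channel, 16x16 latent map.
quant_latent = torch.randn(1, 8, 16, 16)
corrected = correct_quantization(quant_latent, ResidualDenoiser(channels=8))
```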
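For item 4, the following sketch shows image‑quality classifier‑free guidance at sampling time: the model is run once with the positive (high‑quality) embedding and once with the negative (low‑quality) embedding, and the logits are extrapolated toward the high‑quality direction before sampling. The model interface, embedding names, and guidance‑scale value are illustrative assumptions.

```python
import torch

def quality_guided_next_tokens(model, tokens, hq_emb, lq_emb, guidance_scale=2.0):
    """Combine predictions conditioned on the high-quality and low-quality
    embeddings; a larger guidance scale pushes outputs toward realism,
    a smaller one stays closer to the input (fidelity)."""
    logits_hq = model(tokens, cond=hq_emb)   # forward pass with the positive embedding
    logits_lq = model(tokens, cond=lq_emb)   # forward pass with the negative embedding
    logits = logits_lq + guidance_scale * (logits_hq - logits_lq)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # sample the next-scale tokens

# Toy usage with a stand-in model that just projects the conditioning vector.
vocab, dim = 4096, 32
proj = torch.nn.Linear(dim, vocab)
model = lambda tokens, cond: proj(cond).expand(tokens.shape[0], vocab)
tokens = torch.zeros(16, dtype=torch.long)             # 16 tokens at the current scale
hq_emb, lq_emb = torch.randn(dim), torch.randn(dim)    # learned quality embeddings
next_tokens = quality_guided_next_tokens(model, tokens, hq_emb, lq_emb)
```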
Experiments
VARSR was evaluated on DIV2K, RealSR, DRealSR, and other standard SR benchmarks. Compared with representative GAN‑based and diffusion‑based methods, VARSR achieved:
- The best scores on no‑reference IQA metrics (MANIQA, CLIPIQA, MUSIQ), indicating higher perceptual quality.
- Competitive or superior PSNR, SSIM, and DISTS scores, especially on the real‑world datasets.
- An inference time of 0.59 s per image (10.1 % of the time required by typical diffusion methods) with only 10 autoregressive scale steps.
- Accurate reconstruction of fine structures such as traffic‑light colors, building textures, and animal fur.
Qualitative comparisons (Fig. 4–5) further show that VARSR's base model generates clearer textures than open‑source baselines.
Conclusion and Outlook
The proposed VARSR demonstrates that autoregressive generation, combined with prefix token fusion, scale‑aligned rotary encodings, quantization‑error correction, and image‑quality‑aware guidance, can achieve state‑of‑the‑art performance on image super‑resolution while being significantly faster than diffusion approaches. VARSR is already deployed in Kuaishou’s video processing pipeline, enhancing video clarity and user experience, and the team plans to extend the technique to broader multimedia applications.