FlexVAR: Autoregressive Image Generation with Inpainting and Speed‑Quality Control

FlexVAR replaces residual prediction with direct ground‑truth prediction in visual autoregressive modeling, enabling generation of arbitrary resolutions and aspect ratios, supporting image‑to‑image tasks such as inpainting and upscaling, and offering adjustable inference steps that trade speed for quality while achieving state‑of‑the‑art FID scores.

AIWalker

FlexVAR introduces a new visual autoregressive modeling paradigm that discards residual prediction and directly predicts ground‑truth values at each step, ensuring semantic continuity across scales.

Problem Statement

Existing visual autoregressive models (e.g., VAR) are limited to fixed resolutions and lack flexibility for generating images of varying sizes and aspect ratios.

Residual‑based prediction relies on a fixed step design, restricting adaptability and causing semantic discontinuity across scales.

Proposed Solution

FlexVAR model: predicts the ground‑truth image at each scale instead of a residual, so every autoregressive step can produce a plausible image on its own.

Scalable VQVAE tokenizer: a tokenizer that quantizes images into multi‑scale tokens and reconstructs them at arbitrary resolutions.

Expandable 2D positional embedding: a learnable, query‑based 2D embedding that can be up‑ or down‑sampled, allowing the model to operate at resolutions and step counts not seen during training.
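To make the idea of an up‑ or down‑sampled positional embedding concrete, here is a minimal sketch that resamples a small grid of embedding vectors to a new height and width. Bilinear interpolation with aligned corners, the function name resize_pos_embed, and the toy values are all assumptions for illustration; the summary does not state which interpolation FlexVAR actually uses.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as [row][column][channel]


def resize_pos_embed(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Bilinearly resample a 2D grid of embedding vectors to (out_h, out_w).

    Hypothetical helper: the interpolation mode is an assumption.
    """
    in_h, in_w, dim = len(grid), len(grid[0]), len(grid[0][0])

    def src_coord(i: int, n_out: int, n_in: int) -> float:
        # Map an output index to a source coordinate so that corners align.
        return 0.0 if n_out == 1 else i * (n_in - 1) / (n_out - 1)

    out: Grid = []
    for i in range(out_h):
        y = src_coord(i, out_h, in_h)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = src_coord(j, out_w, in_w)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            vec = []
            for c in range(dim):
                top = (1 - wx) * grid[y0][x0][c] + wx * grid[y0][x1][c]
                bottom = (1 - wx) * grid[y1][x0][c] + wx * grid[y1][x1][c]
                vec.append((1 - wy) * top + wy * bottom)
            row.append(vec)
        out.append(row)
    return out


# A toy 2x2 grid with 1-dimensional "embeddings" (values are arbitrary).
base = [[[0.0], [1.0]], [[2.0], [3.0]]]
# Resample to a non-square 3x5 grid, as a wide image would require.
resized = resize_pos_embed(base, 3, 5)
```

Because the same learned grid can be resampled to any shape, one set of parameters can serve token maps of different sizes and aspect ratios.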

Technical Details

The FlexVAR pipeline consists of two main components:

Scalable VQVAE tokenizer: The image is encoded into a latent space, and the latent is then down‑sampled to multiple randomly chosen scales to obtain multi‑scale latent features. A codebook of 8,912 learnable vectors (dimension 32) quantizes each latent vector to its nearest code. The decoder reconstructs the image at each scale, and training uses the same loss as LlamaGen.
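The nearest‑code lookup described above can be sketched in a few lines. This is a generic vector‑quantization example under stated assumptions, not FlexVAR's code: squared Euclidean distance, the helper names nearest_code and quantize, and the tiny 3‑entry, 2‑dimensional codebook are all illustrative.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec
    (squared Euclidean distance; the metric is an assumption)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map every latent vector to the index (token) of its nearest code."""
    return [nearest_code(vec, codebook) for vec in latents]


# Toy codebook with 3 entries of dimension 2 (the tokenizer described in the
# text uses a far larger codebook with 32-dimensional entries).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [(0.9, 0.1), (0.1, 0.2), (0.2, 0.8)]
tokens = quantize(latents, codebook)  # one token index per latent vector
```

Applying the same lookup to the latent map at every sampled scale yields the multi‑scale token maps that the transformer is trained on.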

FlexVAR transformer: A transformer models the probability distribution of the multi‑scale latent tokens. It does not use residual prediction and instead predicts the ground‑truth tokens at each scale directly. The model comes in three sizes (depth 16, 20, and 24), trained on ImageNet‑1K at 256×256 on 80 GB A100 GPUs with the AdamW optimizer (weight decay 0.05), batch size 128, and 20 epochs for the tokenizer.

During training, the scale of each step is randomly sampled, with at most 10 steps; the first step is fixed and the last step matches the input resolution. Each step is dropped with 5% probability, so a training sample uses between 6 and 10 steps. At inference, the default is 10 steps, but more steps (e.g., 13) can be used to improve quality.
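The sampling procedure above can be sketched as follows. The text only gives the constraints (at most 10 steps, fixed first step, last step equal to the input resolution, 5% drop probability, 6 to 10 steps overall); everything else here is an assumption made for illustration, including the function name sample_scale_schedule, measuring scales as token‑grid side lengths, a first side of 1, a final side of 16, uniform sampling of intermediate sides, and the way the lower bound of 6 steps is enforced.

```python
import random
from typing import List, Optional


def sample_scale_schedule(
    final_side: int,
    max_steps: int = 10,
    min_steps: int = 6,
    drop_prob: float = 0.05,
    first_side: int = 1,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Sample an increasing list of token-grid side lengths for one sample.

    Hypothetical sketch: the sampling distribution and the handling of the
    minimum step count are assumptions, not the paper's exact recipe.
    """
    rng = rng or random.Random()
    candidates = list(range(first_side + 1, final_side))
    if len(candidates) < max_steps - 2:
        raise ValueError("final_side is too small for max_steps distinct scales")

    # Draw distinct intermediate scales between the fixed first and last step.
    middle = sorted(rng.sample(candidates, max_steps - 2))

    kept: List[int] = []
    for index, side in enumerate(middle):
        remaining_after = len(middle) - index - 1
        # Only allow a drop if min_steps can still be reached afterwards.
        can_drop = len(kept) + remaining_after + 2 >= min_steps
        if can_drop and rng.random() < drop_prob:
            continue
        kept.append(side)
    return [first_side] + kept + [final_side]


schedule = sample_scale_schedule(16, rng=random.Random(0))
```

Each call returns a fresh schedule, so across training the model sees many different step counts and scale combinations rather than one fixed ladder.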

Experiments

ImageNet‑1K 256×256 benchmark: FlexVAR (1.0 B parameters) outperforms VAR and other autoregressive baselines, with FID reductions of 0.45, 0.56, and 0.12 relative to VAR across the three model scales. With 13 inference steps, FlexVAR reaches FID 2.08 and IS 315, surpassing strong diffusion models.

Zero‑shot transfer to 512×512: FlexVAR‑d24, trained only at resolutions up to 256×256, performs comparably to a fully supervised 2.3 B‑parameter VAR model on the 512×512 benchmark.

Ablation studies: Component‑wise ablations start from the VAR baseline and add each design choice incrementally. Removing residual prediction, using the scalable VQVAE, and adding the expandable positional embedding each reduce FID (e.g., adding the positional embedding brings FID down to 3.71).

Mamba adaptation: A Mamba‑based FlexVAR variant performs competitively at a similar parameter count, confirming that the ground‑truth prediction paradigm also works with linear‑complexity architectures. It was not adopted because it offered no speed advantage.

Analysis of Ground‑Truth Prediction

Training loss curves show that FlexVAR converges faster and to a lower loss than VAR, indicating that ground‑truth prediction is easier to optimize than residual prediction, which suffers from semantic discontinuity across scales.
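The difference between the two kinds of training target can be illustrated on a toy 1D "latent". This is a simplified sketch, not either model's implementation: quantization is omitted, average pooling and nearest‑neighbour upsampling stand in for whatever resampling the real models use, and the helper names are hypothetical.

```python
from typing import List


def downsample(x: List[float], n: int) -> List[float]:
    """Average-pool x to length n (len(x) must be divisible by n)."""
    k = len(x) // n
    return [sum(x[i * k:(i + 1) * k]) / k for i in range(n)]


def upsample(x: List[float], n: int) -> List[float]:
    """Nearest-neighbour upsample x to length n (n divisible by len(x))."""
    k = n // len(x)
    return [v for v in x for _ in range(k)]


def ground_truth_targets(latent: List[float], scales: List[int]) -> List[List[float]]:
    """Ground-truth style: every scale is a coarse copy of the latent itself."""
    return [downsample(latent, s) for s in scales]


def residual_targets(latent: List[float], scales: List[int]) -> List[List[float]]:
    """Residual style: every scale encodes what earlier scales still miss."""
    full = len(latent)
    recon = [0.0] * full
    targets = []
    for s in scales:
        residual = [a - b for a, b in zip(latent, recon)]
        target = downsample(residual, s)
        targets.append(target)
        recon = [r + u for r, u in zip(recon, upsample(target, full))]
    return targets


latent = [1.0, 3.0, 2.0, 6.0, 5.0, 7.0, 4.0, 8.0]
scales = [1, 2, 4, 8]
gt = ground_truth_targets(latent, scales)
res = residual_targets(latent, scales)

# Summing the upsampled residual targets recovers the latent, whereas any
# single ground-truth target is already a coarse version of it.
summed = [sum(vals) for vals in zip(*(upsample(t, len(latent)) for t in res))]
```

In this toy setting the residual target at an intermediate scale is only a correction term, while the ground‑truth target at the same scale is a usable low‑resolution signal on its own, which is the property the article credits for FlexVAR's flexibility in choosing steps.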

Flexibility Demonstrations

Generating arbitrary resolutions: By adjusting the number of inference steps, FlexVAR can synthesize images at resolutions beyond the training limit, maintaining semantic consistency.

Aspect‑ratio control: The model can produce images with varied width‑height ratios (e.g., 1:2) by controlling the aspect ratio at each step.

Step‑length control: Experiments with 6–16 steps show a monotonic improvement in FID and IS; larger models (depth 24) benefit more from additional steps.

Image‑to‑image tasks: Without fine‑tuning, FlexVAR performs inpainting, outpainting, and super‑resolution by teacher‑forcing ground‑truth tokens outside the mask and injecting class labels.
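Teacher‑forcing tokens outside the mask amounts to a per‑scale merge of known and generated tokens. The sketch below shows that merge on a toy token grid. The helper names, the block‑wise "any pixel masked" rule for shrinking the mask, and the token values are assumptions for illustration; the summary does not describe how FlexVAR resizes the mask at each scale.

```python
from typing import List

TokenMap = List[List[int]]
Mask = List[List[bool]]


def downsample_mask(mask: Mask, side: int) -> Mask:
    """Shrink a square boolean mask to side x side.

    A cell counts as masked if any pixel in its block is masked (an assumed,
    conservative choice). len(mask) must be divisible by side.
    """
    block = len(mask) // side
    return [
        [
            any(
                mask[r][c]
                for r in range(i * block, (i + 1) * block)
                for c in range(j * block, (j + 1) * block)
            )
            for j in range(side)
        ]
        for i in range(side)
    ]


def merge_tokens(predicted: TokenMap, known: TokenMap, mask: Mask) -> TokenMap:
    """Keep predicted tokens inside the mask and known tokens elsewhere."""
    return [
        [p if m else k for p, k, m in zip(p_row, k_row, m_row)]
        for p_row, k_row, m_row in zip(predicted, known, mask)
    ]


# Toy example: a 4x4 pixel mask whose top-left quadrant should be regenerated.
pixel_mask = [[r < 2 and c < 2 for c in range(4)] for r in range(4)]
token_mask = downsample_mask(pixel_mask, 2)  # 2x2 token grid at this scale
known = [[10, 11], [12, 13]]                 # tokens from the source image
predicted = [[90, 91], [92, 93]]             # tokens sampled by the model
merged = merge_tokens(predicted, known, token_mask)
```

Repeating this merge at every scale keeps the unmasked content fixed while the model fills in only the masked region. For outpainting the mask would cover the new border area instead.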

Failure cases: When generating images at three times the training resolution, wave‑like artifacts and blurriness appear, likely because ImageNet‑1K is relatively homogeneous and lacks multi‑scale detail.

Conclusion

FlexVAR presents a flexible visual autoregressive framework that eliminates residual prediction, introduces a scalable VQVAE tokenizer, and employs an expandable 2D positional embedding. This design grants the model the ability to generate images of unseen resolutions, aspect ratios, and inference step counts, while achieving state‑of‑the‑art results on ImageNet benchmarks and supporting zero‑shot transfer to various image‑to‑image tasks.

Limitations

High‑resolution generation (≥3× training size) suffers from wave artifacts, suggesting the need for more diverse, multi‑scale training data.

References

[1] FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
