Introducing FAR: A Frequency‑Progressive Autoregressive Paradigm for Image Generation

The paper presents FAR, a frequency‑aware autoregressive framework that predicts image tokens from low‑frequency to high‑frequency components using a continuous tokenizer, and demonstrates its efficiency and quality on ImageNet and text‑to‑image benchmarks compared with existing AR and VAR methods.

AIWalker

Problem and Motivation

Autoregressive (AR) image generation traditionally follows a two‑stage pipeline: (1) vector‑quantize the image into discrete tokens, (2) predict the next token in raster‑scan order. Raster scanning breaks the spatial locality prior of images and incurs O(N²) inference cost, while vector quantization introduces a mismatch between continuous image statistics and discrete token modeling. These limitations motivate a redesign of both the tokenizer format and the regression direction for visual data.

Frequency‑Aware Autoregressive (FAR) Paradigm

FAR treats an image as a hierarchy of L frequency bands obtained by Fourier transform F and inverse transform F^{-1}. Low‑frequency bands capture overall brightness, color and coarse shape; high‑frequency bands encode edges, texture and fine details. Generation proceeds from the lowest band to the highest, predicting all tokens of a band simultaneously with bidirectional attention. This respects causal ordering (each band depends only on previously generated lower bands) and reduces inference complexity from quadratic to linear in the number of tokens.
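The band hierarchy can be sketched with a 2‑D FFT. The circular low‑pass masks and the evenly spaced radius cutoffs below are illustrative assumptions for a single‑channel image, not the paper's exact filters; `frequency_bands` and `num_levels` are hypothetical names.

```python
import numpy as np

def frequency_bands(img, num_levels):
    """Return one cumulative low-pass reconstruction per frequency level.

    Level l keeps the lowest l/num_levels fraction of spatial frequencies
    via a centered circular mask in the shifted Fourier domain, so early
    levels hold coarse shape and later levels add edges and texture.
    """
    h, w = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))          # DC component at the center
    yy, xx = np.mgrid[-h // 2:(h + 1) // 2, -w // 2:(w + 1) // 2]
    radius = np.sqrt(yy ** 2 + xx ** 2)               # distance from DC
    r_max = radius.max()
    bands = []
    for l in range(1, num_levels + 1):
        mask = radius <= r_max * l / num_levels       # keep low frequencies only
        low = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
        bands.append(low)
    return bands

img = np.random.default_rng(0).standard_normal((32, 32))
bands = frequency_bands(img, num_levels=4)
```

The final level keeps every frequency, so `bands[-1]` reconstructs the input image exactly; earlier levels are progressively blurrier versions of it.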

Continuous‑Token Diffusion Loss

The Masked Autoregressive (MAR) framework is adopted: a small diffusion model measures the distance between generated continuous tokens z and ground‑truth tokens z_0. The diffusion loss is the expected mean‑squared error between the predicted noise and the true noise under a noise schedule \beta_t (DDPM [18]). During inference, a token sampler runs the reverse diffusion process, iteratively denoising from a noisy start to produce the tokens of the next frequency level.
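A minimal sketch of this noise‑prediction loss on continuous tokens, assuming the standard DDPM linear beta schedule; the function and variable names here are ours, not the paper's:

```python
import numpy as np

def diffusion_loss(z0, predict_noise, alphas_cumprod, rng):
    """Simplified DDPM-style loss: noise z0, then score the noise prediction."""
    t = int(rng.integers(len(alphas_cumprod)))       # random diffusion timestep
    eps = rng.standard_normal(z0.shape)              # true Gaussian noise
    a_bar = alphas_cumprod[t]
    z_t = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps   # forward (noising) step
    return np.mean((predict_noise(z_t, t) - eps) ** 2)        # MSE between noises

# Linear beta schedule with 1000 steps, as in DDPM.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((16, 8))                    # 16 continuous tokens, dim 8

def oracle(z_t, t):
    """A perfect predictor: invert the forward step to recover the injected noise."""
    a_bar = alphas_cumprod[t]
    return (z_t - np.sqrt(a_bar) * z0) / np.sqrt(1.0 - a_bar)

loss = diffusion_loss(z0, oracle, alphas_cumprod, rng)
```

With the oracle predictor the loss is (numerically) zero; a real model replaces `oracle` with a denoising network conditioned on the transformer's output.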

Frequency‑Aware Training Techniques

Loss weighting. Loss weights follow a sinusoidal curve w_l = \sin\left(\frac{\pi l}{L}\right) over frequency levels l, rebalancing learning across the spectrum.

Masking schedule. A frequency‑aware mask randomly hides a proportion m_l of tokens at level l, where m_l linearly decays from 0.7 (lowest level) to 0 (highest). This reduces training cost and improves sample diversity.

Diffusion‑step schedule. Early (low‑frequency) levels use fewer reverse‑diffusion steps (e.g., 100 steps) while later levels use the full 1000‑step schedule, accelerating inference without sacrificing quality.
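The three frequency‑aware schedules above can be sketched as small functions of the level index l (0 = lowest band, L−1 = highest). The sinusoidal weight and the 0.7→0 mask decay come from the text; the linear interpolation of diffusion steps between 100 and 1000 is an assumption, since only the endpoints are given.

```python
import math

def loss_weight(l, L):
    """Sinusoidal loss weight w_l = sin(pi * l / L) for frequency level l."""
    return math.sin(math.pi * l / L)

def mask_ratio(l, L):
    """Mask ratio decaying linearly from 0.7 (lowest level) to 0 (highest)."""
    return 0.7 * (L - 1 - l) / (L - 1)

def diffusion_steps(l, L, min_steps=100, max_steps=1000):
    """Fewer reverse-diffusion steps for early low-frequency levels (assumed linear)."""
    return round(min_steps + (max_steps - min_steps) * l / (L - 1))
```

For example, with L = 8 levels, the lowest band masks 70% of tokens and uses 100 steps, while the highest band masks nothing and runs the full 1000‑step schedule.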

Model Architecture

Three transformer sizes are explored:

FAR‑B: 1.72 B parameters

FAR‑L: larger (size omitted)

FAR‑XL: largest (size omitted)

A small denoising MLP (e.g., hidden size 1024) suffices because only token distributions are modeled; widening the MLP yields marginal quality gains. For text‑to‑image, a Qwen‑2 encoder (1.5 B parameters) provides cross‑attention conditioning.

Experimental Setup

Datasets. ImageNet for class‑conditional generation; JourneyDB and internal data for text‑to‑image. All images are center‑cropped and resized to a fixed resolution (e.g., 256×256).

Training hyper‑parameters. AdamW with weight decay 0.02, EMA decay 0.9999 (class‑conditional) or 0.99 (text‑conditional). Batch size 1024 for 400 epochs (class‑conditional) and batch size 512 for 100 epochs (text‑conditional). Diffusion noise schedule follows DDPM with 1000 steps.
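The EMA bookkeeping mentioned above amounts to a standard exponential moving average of the weights, kept alongside the optimizer state and used for evaluation. A minimal sketch (list-of-floats stand-in for real parameter tensors):

```python
def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: blend current weights into the running average.

    decay=0.9999 matches the class-conditional setting; text-conditional
    training uses decay=0.99, which tracks the live weights more closely.
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

ema = [0.0, 0.0]
ema = ema_update(ema, [1.0, 2.0], decay=0.9)   # illustrative decay for visibility
```

A smaller decay (0.99) lets the average adapt faster, which suits the shorter 100‑epoch text‑conditional run.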

Low‑pass filters. Two variants were tested: spatial down‑up sampling and Fourier‑domain low‑pass filtering; both gave comparable results, so the spatial version is used by default.
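The two filter variants can be sketched as follows; the nearest‑neighbor resampling and the square Fourier mask are simplifying assumptions (the paper's interpolation kernel and mask shape are not specified in this summary).

```python
import numpy as np

def lowpass_spatial(img, factor):
    """Spatial down-up sampling low pass (nearest-neighbor, for simplicity)."""
    small = img[::factor, ::factor]                       # downsample
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)  # upsample

def lowpass_fourier(img, keep_frac):
    """Fourier-domain low pass: keep only the central keep_frac of frequencies."""
    h, w = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    kh, kw = int(h * keep_frac), int(w * keep_frac)
    mask = np.zeros_like(spec)
    y0, x0 = (h - kh) // 2, (w - kw) // 2
    mask[y0:y0 + kh, x0:x0 + kw] = 1                      # central low-frequency window
    return np.fft.ifft2(np.fft.ifftshift(spec * mask)).real

img = np.random.default_rng(1).standard_normal((32, 32))
spatial = lowpass_spatial(img, factor=2)
fourier = lowpass_fourier(img, keep_frac=0.5)
```

Both variants return an image of the original size with high frequencies suppressed, which is why the paper found them interchangeable in practice.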

Results

Class‑Conditional Generation

Scaling the transformer size consistently improves all metrics (FID, IS, top‑1 accuracy, recall). For example, FAR‑B (1.72 B) achieves lower FID and higher IS than smaller baselines; enlarging to FAR‑L/XL further reduces FID by ~10 % and raises IS by ~0.3 points (Table 2). The number of sampling steps is flexible: quality degrades monotonically as steps are reduced, yet 10–15 steps already match or exceed MAR, VAR and VQGAN baselines with far fewer inference operations.

Text‑to‑Image Generation

With Qwen‑2 as the text encoder, FAR matches or surpasses DALL‑E, CogView2 and LlamaGen on MS‑COCO and GenEval while requiring substantially less compute. Qualitative examples (Figures 1, 5) show coherent compositions and fine details generated within 10 diffusion steps.

Ablation Studies

Three ablations were performed:

S1 – Simplified diffusion loss. Modeling only the low‑frequency distribution and filtering to obtain high‑frequency targets reduces optimization difficulty and still yields competitive IS/FID.

S2 – Frequency‑aware masking. The mask schedule improves training efficiency (the paper reports a noticeable speed‑up) and increases diversity, reflected in higher recall.

S3 – Frequency‑aware loss weighting. Sinusoidal weighting raises IS and lowers FID compared with uniform weighting, confirming the benefit of emphasizing high‑frequency errors.

Combined, these components produce the best trade‑off between sample quality, diversity and inference speed.

Conclusion

FAR introduces a spectral‑dependency regression direction for AR image generation and integrates a continuous tokenizer via diffusion loss. By generating images band‑by‑band, FAR preserves spatial locality, achieves linear‑time inference, and scales effectively across model sizes. Extensive experiments on ImageNet (class‑conditional) and JourneyDB (text‑to‑image) demonstrate that FAR attains state‑of‑the‑art quality with far fewer diffusion steps than prior AR, VAR or MAR approaches.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
