InfGen Enables Arbitrary-Resolution Image Generation: 4K Images in 7 Seconds, 10× Faster

InfGen introduces a resolution‑agnostic generation paradigm that replaces the VAE decoder in diffusion models, allowing any‑size image synthesis with up to ten‑fold speed gains, achieving 4K outputs in under 7 seconds while improving visual quality.


Problem

In latent diffusion models, the VAE decoder's computation grows quadratically with output resolution, pushing decoding latency for 4K images beyond 100 s.

InfGen Overview

InfGen is a two‑stage system. A diffusion model first produces a compact latent representation (e.g., 4×64×64). A secondary generator then expands this latent into an image of arbitrary resolution in a single inference step, eliminating the need to retrain the diffusion model.
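The two-stage flow can be sketched at the shape level. Everything below is illustrative: the function names, the toy nearest-neighbor expansion, and the 3-channel projection are stand-ins, not InfGen's actual modules; only the overall shape flow (compact 4×64×64 latent in, arbitrary-resolution image out, single step) follows the article.

```python
import numpy as np

def diffusion_sample(rng: np.random.Generator) -> np.ndarray:
    """Stand-in for any latent diffusion sampler (DiT, SDXL, ...):
    emits a compact 4x64x64 latent, as described in the article."""
    return rng.standard_normal((4, 64, 64)).astype(np.float32)

def infgen_decode(latent: np.ndarray, height: int, width: int) -> np.ndarray:
    """Toy decoder: nearest-neighbor expansion to the requested size.
    The real InfGen uses a learned transformer generator instead."""
    c, h, w = latent.shape
    rgb = latent[:3]                          # toy projection to 3 channels
    ys = np.arange(height) * h // height      # map output rows to latent rows
    xs = np.arange(width) * w // width        # map output cols to latent cols
    return rgb[:, ys][:, :, xs]               # (3, height, width)

rng = np.random.default_rng(0)
z = diffusion_sample(rng)
img = infgen_decode(z, 2160, 3840)            # "4K" output from a 64x64 latent
print(img.shape)                              # (3, 2160, 3840)
```

The key property the sketch captures is that the diffusion model never sees the target resolution; only the decoder does.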

Arbitrary‑Resolution Decoder Architecture

The decoder builds on a conventional VAE pipeline but inserts a transformer-based latent generator. The latent tensor provides keys and values, while mask tokens derived from the target size act as queries. Multi-head attention lets each mask token attend to the latent, and the attended mask tokens are then up-sampled to produce the final image.
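That attention step can be sketched as minimal single-head cross-attention; the real model uses multi-head attention with learned projection matrices, and the token counts and feature dimension here are made-up illustration values.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: mask-token queries attend to
    latent keys/values (single head, no learned projections)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # (Nq, Nk)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over latent tokens
    return weights @ values                           # (Nq, d)

rng = np.random.default_rng(0)
latent_tokens = rng.standard_normal((16 * 16, 32))    # flattened 16x16 latent grid
mask_tokens = rng.standard_normal((32 * 32, 32))      # queries for a 32x32 target grid

out = cross_attention(mask_tokens, latent_tokens, latent_tokens)
print(out.shape)                                      # (1024, 32)
```

Because the query count is set by the target size while the key/value count is fixed by the latent, the same latent can drive any output resolution.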

Implicit Neural Positional Embedding (INPE)

INPE generates continuous positional encodings for a variable number of mask tokens. Coordinates are normalized, mapped onto a unit sphere, transformed into high‑frequency Fourier features, and fed into an implicit neural network that outputs dynamic positional tokens.
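A hedged sketch of that pipeline follows. The spherical-angle mapping, frequency spacing, and one-layer "network" are assumptions for illustration; only the overall chain (normalize coordinates, map to a sphere, Fourier features, implicit network, one positional token per position) comes from the article.

```python
import numpy as np

def inpe(height, width, n_freqs=8, dim=32, seed=0):
    """Toy INPE: continuous positional tokens for an arbitrary grid size."""
    ys, xs = np.meshgrid(np.linspace(0, 1, height),
                         np.linspace(0, 1, width), indexing="ij")
    # Map normalized (y, x) onto the unit sphere via spherical angles
    # (assumed parameterization).
    theta, phi = ys * np.pi, xs * 2 * np.pi
    sphere = np.stack([np.sin(theta) * np.cos(phi),
                       np.sin(theta) * np.sin(phi),
                       np.cos(theta)], axis=-1)               # (H, W, 3)
    # High-frequency Fourier features at geometrically spaced frequencies.
    freqs = 2.0 ** np.arange(n_freqs)                         # (F,)
    angles = sphere[..., None] * freqs                        # (H, W, 3, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)],
                           axis=-1).reshape(height, width, -1)  # (H, W, 6F)
    # Tiny random stand-in for the implicit neural network.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((feats.shape[-1], dim)) / np.sqrt(feats.shape[-1])
    return np.tanh(feats @ w)                                 # (H, W, dim)

tokens = inpe(32, 48)
print(tokens.shape)                                           # (32, 48, 32)
```

The point of the construction is that any grid size yields a matching number of positional tokens without retraining or interpolating a fixed embedding table.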

Training Pipeline

During training, high‑resolution images are cropped and resized to a fixed size (e.g., 512×512) and encoded by a frozen VAE encoder into a compact latent. InfGen learns to map this latent to arbitrary resolutions. The loss combines adversarial, L1 reconstruction, and LPIPS perceptual terms.
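The structure of that objective can be sketched as below. The perceptual and adversarial terms are crude stand-ins (a low-pass comparison instead of real LPIPS features, a non-saturating loss on fake discriminator logits instead of a trained GAN discriminator); only the three-term composition reflects the article.

```python
import numpy as np

def l1_loss(pred, target):
    return np.abs(pred - target).mean()

def toy_perceptual(pred, target):
    """Stand-in for LPIPS: compare low-pass versions of the images.
    Real LPIPS compares deep-network feature activations instead."""
    def blur(x):
        return (x[..., :-1, :] + x[..., 1:, :]) / 2
    return np.abs(blur(pred) - blur(target)).mean()

def toy_adversarial(disc_score):
    """Non-saturating generator loss given discriminator logits:
    softplus(-score), small when the discriminator is fooled."""
    return np.log1p(np.exp(-disc_score)).mean()

def total_loss(pred, target, disc_score, w_adv=0.1, w_lpips=0.1):
    # Weights of 0.1 for the adversarial and perceptual terms follow
    # the training details reported in the article.
    return (l1_loss(pred, target)
            + w_adv * toy_adversarial(disc_score)
            + w_lpips * toy_perceptual(pred, target))

rng = np.random.default_rng(0)
pred, target = rng.random((3, 64, 64)), rng.random((3, 64, 64))
loss = total_loss(pred, target, disc_score=rng.standard_normal(16))
print(float(loss) > 0)  # True
```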

Training Details

Dataset: 10 M LAION‑Aesthetic images >2 MP, filtered to 5 M high‑resolution samples.

Two‑stage training: first stage 0.5 M iterations at 512–1024 resolution (batch 32), second stage 0.1 M iterations up to 2048 resolution (batch 8) on 8 × A100 GPUs.

Optimizer: AdamW, initial LR = 1e‑4, cosine decay.

Loss weights for adversarial and perceptual terms set to 0.1.
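The reported learning-rate schedule (AdamW, initial LR 1e-4, cosine decay) reduces to a simple curve; the floor LR of 0 and decay over the full stage-1 iteration count are assumptions, since the article states only the initial LR and the decay type.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

total = 500_000  # stage-1 iterations per the article
print(round(cosine_lr(0, total), 6))      # 0.0001 at the start
print(round(cosine_lr(total, total), 6))  # 0.0 at the end
```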

Inference Process

At inference, the latent produced by any compatible diffusion model (e.g., DiT, SDXL, SiT, FiT) is fed to InfGen, which generates an image at the requested resolution. An iterative, training‑free up‑sampling scheme can further extrapolate beyond the training resolution, enabling 4K generation from a low‑resolution latent.
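The iterative extrapolation idea can be sketched as repeated 2× steps until the target size is reached. The nearest-neighbor doubling below is a toy stand-in (InfGen would apply its learned generator at each step), and the specific sizes are illustrative.

```python
import numpy as np

def upsample2x(img):
    """Toy 2x nearest-neighbor step; a stand-in for one training-free
    extrapolation pass of the real decoder."""
    return np.repeat(np.repeat(img, 2, axis=-2), 2, axis=-1)

def iterative_extrapolate(img, target_size):
    """Repeatedly double resolution until reaching target_size."""
    sizes = [img.shape[-1]]
    while img.shape[-1] < target_size:
        img = upsample2x(img)
        sizes.append(img.shape[-1])
    return img, sizes

base_img = np.zeros((1, 64, 64))            # decode within the trained range first
img, sizes = iterative_extrapolate(base_img, 512)
print(sizes)                                # [64, 128, 256, 512]
```

Each step stays within a modest scale factor, which is what lets the scheme reach resolutions beyond those seen in training without any fine-tuning.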

Evaluation Metrics

Metrics include FID, sFID, precision, recall, PSNR, and SSIM. For high‑resolution evaluation, images are tiled (e.g., 256×256 patches) to compute FIDp and sFIDp, avoiding down‑sampling artifacts.
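The tiling step behind FIDp/sFIDp can be sketched as cutting an image into non-overlapping 256×256 crops, so the metric sees full-detail patches rather than a down-sampled whole image. The reshape/transpose approach and the multiple-of-patch assumption are implementation choices of this sketch, not the paper's stated procedure.

```python
import numpy as np

def tile(img, patch=256):
    """Split a (C, H, W) image into non-overlapping (C, patch, patch) crops."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "crop to a multiple of patch first"
    patches = img.reshape(c, h // patch, patch, w // patch, patch)
    # Reorder to (rows, cols, C, patch, patch), then flatten the grid.
    return patches.transpose(1, 3, 0, 2, 4).reshape(-1, c, patch, patch)

img = np.zeros((3, 1024, 1536))
p = tile(img)
print(p.shape)  # (24, 3, 256, 256): a 4x6 grid of patches
```

FID statistics are then computed over these patch sets for real and generated images alike.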

Comparison with Other Tokenizers

InfGen’s tokenizers are benchmarked against discrete tokenizers (VQGAN) and continuous tokenizers (SD VAE, SDXL VAE). Despite being trained for the harder arbitrary-resolution task, InfGen matches or exceeds VAE reconstruction performance on both object-centric and scene-centric LAION subsets.

Performance Boost for Diffusion Models

InfGen is applied as a plug‑in to models such as DiT‑XL/2, SiT‑XL/2, MaskDiT, MDTv2, and FiTv2. Replacing the VAE decoder enables arbitrary‑resolution output without additional training. Quantitative results show up to 41 % FID improvement for DiT at 4× up‑sampling and average gains of 8‑42 % across five evaluated resolutions.

Visual Comparison

Side‑by‑side images demonstrate that baseline SD1.5 produces blurry textures at high resolution, whereas InfGen yields semantically coherent details (e.g., panda, cat, lion) even at extreme scales.

Comparison with State‑of‑the‑Art Methods

InfGen is compared against training-free methods such as ScaleCraft and trained methods such as UltraPixel and Inf-DiT. At 2K and 4K resolutions, InfGen + SD1.5 achieves competitive FID, sFID, precision, and recall while generating a 4K image in about 5 s, roughly four times faster than UltraPixel.

Conclusion

InfGen provides an efficient framework for arbitrary-resolution image synthesis, eliminating the quadratic cost of high-resolution diffusion decoding. By training a secondary generator in the compact latent space, it decodes low-resolution latents into images of any size without altering the original diffusion model. Experiments confirm superior quality and up to ten-fold faster inference, with 4K generation in under 7 seconds.

References

[1] InfGen: A Resolution‑Agnostic Paradigm for Scalable Image Synthesis

Written by

AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
