DC‑AE: A 128× Downsampling Autoencoder that Super‑Charges High‑Resolution Diffusion Models
DC‑AE introduces Residual Autoencoding and Decoupled High‑Resolution Adaptation to achieve up to 128× spatial compression in autoencoders, preserving reconstruction quality while delivering roughly 19× inference and 18× training speedups for high‑resolution diffusion models, as demonstrated on ImageNet and other benchmarks.
TL;DR
DC‑AE proposes a new autoencoder that can downsample images by up to 128×, dramatically accelerating high‑resolution diffusion models without sacrificing reconstruction fidelity. It achieves up to 19.1× inference speedup and 17.9× training speedup on ImageNet 512×512, thanks to two key techniques: Residual Autoencoding and Decoupled High‑Resolution Adaptation.
Background
Latent diffusion models (LDMs) rely on an autoencoder with 8× spatial compression (f8) to map images into a latent space, reducing the cost of diffusion. For high-resolution generation, higher compression ratios (e.g., 64×, 128×) are desirable, but they traditionally cause severe loss of reconstruction quality.
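The motivation is easy to see with a bit of arithmetic: the number of latent tokens a diffusion transformer must process shrinks quadratically with the downsampling factor f. A small illustrative sketch (the function name and patch-size handling are my own, not from the paper):

```python
def latent_tokens(image_size: int, f: int, patch_size: int = 1) -> int:
    """Sequence length a diffusion transformer sees after an f-times
    downsampling autoencoder, optionally followed by patchification."""
    side = image_size // f // patch_size
    return side * side

# Token counts for a 512x512 image at different compression ratios:
for f in (8, 64, 128):
    print(f"f{f}: {512 // f}x{512 // f} latent -> {latent_tokens(512, f)} tokens")
```

Going from f8 to f64 cuts the sequence from 4096 tokens to 64, and since self-attention cost grows quadratically with sequence length, the potential savings are large — if reconstruction quality can be preserved.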
Challenges of High‑Compression Autoencoders
Experiments on ImageNet 256×256 show that increasing the compression from f8 to f64 raises rFID from 0.90 to 28.3, indicating a dramatic drop in quality. Even when the high‑compression autoencoder inherits the low‑compression architecture as a sub‑network, optimization becomes much harder, and rFID remains far above the f8 baseline.
DC‑AE Design
1. Residual Autoencoding
DC-AE adds a non-parametric shortcut to both the downsample and upsample blocks. Because these blocks change both spatial and channel dimensions, a plain identity mapping cannot serve as the residual path; instead, the shortcut performs a space-to-channel (or channel-to-space) reshaping followed by channel averaging (or duplication) to match the output dimensions.
Downsample block: Space‑to‑Channel + Channel‑Average shortcut.
Upsample block: Channel‑to‑Space + Channel‑Duplicating shortcut.
Figures illustrate the shortcut shapes and demonstrate that this residual path markedly improves rFID for high‑compression settings.
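The shortcuts can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions (function names, a 2× stride, and consecutive-channel grouping for the averaging/duplication step); the paper's actual implementation lives inside a PyTorch network:

```python
import numpy as np

def space_to_channel(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Pixel-unshuffle: (C, H, W) -> (C*r*r, H//r, W//r)."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)
    return x.reshape(c * r * r, h // r, w // r)

def channel_to_space(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Pixel-shuffle, the exact inverse of space_to_channel."""
    c, h, w = x.shape
    x = x.reshape(c // (r * r), r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(c // (r * r), h * r, w * r)

def downsample_shortcut(x: np.ndarray, out_channels: int, r: int = 2) -> np.ndarray:
    """Residual path of a downsample block: space-to-channel, then
    average consecutive channel groups down to out_channels."""
    y = space_to_channel(x, r)                      # (C*r*r, H/r, W/r)
    groups = y.shape[0] // out_channels             # assumes divisibility
    return y.reshape(out_channels, groups, *y.shape[1:]).mean(axis=1)

def upsample_shortcut(x: np.ndarray, out_channels: int, r: int = 2) -> np.ndarray:
    """Residual path of an upsample block: duplicate channels, then
    channel-to-space back to a larger spatial grid."""
    n = out_channels * r * r // x.shape[0]          # duplication factor
    return channel_to_space(np.repeat(x, n, axis=0), r)
```

Because both paths are pure reshapes plus averaging/duplication, they add no parameters; they simply give the learned blocks an easy near-identity function to refine, which is what makes optimization tractable at high compression ratios.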
2. Decoupled High‑Resolution Adaptation
Training a high‑compression autoencoder directly on high‑resolution images is prohibitively expensive and GAN‑based losses become unstable. DC‑AE splits training into three phases to mitigate these issues:
Phase 1: Standard reconstruction loss on low‑resolution data.
Phase 2: High‑resolution latent adaptation (no GAN loss) that fine‑tunes only the encoder head and decoder input layers, cutting memory usage from 154 GB to 68 GB.
Phase 3: Local refinement with GAN loss applied solely to the decoder head on low‑resolution data, improving local detail while keeping training cost low.
This decoupling alleviates the “generalization penalty” of high‑compression autoencoders while keeping training affordable.
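The three-phase schedule can be summarized as a simple configuration table. The module names below (`encoder.head`, `decoder.input`, `decoder.head`) are hypothetical labels for the parameter subsets the article describes, not identifiers from the DC-AE codebase:

```python
# Schematic of DC-AE's decoupled training schedule. The key point is
# phases 2 and 3 each train only a small parameter subset, which is
# what keeps high-resolution adaptation and GAN refinement affordable.
PHASES = {
    "phase1_low_res_reconstruction": {
        "data": "low-res images",
        "losses": ["reconstruction"],
        "trainable": ["encoder", "decoder"],             # full model
    },
    "phase2_high_res_latent_adaptation": {
        "data": "high-res images",
        "losses": ["reconstruction"],                    # GAN loss disabled
        "trainable": ["encoder.head", "decoder.input"],  # small subset -> less memory
    },
    "phase3_local_refinement": {
        "data": "low-res images",
        "losses": ["reconstruction", "gan"],
        "trainable": ["decoder.head"],                   # local detail only
    },
}

def trainable_params(phase: str) -> list[str]:
    """Which parameter groups are unfrozen in a given phase."""
    return PHASES[phase]["trainable"]
```

In a real training loop, each phase would freeze all other parameters (e.g., by disabling gradients on them) before optimization resumes.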
Experimental Setup
The authors train the autoencoders on a mixed dataset comprising ImageNet, SAM, Mapillary Vistas, and FFHQ. DC‑AE is evaluated with Diffusion Transformer models (DiT, U‑ViT, U‑SiT) at 512×512, 1024×1024, and 2048×2048 resolutions. Baselines include SD‑VAE‑f8 and SD‑VAE‑f64.
Results
On ImageNet 512×512, DC‑AE on an H100 GPU provides a 19.1× inference speedup and a 17.9× training speedup for U‑ViT‑H compared with the SD‑VAE‑f8 baseline, while maintaining comparable reconstruction quality.
Figures 8-12 show that DC-AE consistently outperforms SD-VAE-f8 across all settings in FID, CLIP score, and throughput. Notably, DC-AE-f32 p1 reaches 1.72 FID on ImageNet 512×512, surpassing state-of-the-art diffusion models (EDM2-XXL) and autoregressive models (MAGVIT-v2, MAR-L). For DiT-XL, throughput improvements of 4.5× in training and 4.8× in inference are reported.
Conclusion
DC‑AE proves that ultra‑high‑ratio (up to 128×) downsampling is feasible when Residual Autoencoding and Decoupled High‑Resolution Adaptation are employed. The approach delivers substantial speedups for high‑resolution diffusion models without compromising image quality, opening a new direction for accelerating generative pipelines.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.