Pure 3×3 Convolutions for Image‑Generation Diffusion Models: The DiC Approach

The paper introduces DiC, a fully convolutional diffusion model that rethinks 3×3 convolutions, adds sparse skip connections, stage‑specific embeddings and conditional gating, and demonstrates superior FID/IS scores and faster inference compared to diffusion Transformers across multiple scales.

AIWalker

Jan 14, 2025

Background and Motivation

Recent diffusion models have shifted from traditional U‑Net CNN‑attention hybrids to fully Transformer‑based architectures. These achieve strong generation quality, but their self‑attention computation is expensive and slows inference. The authors observe that the simple, hardware‑friendly 3×3 stride‑1 convolution, especially when accelerated by the Winograd algorithm, offers a promising alternative.
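To make the Winograd remark concrete, here is a minimal sketch of the standard 1D Winograd F(2,3) transform, which produces two outputs of a 3‑tap filter with four multiplications instead of six. It only illustrates the general technique and is not the kernel used by the authors or by any particular library; the function names are hypothetical.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap correlation over a 4-sample tile: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs via the Winograd F(2,3) transform: 4 multiplications."""
    # Filter transform (can be precomputed once per filter).
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform: additions and subtractions only.
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = d[2] - d[1]
    v3 = d[1] - d[3]
    # The four element-wise multiplications.
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3
    # Output transform.
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Both functions return the same values up to floating‑point rounding; the saving comes from moving work out of the multiplications and into cheap additions plus a filter transform that is computed once.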

Re‑thinking 3×3 Convolutions

To address the limited receptive field of isolated 3×3 convolutions, the authors evaluate three architecture families: isotropic (e.g., DiT) [3], isotropic with skip connections (e.g., U‑ViT) [2], and encoder‑decoder hourglass (e.g., ADM) [11]. Experiments show that the hourglass design, which downsamples in the encoder and upsamples in the decoder, substantially enlarges the effective receptive field and is essential for pure‑conv diffusion models.
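The effect of downsampling on receptive field can be sketched with the usual recurrence for a chain of layers. This computes the theoretical receptive field only (the effective receptive field measured in practice is typically smaller), and the layer counts and the 2× downsampling operator below are assumptions chosen for illustration, not the DiC configuration.

```python
def receptive_field(layers):
    """Theoretical receptive field of a chain of (kernel, stride) layers."""
    rf, jump = 1, 1  # jump: input-pixel distance between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


conv = (3, 1)  # 3x3 stride-1 convolution
down = (2, 2)  # assumed 2x downsampling step (hypothetical choice)

# Twelve 3x3 convolutions at a single resolution.
isotropic = [conv] * 12
# The same twelve convolutions split over three encoder stages.
encoder = [conv] * 4 + [down] + [conv] * 4 + [down] + [conv] * 4

print(receptive_field(isotropic))  # 25
print(receptive_field(encoder))    # 60
```

With the same number of 3×3 convolutions, the staged encoder covers more than twice the span per axis, because each convolution after a downsampling step advances the receptive field by a larger stride in input pixels.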

However, dense skip connections become a bottleneck in a deep hourglass, because concatenating features along the channel dimension across many blocks raises computational cost. The paper therefore proposes Sparse Skip Connections, applying a skip only every few blocks, which preserves information flow while reducing overhead.
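One way to picture the difference between dense and sparse skips is as a schedule of encoder‑decoder pairs. The sketch below assumes a symmetric hourglass with mirrored pairing and a fixed spacing `every`; both the pairing rule and the parameter name are hypothetical, since this summary does not state the exact spacing DiC uses.

```python
def skip_pairs(num_blocks, every=1):
    """Return (encoder_idx, decoder_idx) pairs for a symmetric hourglass.

    every=1 gives dense skips; every=k keeps one skip per k encoder blocks.
    """
    pairs = []
    for enc in range(num_blocks):
        if (enc + 1) % every == 0:
            dec = num_blocks - 1 - enc  # mirrored position in the decoder
            pairs.append((enc, dec))
    return pairs


def concat_extra_channels(pairs, channels):
    """Extra input channels added by concatenating each skipped feature map."""
    return len(pairs) * channels


dense = skip_pairs(6, every=1)
sparse = skip_pairs(6, every=2)
print(len(dense), len(sparse))  # 6 3
```

Halving the number of skips halves the number of concatenations, and with it the extra channels that the following convolutions have to process.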

Conditioning Enhancements

Because each stage of the encoder‑decoder operates at a different channel dimension, a single shared condition embedding is suboptimal. The authors introduce Stage‑Specific Embeddings, aligning the embedding size with each stage’s feature dimension; this adds only 14 M parameters (≈2 % of model size) and 12 M FLOPs.
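As a rough illustration of the idea, the sketch below keeps one class‑embedding table per stage, each as wide as that stage's features. The class name, the stage widths, and the initialisation are assumptions made for the example; the paper's actual embedding layers (including how timesteps are embedded) are not described in this summary.

```python
import random


class StageEmbeddings:
    """One class-embedding table per stage, sized to that stage's channel width."""

    def __init__(self, num_classes, stage_channels, seed=0):
        rng = random.Random(seed)
        # tables[s][label] is a vector of length stage_channels[s].
        self.tables = [
            [[rng.gauss(0.0, 0.02) for _ in range(c)] for _ in range(num_classes)]
            for c in stage_channels
        ]

    def lookup(self, stage, label):
        """Embedding vector for `label`, already matching the stage's width."""
        return self.tables[stage][label]

    def num_parameters(self):
        return sum(len(table) * len(table[0]) for table in self.tables)


emb = StageEmbeddings(num_classes=10, stage_channels=[8, 16, 32])
print(len(emb.lookup(stage=1, label=3)))  # 16
```

Because each lookup already has the right width, no extra projection is needed to match the condition to a stage's features; the cost is one additional table per stage.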

They also investigate where to inject conditioning. Two common strategies exist: early injection via LayerNorm (DiT) and mid‑block injection (ADM). Empirical results indicate that inserting the condition into the second convolution of each block yields the best trade‑off between efficiency and quality.

Finally, the model adopts Conditional Gating (AdaLN) from DiT, adding a channel‑wise gate vector that dynamically modulates features, and replaces SiLU with GELU following ConvNeXt, further improving performance.
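The gating idea can be sketched for a single spatial position as a modulate‑then‑gate residual, in the style DiT uses for its adaptive layer norm. Normalisation is omitted, the `branch` argument stands in for the block's convolutions, and all names are hypothetical; how DiC wires this into its blocks is not shown in this summary.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gated_residual(x, branch, shift, scale, gate):
    """Per-channel conditional gating for one feature vector.

    x, shift, scale, gate are lists with one entry per channel;
    branch maps a list to a list of the same length.
    out_c = x_c + gate_c * branch(x * (1 + scale) + shift)_c
    """
    modulated = [xc * (1.0 + sc) + sh for xc, sc, sh in zip(x, scale, shift)]
    y = branch(modulated)
    return [xc + gc * yc for xc, gc, yc in zip(x, gate, y)]


x = [0.5, -1.0, 2.0]
act = lambda v: [gelu(t) for t in v]  # stand-in for the block's layers
closed = gated_residual(x, act, [0.0] * 3, [0.0] * 3, [0.0] * 3)
print(closed == x)  # True: a zero gate leaves the input unchanged
```

In a conditional model the shift, scale, and gate vectors would be predicted from the condition embedding, so the condition decides channel by channel how much of the branch output is let through.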

Model Variants and Training Setup

Four model scales are defined. DiC‑S, DiC‑B, and DiC‑XL are aligned in FLOPs and parameter count with DiT‑S/2, DiT‑B/2, and DiT‑XL/2 respectively, and a larger DiC‑H is added for scaling studies. All models use a global batch size of 256, learning rate 1e‑4, weight decay 0, and are trained for up to 400 K iterations (with longer runs for scaling experiments). Winograd‑optimized FLOPs are reported alongside raw counts.
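The gap between raw and Winograd‑optimized counts can be approximated for a single layer as follows. The sketch assumes the common F(2×2, 3×3) variant, which needs 16 multiplications per 2×2 output tile instead of 36, and ignores the additions in the transforms; whether the paper uses exactly this accounting is an assumption, and the layer shape is made up for the example.

```python
def conv3x3_macs(height, width, c_in, c_out):
    """Multiply-accumulates of a dense 3x3 stride-1 convolution with same padding."""
    return height * width * c_in * c_out * 9


def winograd_macs(raw_macs, tile_reduction=36 / 16):
    """Multiplications left after Winograd F(2x2, 3x3): 16 instead of 36 per tile."""
    return raw_macs / tile_reduction


raw = conv3x3_macs(height=32, width=32, c_in=64, c_out=64)
fast = winograd_macs(raw)
print(raw, int(fast), raw / fast)  # 37748736 16777216 2.25
```

The 2.25× factor applies only to the 3×3 stride‑1 convolutions themselves, so the saving for a whole model depends on how much of its compute sits in such layers.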

Experimental Results

Across all scales, DiC outperforms the corresponding DiT baseline. For example, DiC‑S reduces FID from 67.40 to 58.68 and raises IS from 20.44 to 25.82; DiC‑B improves FID from 42.84 to 32.33 and IS from 33.66 to 48.72. DiC‑XL achieves FID 13.11 (vs. 20.05 for DiT‑XL/2) and IS 100.15, while maintaining a throughput of 313.7 images/s.
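To put the FID figures quoted above on a common footing, the relative reduction at each scale can be computed directly. The snippet uses only the numbers from this article; the dictionary layout and function name are illustrative.

```python
# (DiT baseline FID, DiC FID) per model scale, as quoted in the text.
fid = {
    "S": (67.40, 58.68),
    "B": (42.84, 32.33),
    "XL": (20.05, 13.11),
}


def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


for scale, (dit, dic) in fid.items():
    print(f"{scale}: {100 * relative_reduction(dit, dic):.1f}% lower FID")
# S: 12.9% lower FID
# B: 24.5% lower FID
# XL: 34.6% lower FID
```

On these numbers the relative gap widens with model size, from roughly 13 % at the small scale to roughly 35 % at XL.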

On ImageNet 512×512, DiC‑XL uses 464.3 G FLOPs (after Winograd) versus 524.7 G for DiT‑XL/2, yet delivers better FID and IS. DiC‑H balances size and speed, reaching FID 11.36, IS 106.52, and a throughput of 160.8 images/s. Together, these results indicate that a pure‑conv design can match or exceed diffusion Transformers in both quality and efficiency.

Scaling and Convergence

Training curves show rapid convergence: DiC‑H reaches FID 9.73 after 600 K steps and its best FID 8.96 at 800 K steps, matching DiT‑XL/2 performance with fewer resources. Additional experiments on conditional generation (CFG = 1.5) confirm that DiC retains strong conditional synthesis capabilities.

Conclusion

The study shows that a carefully engineered 3×3 convolutional backbone, augmented with sparse skip connections, stage‑specific conditioning, and adaptive gating, can deliver state‑of‑the‑art image synthesis quality while offering significant speed advantages over self‑attention‑heavy diffusion Transformers.

[Figure: Comparison of mainstream diffusion architectures]
