Can Convolution Replace Self‑Attention for Efficient Image Super‑Resolution?

The paper proposes ESC, a lightweight image super‑resolution network that emulates Transformer self‑attention using large‑kernel and dynamic convolutions, achieving higher PSNR with significantly lower latency and memory consumption, making it suitable for mobile deployment.

AI Frontier Lectures

May 6, 2025

Introduction

Transformer‑based models excel at image super‑resolution (SR), but their self‑attention layers impose heavy memory and latency costs that hinder mobile deployment. A team at the University of Seoul observed that self‑attention features are highly similar across layers, which prompted them to replace most attention layers with a convolutional module called ConvAttn.

Problem Background and Related Work

SR aims to reconstruct high‑resolution images from low‑resolution inputs, a core computer‑vision challenge. Existing lightweight approaches such as SwinIR‑light reduce FLOPs and parameter counts but still incur high latency and memory usage. Prior optimizations focus on local window attention, channel attention, or state‑space models.

Key Terminology

ConvAttn (Convolutional Attention) : a module that mimics the long‑range modeling and input‑dependent weighting of self‑attention using a shared large‑kernel convolution (13×13) and a dynamic 3×3 convolution.

Flash Attention : a memory‑efficient attention computation that avoids storing the full attention matrix.

CKA similarity : Centered Kernel Alignment, measuring feature similarity across layers.
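
To make the CKA term concrete, here is a minimal sketch of the linear variant, assuming activations are arranged as one row per sample and one column per feature. The source does not say which CKA variant was used, so the linear form and the function names are illustrative assumptions.

```python
import math

def center_columns(X):
    """Subtract each column's mean so every feature is zero-centered over samples."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-major (samples x features) matrices."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(X[k][i] * Y[k][j] for k in range(len(X)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features.

    Assumes neither input is constant across samples (otherwise the denominator is zero).
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return numerator / denominator

# Toy activations: 4 samples, 2 features (values are made up for illustration).
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
layer_b = [[3.0 * v for v in row] for row in layer_a]  # same features, rescaled
print(linear_cka(layer_a, layer_b))
```

A value near 1 means the two layers carry nearly the same representation up to rotation and isotropic scaling; lower values indicate more diverse features.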

Core Design of ESC

The ESC network is built on three innovations:

Hierarchical attention strategy : each ESCBlock keeps a single self‑attention layer as its first layer, while all subsequent layers are replaced by ConvAttn.

Dual‑path convolution : ConvAttn combines a shared 13×13 kernel (LK) for long‑range context with a dynamically generated 3×3 kernel (DK) for input‑specific weighting.

Flash Attention optimization : the retained self‑attention layers use Flash Attention with an enlarged 32×32 window.
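
The memory saving behind Flash Attention comes from normalizing the softmax incrementally instead of storing every score. The sketch below shows only that running‑softmax idea for a single query in plain Python; it is not the fused GPU kernel and not ESC's implementation, and the function names and block size are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_full(q, keys, values):
    """Reference: builds the whole score row before normalizing."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def attention_blockwise(q, keys, values, block=2):
    """Same output, visiting keys/values block by block with a running softmax."""
    scale = 1.0 / math.sqrt(len(q))
    run_max, run_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) * scale for k in keys[start:start + block]]
        new_max = max(run_max, max(scores))
        correction = math.exp(run_max - new_max)  # rescales earlier partial sums
        run_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            run_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]

# Made-up toy data: one 3-dim query, five keys, five 2-dim values.
q = [0.3, -1.2, 0.7]
keys = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [2.0, 2.0, -1.0], [0.0, 0.1, 0.3], [-1.0, 1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_full(q, keys, values))
print(attention_blockwise(q, keys, values, block=2))
```

Only one block of scores is alive at a time, which is why a larger attention window (such as 32×32) becomes affordable when the computation is organized this way.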

Architecture Overview

ESC consists of four components: shallow feature extraction, deep feature extraction (multiple ESCBlocks), image‑level skip connections, and an up‑sampling module. Each ESCBlock follows a "1 self‑attention + M ConvAttn" pattern.
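
The "1 self‑attention + M ConvAttn" pattern can be written down as a simple layer plan. This is only a sketch of the ordering; the block count and M used below are hypothetical values, not the paper's configuration.

```python
def esc_block_layout(num_conv_attn):
    """Layer types inside one ESCBlock: one self-attention, then M ConvAttn layers."""
    return ["self_attention"] + ["conv_attn"] * num_conv_attn

def deep_feature_layout(num_blocks, num_conv_attn):
    """Layer plan for the deep feature extractor: a stack of identical ESCBlocks."""
    return [esc_block_layout(num_conv_attn) for _ in range(num_blocks)]

# Hypothetical sizes chosen only for illustration.
layout = deep_feature_layout(num_blocks=3, num_conv_attn=4)
for index, block in enumerate(layout):
    print(index, block)
```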

Mathematical Formulation of ConvAttn

The module splits the input feature F along the channel dimension into two parts: F_att ∈ ℝ^{H×W×16} and F_idt ∈ ℝ^{H×W×(C‑16)}. Global average pooling followed by two 1×1 convolutions generates the dynamic kernel DK ∈ ℝ^{3×3×1×16}. The output is computed as F_res = (F_att ⊛ DK) + (F_att ⊛ LK), where LK is the shared 13×13 kernel. F_res is then concatenated with F_idt and fused by a 1×1 convolution.
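
The formulation above can be traced step by step in a small pure‑Python sketch. Several details are assumptions rather than facts from the source: LK is treated as depthwise (one filter per channel), the pooled vector is taken from F_att, a ReLU sits between the two 1×1 convolutions, and the toy sizes (2 attention channels, a 5×5 stand‑in for the 13×13 kernel) are chosen only so the loops run quickly.

```python
import random

def depthwise_conv(x, kernels):
    """Zero-padded 'same' depthwise convolution; x[c][h][w], kernels[c][kh][kw]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                s = 0.0
                for di in range(k):
                    for dj in range(k):
                        y, z = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= z < W:
                            s += x[c][y][z] * kernels[c][di][dj]
                out[c][i][j] = s
    return out

def linear(vec, weight):
    """A 1x1 convolution on a pooled vector is a matrix-vector product; weight[out][in]."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def pointwise_conv(x, weight):
    """1x1 convolution mixing channels at every pixel; weight[out][in]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weight[o][c] * x[c][i][j] for c in range(C)) for j in range(W)]
             for i in range(H)] for o in range(len(weight))]

def conv_attn(feat, att_ch, lk, w1, w2, w_fuse):
    """F_res = (F_att * DK) + (F_att * LK), then concat with F_idt and fuse."""
    f_att, f_idt = feat[:att_ch], feat[att_ch:]
    # Global average pooling over F_att (assumption: pooled from F_att, not all of F).
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in f_att]
    hidden = [max(0.0, h) for h in linear(pooled, w1)]  # ReLU is an assumption
    flat = linear(hidden, w2)                           # att_ch * 9 values
    dk = [[flat[c * 9 + r * 3: c * 9 + r * 3 + 3] for r in range(3)]
          for c in range(att_ch)]                       # one 3x3 kernel per channel
    dyn = depthwise_conv(f_att, dk)                     # input-dependent path
    lrg = depthwise_conv(f_att, lk)                     # shared large-kernel path
    f_res = [[[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)]
             for ca, cb in zip(dyn, lrg)]
    return pointwise_conv(f_res + f_idt, w_fuse)

# Toy sizes for illustration only (the text uses 16 attention channels and 13x13).
random.seed(0)
C, H, W, ATT, K, HID = 4, 6, 6, 2, 5, 3
rnd = lambda: random.uniform(-1.0, 1.0)
feat = [[[rnd() for _ in range(W)] for _ in range(H)] for _ in range(C)]
lk = [[[rnd() for _ in range(K)] for _ in range(K)] for _ in range(ATT)]
w1 = [[rnd() for _ in range(ATT)] for _ in range(HID)]
w2 = [[rnd() for _ in range(HID)] for _ in range(ATT * 9)]
w_fuse = [[rnd() for _ in range(C)] for _ in range(C)]
out = conv_attn(feat, ATT, lk, w1, w2, w_fuse)
print(len(out), len(out[0]), len(out[0][0]))  # 4 6 6
```

Because DK is regenerated from each input while LK is a fixed learned filter, the first path supplies the input‑dependent weighting and the second supplies the wide receptive field.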

Data Preparation and Experimental Design

Experiments cover three scenarios:

Classic SR : training on DIV2K, testing on Set5, Set14, B100, Urban100, Manga109.

Arbitrary‑scale SR : using the LTE up‑sampler to evaluate unseen scales (e.g., ×12).

Real‑world SR : training with RealESRGAN‑generated degradations and testing on RealSRSet.

Ablation studies examine the necessity of the shared large kernel, the dynamic kernel, and the Flash Attention window size.

Results

Qualitatively, ESC restores finer textures than competing lightweight models, with reported PSNR gains of up to 0.29 dB. On Urban100 (×4), ESC is reported to reach 33.86 dB PSNR using only 627 MB of memory, 31 % less than SwinIR‑light, while improving PSNR by 1.1 dB.
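
For readers less familiar with the metric, the following sketch shows how PSNR is defined and what a gain in dB means in terms of mean squared error. It works on flat lists of pixel values and omits the luminance conversion and border cropping that SR benchmarks commonly apply, so it is an illustration of the formula rather than the evaluation protocol used in the paper.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """MSE ratio implied by a PSNR gain: new_mse / old_mse = 10^(-gain / 10)."""
    return 10.0 ** (-gain_db / 10.0)

print(round(mse_ratio_for_gain(0.29), 4))
print(round(mse_ratio_for_gain(1.1), 4))
```

A 0.29 dB gain corresponds to roughly a 6.5 % reduction in mean squared error, which is why small dB differences are treated as meaningful in SR comparisons.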

Feature similarity analysis shows that using both LK and DK reduces inter‑layer CKA similarity from 0.89 to 0.83, indicating increased feature diversity. Ablation results reveal:

Removing all self‑attention raises latency by 8 %.

Omitting the dynamic kernel drops PSNR by 0.09 dB.

Reducing the Flash Attention window from 32 to 16 decreases PSNR by 0.41 dB.

Q&A (Three Common Questions)

How is the dynamic kernel generated? Global average pooling followed by two 1×1 convolutions produces a 3×3 kernel with only ~0.3 K parameters.

Why a 13×13 kernel? It empirically balances receptive field and computation, outperforming 9×9 by 0.2 dB and saving 35 % compute versus 17×17.

Deployment tips? Use the ESC‑FP variant, which replaces standard convolutions with depth‑wise separable ones for real‑time mobile inference.
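
The saving from depth‑wise separable convolutions follows from a simple parameter count. The sketch below compares the two layer types with biases ignored; the channel width of 64 is a hypothetical example, not a figure from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise convolution followed by a 1x1 pointwise one."""
    return c_in * k * k + c_in * c_out

# Hypothetical layer: 64 input channels, 64 output channels, 3x3 kernel.
full = standard_conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
print(full, separable, round(separable / full, 4))  # 36864 4672 0.1267
```

The ratio equals 1/c_out + 1/k², so the relative saving grows with both the channel width and the kernel size.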

Conclusion and Future Directions

The study demonstrates that carefully designed convolutions can replicate Transformer self‑attention benefits while drastically reducing resource demands, opening new avenues for lightweight vision Transformers. Future work may extend ConvAttn to video SR, explore quantization of dynamic kernels, and apply neural architecture search to optimize large‑kernel sizes.
