UniConvNet: Expanding Effective Receptive Field for a SOTA CNN Vision Backbone (ICCV 2025)

UniConvNet introduces a three‑layer receptive‑field aggregator that combines small kernels to enlarge the effective receptive field while preserving its Gaussian distribution, achieving state‑of‑the‑art results on ImageNet‑1K, COCO2017 and ADE20K with only 30M parameters and 5.1G FLOPs.


01 | Introduction

Convolutional networks with a large effective receptive field (ERF) have demonstrated strong performance, but they are limited by high parameter counts, high FLOP costs, and disruption of the ERF's asymptotic Gaussian distribution (AGD). UniConvNet proposes a different paradigm: instead of using a single huge kernel, it strategically combines smaller kernels (e.g., 7×7, 9×9, 11×11) to expand the ERF while keeping the AGD intact, a direction that remains largely underexplored and that yields a more efficient design.

02 | Architecture

UniConvNet introduces a three‑layer Receptive‑Field Aggregator (RFA) and a layer operator (LO) designed from the receptive‑field perspective. The input tensor is first split along the channel dimension into N+1 parts, forming heads A1 and H1…HN. Head A1 has shape B×C/(N+1)×H×W (batch, channels, height, width).

Each head passes through a 1×1 convolution for initial projection before entering the LO pipeline. A1 is processed by LO 1, producing A2, whose channel dimension grows from C/(N+1) to 2C/(N+1). Subsequent LOs (n ∈ [2, N]) double the channels at each stage, yielding 2ⁿC/(N+1) channels after LO n. The remaining heads H1…HN are fed into the corresponding LO n, where they interact with the matching head An.
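The split-and-double bookkeeping above can be sketched in a few lines. This is an illustrative shape check only, assuming plain NumPy tensors; the names `split_heads` and `stage_channels` are hypothetical, and the paper's actual pipeline is a learned convolutional module, not simple array splitting.

```python
import numpy as np

def split_heads(x, N):
    """Split a B x C x H x W tensor into N+1 equal channel groups: A1, H1..HN."""
    B, C, H, W = x.shape
    assert C % (N + 1) == 0, "C must be divisible by N + 1"
    parts = np.split(x, N + 1, axis=1)
    return parts[0], parts[1:]          # A1 and the list [H1, ..., HN]

def stage_channels(C, N, n):
    """Channel width after LO n: 2^n * C / (N+1), per the doubling rule above."""
    return (2 ** n) * C // (N + 1)

# Toy input: batch 2, C = 96 channels, 56x56 spatial, N = 2 (three groups).
x = np.zeros((2, 96, 56, 56), dtype=np.float32)
A1, Hs = split_heads(x, N=2)
print(A1.shape)                                     # (2, 32, 56, 56)
print([stage_channels(96, 2, n) for n in (1, 2)])   # [64, 128]
```

Running this confirms the pyramid: each group starts at C/(N+1) = 32 channels and the A-branch widens to 64 and then 128 as it passes through successive LOs.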

The RFA adopts a pyramid‑shaped channel increase, which reduces both parameters and FLOPs compared with a standard direct‑input‑direct‑output convolution. Adding more layers (larger N) enables processing of higher‑resolution images, offering an effective alternative to the traditional workflow of training on low‑resolution images and then fine‑tuning at high resolution.
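A back‑of‑the‑envelope calculation shows why combining small kernels is cheaper than one huge kernel. This is an illustration under a depthwise‑convolution assumption, not the paper's exact accounting: stacked kernels compose their receptive fields as 1 + Σ(kᵢ − 1), so kernels of sizes 7, 9 and 11 reach the same nominal 25×25 receptive field as a single 25×25 kernel.

```python
def depthwise_params(channels, kernels):
    """Parameter count of stacked depthwise convs (one k*k filter per channel)."""
    return channels * sum(k * k for k in kernels)

def stacked_rf(kernels):
    """Nominal receptive field of sequentially stacked convs (stride 1)."""
    return 1 + sum(k - 1 for k in kernels)

C = 96
stacked = depthwise_params(C, [7, 9, 11])   # 96 * (49 + 81 + 121) = 24096
single  = depthwise_params(C, [25])         # 96 * 625             = 60000
print(stacked_rf([7, 9, 11]))               # 25, same as a single 25x25 kernel
print(stacked, single)                      # stacked needs ~2.5x fewer params
```

The same reasoning applies to FLOPs, which scale with the same k² terms per output position.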

The three‑layer RFA module is plug‑and‑play and can replace any convolutional layer in existing architectures. To demonstrate its effectiveness, the authors integrate the RFA into the state‑of‑the‑art CNN InternImage. Specifically, in the residual block they replace InternImage's DCNv3 convolution with a DCNv4‑style operator, removing the softmax normalization that DCNv3 applies to its modulation weights.
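The softmax change mentioned above can be sketched in miniature. This is a minimal sketch, assuming a toy aggregation of K sampled features weighted by K modulation scalars; the function name `aggregate` and the shapes are illustrative, not the real DCNv3/DCNv4 interfaces.

```python
import numpy as np

def aggregate(values, scores, use_softmax):
    """Weighted sum of K sampled features (K x C) by K modulation scores."""
    if use_softmax:
        w = np.exp(scores - scores.max())
        w = w / w.sum()                 # DCNv3-style: weights form a convex combination
    else:
        w = scores                      # DCNv4-style: raw, unbounded weights
    return w @ values

vals = np.ones((4, 8))                  # K = 4 sampled features, C = 8 channels
scores = np.array([0.5, 1.0, 1.5, 2.0])
print(aggregate(vals, scores, True)[0])   # 1.0: softmax weights sum to 1
print(aggregate(vals, scores, False)[0])  # 5.0: unnormalized sum of scores
```

Dropping the softmax removes the constraint that the per‑point weights sum to one, which lets the aggregated response scale freely and avoids the normalization's compute and memory overhead.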

03 | Experimental Results

Extensive experiments on ImageNet‑1K, COCO2017 and ADE20K show that UniConvNet consistently outperforms the latest CNNs and Vision Transformers at comparable throughput. Notably, the UniConvNet‑T variant, with only 30M parameters and 5.1G FLOPs, achieves 84.2% top‑1 accuracy on ImageNet.

The figures below show the architecture diagram and benchmark results.

Tags: CNN, computer vision, Effective Receptive Field, UniConvNet, Vision Backbone, ICCV 2025
Written by AIWalker, a blog focused on computer vision, image processing, color science, and AI algorithms.
