How OverLoCK Redefines Vision Backbones with Dynamic Convolution
OverLoCK, a new vision backbone inspired by human top‑down attention, combines a three‑stage decomposition, dynamic ContMix convolutions and top‑down guidance to achieve state‑of‑the‑art performance on ImageNet classification, COCO detection and ADE20K segmentation while maintaining a strong trade‑off between accuracy and efficiency.
Motivation
Human visual perception first forms a coarse global impression and then focuses on details (top‑down attention). Existing vision backbones such as Swin, ConvNeXt, and VMamba use a strict pyramid hierarchy without explicit top‑down semantic guidance, which limits their ability to capture long‑range dependencies, especially at high resolutions.
Method
OverLoCK (Overview‑first‑Look‑Closely‑next ConvNet with Context‑Mixing Dynamic Kernels) replaces the strict pyramid with a Deep‑stage Decomposition Strategy (DDS) that splits the network into three sub‑models:
Base‑Net: extracts low‑ and mid‑level features using a Dilated RepConv layer, analogous to retinal processing.
Overview‑Net: quickly generates coarse high‑level semantics that serve as the first‑glance prior.
Focus‑Net: refines details under the guidance of Overview‑Net output, employing the dynamic convolution module ContMix and a gating mechanism.
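The data flow between the three sub‑models can be sketched with toy stand‑ins. This is a minimal illustration, not the authors' implementation: the pooling factors, the nearest‑neighbour upsampling and the sigmoid gate are all assumptions chosen only to make the overview‑first, look‑closely‑next ordering concrete. The real sub‑models are deep networks, and Focus‑Net uses ContMix rather than a plain gate.

```python
import numpy as np

def base_net(image):
    """Hypothetical stand-in for Base-Net: 2x average pooling as 'low/mid-level features'."""
    c, h, w = image.shape
    return image.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def overview_net(mid):
    """Hypothetical stand-in for Overview-Net: 4x pooling yields a coarse prior."""
    c, h, w = mid.shape
    return mid.reshape(c, h // 4, 4, w // 4, 4).mean(axis=(2, 4))

def focus_net(mid, prior):
    """Hypothetical stand-in for Focus-Net: the upsampled prior gates the finer features."""
    scale = mid.shape[1] // prior.shape[1]
    # Nearest-neighbour upsampling of the coarse prior to the resolution of `mid`.
    prior_up = prior.repeat(scale, axis=1).repeat(scale, axis=2)
    gate = 1.0 / (1.0 + np.exp(-prior_up))  # assumed sigmoid gate with values in (0, 1)
    return mid * gate

def overlock_forward(image):
    mid = base_net(image)            # step 1: low- and mid-level features
    prior = overview_net(mid)        # step 2: coarse first-glance semantics
    return focus_net(mid, prior)     # step 3: refinement under top-down guidance

if __name__ == "__main__":
    image = np.random.default_rng(0).random((3, 16, 16))
    print(overlock_forward(image).shape)
```

The point of the sketch is the ordering: Overview‑Net runs before Focus‑Net, so the coarse prior is available whenever the finer features are refined.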
ContMix computes an affinity map between each token and the centers of multiple regions, converts this map into dynamic kernel weights through a learnable transformation, and thereby injects global context into every kernel weight. In Focus‑Net the current feature map acts as the query while the top‑down guidance from Overview‑Net serves as the key, enabling strong global modeling even within local windows.
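The ContMix computation described above can be sketched in NumPy. This is a single‑group toy version under stated assumptions, not the code from the official repository: the function names, the projection matrices `w_q`, `w_k` and `w_d`, the scaling by the square root of the channel count, and the softmax normalization of the kernels are illustrative choices, and the guidance map is assumed to have the same channel count as the feature map.

```python
import numpy as np

def adaptive_avg_pool(x, s):
    """Average-pool a (C, H, W) map into s x s region centers (H, W divisible by s)."""
    c, h, w = x.shape
    return x.reshape(c, s, h // s, s, w // s).mean(axis=(2, 4))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def contmix_sketch(feat, guide, w_q, w_k, w_d, s, k):
    """feat, guide: (C, H, W); w_q, w_k: (C, C); w_d: (s*s, k*k); k must be odd."""
    c, h, w = feat.shape
    q = (w_q @ feat.reshape(c, -1)).T              # queries from current features: (HW, C)
    centers = adaptive_avg_pool(guide, s)          # region centers of the guidance: (C, s, s)
    key = w_k @ centers.reshape(c, -1)             # keys from top-down guidance: (C, s*s)
    affinity = q @ key / np.sqrt(c)                # token-to-region affinity: (HW, s*s)
    kernels = softmax(affinity @ w_d, axis=-1)     # one k*k kernel per token: (HW, k*k)

    # Apply each token's own kernel to its k x k neighbourhood (shared across channels).
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    kern = kernels.reshape(h, w, k, k)
    out = np.zeros_like(feat)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w] * kern[:, :, dy, dx]
    return out, kernels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c, h, w, s, k = 4, 8, 8, 2, 3
    feat = rng.standard_normal((c, h, w))
    guide = rng.standard_normal((c, h, w))
    w_q = rng.standard_normal((c, c))
    w_k = rng.standard_normal((c, c))
    w_d = rng.standard_normal((s * s, k * k))
    out, kernels = contmix_sketch(feat, guide, w_q, w_k, w_d, s, k)
    print(out.shape, kernels.shape)
```

Although each kernel only covers a k x k window, its weights depend on affinities to region centers pooled from the whole guidance map, which is how global context reaches a local operator.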
Experiments
Results on three challenging benchmarks:
ImageNet‑1K: OverLoCK‑Tiny (30M parameters) achieves 84.2% top‑1 accuracy, surpassing comparable ConvNets, Transformers and Mamba models.
COCO 2017: with Mask R‑CNN (1× schedule), OverLoCK‑S improves box AP (APb) by 0.8% over BiFormer‑B and 1.5% over MogaNet‑B; with Cascade Mask R‑CNN it outperforms PeLK‑S and UniRepLKNet‑S by 1.4% and 0.6%, respectively.
ADE20K: OverLoCK‑T gains +1.1% mIoU over MogaNet‑S, +1.7% over UniRepLKNet‑T, and +2.3% over VMamba‑T.
These results demonstrate that dynamic convolution combined with top‑down guidance retains strong inductive bias while achieving Transformer‑level global modeling.
Ablation Study
ContMix is a plug‑and‑play module. Replacing other token mixers with ContMix consistently yields higher performance, especially on high‑resolution semantic segmentation, confirming its powerful context‑mixing capability.
Visualization
Effective receptive‑field analysis shows OverLoCK attains the largest receptive field while preserving local sensitivity. Grad‑CAM visualizations illustrate that Overview‑Net provides a coarse localization which, when injected as top‑down guidance, refines Focus‑Net feature maps, mirroring human visual attention.
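An effective receptive field map is the gradient of a central output unit with respect to every input pixel. The sketch below computes it for a toy stack of linear mean filters, not for OverLoCK itself. Because the toy model is linear, the gradient can be read off from impulse responses; analysing a trained network would instead use automatic differentiation. All function names here are hypothetical.

```python
import numpy as np

def box_conv(x, k):
    """Same-padded k x k mean filter on an (H, W) map; a linear toy layer."""
    pad = k // 2
    padded = np.pad(x, pad)
    h, w = x.shape
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def toy_model(x, depth, k):
    """Stack `depth` mean filters; stands in for a purely convolutional backbone."""
    for _ in range(depth):
        x = box_conv(x, k)
    return x

def erf_map(size, depth, k):
    """Gradient of the centre output w.r.t. each input pixel, obtained via linearity."""
    centre = size // 2
    grad = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            impulse = np.zeros((size, size))
            impulse[i, j] = 1.0
            # For a linear model, d out[centre] / d in[i, j] equals the response to a unit impulse.
            grad[i, j] = toy_model(impulse, depth, k)[centre, centre]
    return grad

if __name__ == "__main__":
    shallow = erf_map(size=11, depth=1, k=3)
    deep = erf_map(size=11, depth=2, k=3)
    print(np.count_nonzero(shallow), np.count_nonzero(deep))
```

In this toy setting the field grows only by stacking layers, and the contribution of distant pixels decays quickly. That is the limitation of purely local operators that a larger effective receptive field is meant to overcome.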
Paper: https://arxiv.org/abs/2502.20087
Code repository: https://github.com/LMMMEng/OverLoCK
