
How OverLoCK Redefines Vision Backbones with Dynamic Convolution

OverLoCK, a new vision backbone inspired by human top‑down attention, combines a deep‑stage decomposition into three sub‑networks, dynamic ContMix convolutions and top‑down guidance to achieve state‑of‑the‑art performance on ImageNet classification, COCO detection and ADE20K segmentation while keeping a strong accuracy‑efficiency trade‑off.

AI Frontier Lectures

Apr 4, 2025

Motivation

Human visual perception first forms a coarse global impression and then focuses on details, a process known as top‑down attention. Existing vision backbones such as Swin, ConvNeXt, and VMamba use a strict pyramid hierarchy without explicit top‑down semantic guidance, which limits their ability to capture long‑range dependencies, especially at high resolutions.

Method

OverLoCK (Overview‑first‑Look‑Closely‑next ConvNet with Context‑Mixing Dynamic Kernels) replaces the pyramid with a Deep‑stage Decomposition Strategy (DDS) that splits the network into three sub‑models:

Base‑Net: extracts low‑ and mid‑level features using a Dilated RepConv layer, analogous to retinal processing.

Overview‑Net: quickly generates coarse high‑level semantics that serve as the first‑glance prior.

Focus‑Net: refines details under the guidance of the Overview‑Net output, employing the dynamic convolution module ContMix and a gating mechanism.
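The data flow between the three sub‑models can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions, not the authors' implementation: the function names (base_net, overview_net, focus_net), the channel widths, the pooling factors and the sigmoid gate are all hypothetical stand‑ins, and 1×1 channel mixing replaces the real Dilated RepConv and ContMix blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_pool2x(x):
    """Downsample a (C, H, W) map by 2 with average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample_nearest(x, size):
    """Nearest-neighbour upsample a (C, h, w) map to (C, size[0], size[1])."""
    c, h, w = x.shape
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return x[:, rows][:, :, cols]

def pointwise(x, weight):
    """1x1 convolution: mix channels at every spatial position."""
    return np.einsum("oc,chw->ohw", weight, x)

def base_net(image, w):
    # Hypothetical stand-in for the low/mid-level feature extractor.
    return np.maximum(pointwise(avg_pool2x(image), w), 0.0)

def overview_net(feat, w):
    # Cheap, heavily downsampled pass that yields a coarse semantic prior.
    coarse = avg_pool2x(avg_pool2x(feat))
    return np.maximum(pointwise(coarse, w), 0.0)

def focus_net(feat, prior, w_gate, w_out):
    # Bring the prior back to feature resolution and use it as a gate
    # (an assumed, simplified form of top-down guidance).
    guide = upsample_nearest(prior, feat.shape[1:])
    gate = 1.0 / (1.0 + np.exp(-pointwise(guide, w_gate)))
    return pointwise(feat * gate, w_out)

image = rng.standard_normal((3, 32, 32))
w_base = rng.standard_normal((8, 3)) * 0.1
w_over = rng.standard_normal((16, 8)) * 0.1
w_gate = rng.standard_normal((8, 16)) * 0.1
w_out = rng.standard_normal((8, 8)) * 0.1

feat = base_net(image, w_base)                   # shared low/mid-level features
prior = overview_net(feat, w_over)               # coarse first-glance prior
refined = focus_net(feat, prior, w_gate, w_out)  # detail pass guided by the prior
print(feat.shape, prior.shape, refined.shape)
```

The point of the sketch is the ordering: the coarse prior is computed before the detail pass and then fed back into it, instead of features flowing strictly bottom‑up as in a plain pyramid.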

ContMix computes an affinity map between each token and the centers of multiple regions, converts this map into a dynamic kernel through a learnable projection, and thereby injects global context into every kernel weight. In this computation, the current feature map acts as the query while the top‑down guidance from Overview‑Net serves as the key, enabling strong global modeling even within local windows.
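A minimal NumPy sketch of this idea follows. It is a simplification under assumptions rather than the official module: the function name contmix_sketch, the projection matrices w_q, w_k and w_kernel, the scaled dot‑product affinity, the softmax normalization and the choice to share one kernel across all channels are illustrative choices, and the region centers are obtained here by plain average pooling of the guidance map.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_centers(guide, s):
    """Average-pool a (C, H, W) guidance map into s*s region centers, shape (s*s, C)."""
    c, h, w = guide.shape
    pooled = guide.reshape(c, s, h // s, s, w // s).mean(axis=(2, 4))
    return pooled.reshape(c, s * s).T

def contmix_sketch(feat, guide, w_q, w_k, w_kernel, k=3, s=2):
    c, h, w = feat.shape
    q = feat.reshape(c, h * w).T @ w_q       # (HW, C): queries from the feature map
    key = region_centers(guide, s) @ w_k     # (s*s, C): keys from the guidance
    affinity = q @ key.T / np.sqrt(c)        # (HW, s*s): token-to-region affinity
    # Learnable projection turns each token's affinity row into k*k kernel weights.
    kernels = softmax(affinity @ w_kernel, axis=-1).reshape(h, w, k, k)
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(feat)
    # Each position aggregates its k x k neighbourhood with its own kernel.
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w] * kernels[:, :, dy, dx]
    return out, kernels

rng = np.random.default_rng(0)
C, H, W, K, S = 4, 8, 8, 3, 2
feat = rng.standard_normal((C, H, W))
guide = rng.standard_normal((C, 4, 4))
w_q = rng.standard_normal((C, C)) * 0.5
w_k = rng.standard_normal((C, C)) * 0.5
w_kernel = rng.standard_normal((S * S, K * K))

out, kernels = contmix_sketch(feat, guide, w_q, w_k, w_kernel, k=K, s=S)
print(out.shape, kernels.shape)
```

Although each output value is a sum over only a k×k window, the weights of that window depend on affinities to regions covering the whole guidance map, which is how global context reaches a local operator in this sketch.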

Experiments

Performance on three challenging benchmarks:

ImageNet‑1K: OverLoCK‑T (Tiny, 30M parameters) achieves 84.2% top‑1 accuracy, surpassing comparable ConvNets, Transformers and Mamba models.

COCO 2017: with Mask R‑CNN (1× schedule), OverLoCK‑S improves box AP (APb) by 0.8% over BiFormer‑B and 1.5% over MogaNet‑B; with Cascade Mask R‑CNN it outperforms PeLK‑S and UniRepLKNet‑S by 1.4% and 0.6% respectively.

ADE20K: OverLoCK‑T gains +1.1% mIoU over MogaNet‑S, +1.7% over UniRepLKNet‑T, and +2.3% over VMamba‑T.

These results demonstrate that dynamic convolution combined with top‑down guidance retains strong inductive bias while achieving Transformer‑level global modeling.

Ablation Study

ContMix is a plug‑and‑play module. Replacing other token mixers with ContMix consistently yields higher performance, especially on high‑resolution semantic segmentation, confirming its powerful context‑mixing capability.

Visualization

Effective receptive‑field analysis shows that OverLoCK attains the largest receptive field among the compared backbones while preserving local sensitivity. Grad‑CAM visualizations illustrate that Overview‑Net provides a coarse localization which, when injected as top‑down guidance, refines the Focus‑Net feature maps, mirroring human visual attention.
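For readers unfamiliar with the term, the effective receptive field (ERF) measures how strongly each input pixel influences a central output unit. The toy below is not the analysis procedure used for OverLoCK, which operates on trained networks. It assumes a purely linear stack of uniform box filters, for which the ERF is simply the impulse response, and the helper names (conv_same, erf_map, erf_side) and the 95% mass threshold are hypothetical.

```python
import numpy as np

def conv_same(x, kernel):
    """'Same' 2D cross-correlation of an (H, W) map with an odd square kernel, zero padding."""
    k = kernel.shape[0]
    pad = k // 2
    h, w = x.shape
    padded = np.pad(x, pad)
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def erf_map(num_layers, k, size=41):
    """Impulse response of num_layers stacked k x k box filters."""
    x = np.zeros((size, size))
    x[size // 2, size // 2] = 1.0
    kernel = np.full((k, k), 1.0 / (k * k))
    for _ in range(num_layers):
        x = conv_same(x, kernel)
    return x

def erf_side(erf, mass=0.95):
    """Side length of the smallest centered square holding `mass` of the total response."""
    c = erf.shape[0] // 2
    total = erf.sum()
    for r in range(c + 1):
        if erf[c - r:c + r + 1, c - r:c + r + 1].sum() >= mass * total:
            return 2 * r + 1
    return erf.shape[0]

small = erf_side(erf_map(num_layers=4, k=3))  # theoretical receptive field: 9 x 9
large = erf_side(erf_map(num_layers=4, k=7))  # theoretical receptive field: 25 x 25
print(small, large)
```

In this toy the region that carries most of the influence is smaller than the theoretical receptive field in both cases, and it grows with kernel size. Comparisons of this kind are what an ERF analysis reports for real backbones.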

Paper: https://arxiv.org/abs/2502.20087

Code repository: https://github.com/LMMMEng/OverLoCK

