
OverLoCK: How a Bio‑Inspired Three‑Stage ConvNet Beats Transformers on Vision Tasks

OverLoCK introduces a bio‑inspired deep‑stage decomposition that splits the network into Base‑Net, Overview‑Net and Focus‑Net, together with a novel Context‑Mix dynamic convolution. The paper reports state‑of‑the‑art accuracy on image classification, detection and segmentation while balancing speed and model size.

AI Frontier Lectures

May 15, 2025


Paper Overview

The paper “OverLoCK: An Overview‑first‑Look‑Closely‑next ConvNet with Context‑Mixing Dynamic Kernels” proposes a pure convolutional backbone that mimics the human visual system’s “overview‑then‑focus” mechanism. Source code is available at https://bit.ly/4bdbmdl.

Key Contributions

Deep‑Stage Decomposition Strategy (DDS): The network is split into three sub‑networks (Base‑Net, Overview‑Net and Focus‑Net), so that a lightweight overview branch can generate a semantically rich global prior that guides the deeper focus branch.

Context‑Mix Dynamic Convolution (ContMix): A token‑wise dynamic kernel is generated by computing affinities between each token and the centers of contextual regions, which models long‑range dependencies while preserving the local inductive bias of convolution.

OverLoCK Backbone: Combining DDS and ContMix yields the OverLoCK family (XT, T, S, B), whose channel widths, block counts, kernel sizes and group numbers can be adjusted to trade accuracy against efficiency.

Methodology

3.1 Deep‑Stage Decomposition

Base‑Net progressively downsamples the input image to an intermediate feature map. This map feeds both Overview‑Net, which quickly produces a low‑resolution semantic overview, and Focus‑Net, which refines the features with larger receptive fields. The overview features act as a contextual prior that is concatenated with the focus features at each block, and the prior is updated dynamically throughout the forward pass.
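
The data flow described above can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation: average pooling and random 1×1 projections stand in for learned convolution blocks, and the function names, channel widths, block count and the way the prior is split back out after each block are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def avg_pool(x, k):
    """Downsample a (C, H, W) map by k x k average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))


def project(x, out_channels):
    """Random 1x1 projection; a placeholder for a learned conv block."""
    weight = rng.standard_normal((out_channels, x.shape[0])) / np.sqrt(x.shape[0])
    return np.tensordot(weight, x, axes=1)  # (out_channels, H, W)


def upsample(x, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    return x.repeat(k, axis=1).repeat(k, axis=2)


def overlock_forward(image, num_focus_blocks=3):
    # Base-Net: progressively downsample to an intermediate feature map.
    base = project(avg_pool(image, 4), 16)
    base = project(avg_pool(base, 2), 32)

    # Overview-Net: a cheap, low-resolution semantic overview of the base map.
    overview = project(avg_pool(base, 2), 32)
    prior = upsample(overview, 2)  # align the prior with the focus resolution

    # Focus-Net: every block sees [features, prior] and updates both.
    feat = base
    for _ in range(num_focus_blocks):
        fused = np.concatenate([feat, prior], axis=0)  # channel concatenation
        fused = np.tanh(project(fused, 64))
        feat, prior = fused[:32], fused[32:]  # assumed way of updating the prior
    return overview, feat


overview, focus = overlock_forward(np.ones((3, 64, 64)))
print(overview.shape, focus.shape)
```

The point of the sketch is the wiring: Base‑Net output feeds both branches, and the overview prior re‑enters every focus block rather than being injected once.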

During ImageNet‑1K pre‑training, Overview‑Net and Focus‑Net each have their own classification head. For downstream tasks only the Focus‑Net head is used. In dense prediction tasks the model builds a four‑level feature pyramid from Base‑Net features at two resolutions and Focus‑Net features at two coarser resolutions, corresponding to the four stages of the backbone.
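
A minimal sketch of the two‑head arrangement, assuming each head is a single linear layer over globally average‑pooled features. The class name, the `pretraining` flag and the feature widths are hypothetical and chosen only for illustration; the article does not describe how the heads are built.

```python
import numpy as np

rng = np.random.default_rng(0)


class DualHeadClassifier:
    """Two linear heads over globally pooled features (illustrative only)."""

    def __init__(self, overview_dim, focus_dim, num_classes):
        # Small random weights stand in for trained parameters.
        self.w_overview = rng.standard_normal((overview_dim, num_classes)) * 0.01
        self.w_focus = rng.standard_normal((focus_dim, num_classes)) * 0.01

    def __call__(self, overview_feat, focus_feat, pretraining):
        # Global average pooling turns a (C, H, W) map into a C-dim vector.
        focus_logits = focus_feat.mean(axis=(1, 2)) @ self.w_focus
        if not pretraining:
            return focus_logits  # outside pre-training: Focus-Net head only
        overview_logits = overview_feat.mean(axis=(1, 2)) @ self.w_overview
        return overview_logits, focus_logits


heads = DualHeadClassifier(overview_dim=256, focus_dim=336, num_classes=1000)
overview_feat = np.ones((256, 7, 7))
focus_feat = np.ones((336, 7, 7))
both = heads(overview_feat, focus_feat, pretraining=True)
only_focus = heads(overview_feat, focus_feat, pretraining=False)
print(len(both), only_focus.shape)
```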

3.2 Context‑Mix Dynamic Convolution

An input feature map X is projected by 1×1 convolutions and reshaped into two branches: a token branch and a context branch. Both are divided into G groups (analogous to the heads of multi‑head attention). Within each group, a token‑wise affinity matrix A = T·Cᵀ is computed by matrix multiplication, where T and C are the flattened token and context tensors. A learnable linear layer aggregates the affinities, and a softmax normalizes the result. Each row of the normalized matrix is then reshaped into a convolution kernel, so every token (spatial position) gets its own kernel. Channels within the same group share that kernel, and applying the dynamic kernels to the feature map lets each token incorporate global context.
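
The steps above can be traced in NumPy as a rough sketch. Several details are assumptions rather than statements from the article: the context tokens are obtained by average‑pooling X to a `region` × `region` grid of region centers, the linear layer maps the `region²` affinities to `ksize²` kernel weights, the kernels are applied to X itself with zero padding, and all weights are random. The function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def contmix(x, w_token, w_context, w_agg, groups, region, ksize):
    """Toy context-mixing dynamic convolution on a (C, H, W) feature map."""
    c, h, w = x.shape
    d = c // groups  # channels per group

    # Token branch: a 1x1 conv is a per-pixel linear map; flatten to (G, HW, d).
    tokens = (w_token @ x.reshape(c, h * w)).reshape(groups, d, h * w)
    tokens = tokens.transpose(0, 2, 1)

    # Context branch: pool to region centers, project, flatten to (G, S*S, d).
    pooled = x.reshape(c, region, h // region, region, w // region).mean(axis=(2, 4))
    context = (w_context @ pooled.reshape(c, region * region))
    context = context.reshape(groups, d, region * region).transpose(0, 2, 1)

    # Affinity A = T . C^T per group: (G, HW, S*S).
    affinity = tokens @ context.transpose(0, 2, 1)

    # Linear aggregation to K*K weights, softmax, one K x K kernel per token.
    kernels = softmax(affinity @ w_agg, axis=-1)
    kernels = kernels.reshape(groups, h, w, ksize, ksize)

    # Apply the kernels: channels in the same group share a token's kernel.
    pad = ksize // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c, h, w))
    for i in range(ksize):
        for j in range(ksize):
            window = xp[:, i:i + h, j:j + w].reshape(groups, d, h, w)
            out += (window * kernels[:, None, :, :, i, j]).reshape(c, h, w)
    return out


rng = np.random.default_rng(0)
C, H, W, G, S, K = 8, 8, 8, 2, 4, 3
out = contmix(
    rng.standard_normal((C, H, W)),
    rng.standard_normal((C, C)),       # token projection
    rng.standard_normal((C, C)),       # context projection
    rng.standard_normal((S * S, K * K)),  # affinity aggregation
    groups=G, region=S, ksize=K,
)
print(out.shape)
```

Although each kernel only covers a K × K window, its weights depend on affinities to every region center, which is how a local convolution ends up carrying global context.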

Network Architecture

The OverLoCK family contains four variants: XT (extra‑tiny), T (tiny), S (small) and B (base). Model size is controlled by four hyper‑parameters: channel numbers, block numbers, kernel sizes and group numbers. For example, OverLoCK‑XT uses channels {[56,112,256],[256],[256,336]} and four groups in the focus stages, along with its own kernel configuration.
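
One way to hold these hyper‑parameters in code is a small config object. The sketch below is hypothetical: the class and field names are invented for illustration, and reading the three bracketed channel lists as Base‑Net, Overview‑Net and Focus‑Net widths (in that order) is an assumption. Block counts and kernel sizes are not given in this article, so they are left unset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OverLoCKConfig:
    """Hypothetical container for one OverLoCK variant's hyper-parameters."""

    base_channels: Tuple[int, ...]
    overview_channels: Tuple[int, ...]
    focus_channels: Tuple[int, ...]
    focus_groups: int
    # Not stated in this article; left unset rather than guessed.
    block_counts: Optional[Tuple[int, ...]] = None
    kernel_sizes: Optional[Tuple[int, ...]] = None

    def __post_init__(self):
        # Channels in a group share one dynamic kernel, so every focus-stage
        # width must split evenly into the configured number of groups.
        for width in self.focus_channels:
            if width % self.focus_groups != 0:
                raise ValueError(
                    f"{width} channels do not split into {self.focus_groups} groups"
                )


# Values taken from the OverLoCK-XT example in the text above.
OVERLOCK_XT = OverLoCKConfig(
    base_channels=(56, 112, 256),
    overview_channels=(256,),
    focus_channels=(256, 336),
    focus_groups=4,
)

# Channels per group in each focus stage.
print([w // OVERLOCK_XT.focus_groups for w in OVERLOCK_XT.focus_channels])  # [64, 84]
```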

Experiments

OverLoCK is pretrained on ImageNet‑1K with separate classification heads on Overview‑Net and Focus‑Net; only the Focus‑Net is used for downstream tasks. The model is evaluated on image classification, object detection and semantic segmentation. The paper reports that OverLoCK surpasses recent ConvNet, Transformer and Mamba‑based backbones on these benchmarks while offering a better speed‑accuracy balance.

Conclusion

The study demonstrates that a pure ConvNet equipped with a biologically inspired stage decomposition and dynamic, context‑mixing kernels can achieve state‑of‑the‑art performance on a wide range of vision tasks, offering a compelling alternative to attention‑based architectures.
