DefMamba: A Deformable Multi‑Scale Visual Foundation Model that Boosts Vision Tasks
DefMamba introduces a multi‑scale backbone, deformable Mamba modules, and a dynamic scanning strategy to preserve image spatial structure, achieving state‑of‑the‑art performance on image classification, object detection, and semantic segmentation benchmarks.
Introduction
State‑space models (SSMs) such as S4 and Mamba provide linear‑time sequence modeling, but visual Mamba variants typically flatten images with a fixed scan order, which discards spatial structure. DefMamba addresses this limitation by introducing a multi‑scale backbone, Deformable Mamba (DM) blocks, and a Deformable Scan (DS) strategy that dynamically moves reference points and adjusts token order based on image content.
Related Work
Convolutional networks (e.g., RegNet, ConvNeXt) are constrained by limited receptive fields, while Transformers achieve global aggregation at quadratic computational cost. SSMs reduce complexity to O(N) by propagating a recurrent hidden state, yet their content-agnostic state updates limit content-based, long-range dependency modeling. Prior visual Mamba methods (ViM, VMamba, PlainMamba, MSVMamba, GrootV, QuadMamba) either keep a fixed scan path or adapt only the window granularity, and thereby lose structural information.
Method
Preliminaries
SSMs model the continuous-time linear system \(\dot{x}=Ax+Bu,\ y=Cx\) and discretize it via zero-order hold, yielding a linear-time sequence-to-sequence mapping. Mamba further conditions the step size \(\Delta\) and the projections \(B\) and \(C\) on the input through a selective scan mechanism (S6), enabling content-aware state updates.
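Concretely, with step size \(\Delta\), zero-order hold gives the discrete parameters and recurrence used throughout the S4/Mamba line of work:

\[
\bar{A}=\exp(\Delta A),\qquad
\bar{B}=(\Delta A)^{-1}\bigl(\exp(\Delta A)-I\bigr)\Delta B,\qquad
x_{k}=\bar{A}x_{k-1}+\bar{B}u_{k},\quad y_{k}=Cx_{k}.
\]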
Overall Architecture
DefMamba follows a four-stage hierarchical backbone. An \(H\times W\) input image is first processed by a patch-embedding layer into a \(C\times\frac{H}{4}\times\frac{W}{4}\) feature map. The four stages produce representations at resolutions \(\frac{H}{4}\times\frac{W}{4}\), \(\frac{H}{8}\times\frac{W}{8}\), \(\frac{H}{16}\times\frac{W}{16}\), and \(\frac{H}{32}\times\frac{W}{32}\), each consisting of a stack of DM blocks followed by a down-sampling layer (except the last stage). After the final stage, global average pooling feeds a classification head. A DM block mirrors a Transformer block: two LayerNorms, a Deformable State-Space Model (DSSM), a feed-forward network, and residual connections.
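As a rough illustration of this layout, here is a minimal PyTorch sketch. The module names, widths, and depths are illustrative assumptions, and the DSSM is stubbed out (see the next section), so this is not the released implementation:

```python
import torch
import torch.nn as nn

class DMBlock(nn.Module):
    """Transformer-style block: LN -> DSSM -> residual, then LN -> FFN -> residual."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.dssm = nn.Identity()   # placeholder for the Deformable SSM (next section)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):           # x: (B, N, C) token sequence
        x = x + self.dssm(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

class DefMambaBackbone(nn.Module):
    """Four stages at strides 4/8/16/32; dims and depths are illustrative."""
    def __init__(self, dims=(64, 128, 256, 512), depths=(2, 2, 6, 2)):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        self.stages, self.downsamples = nn.ModuleList(), nn.ModuleList()
        for i, (d, n) in enumerate(zip(dims, depths)):
            self.stages.append(nn.Sequential(*[DMBlock(d) for _ in range(n)]))
            if i < 3:  # no down-sampling after the last stage
                self.downsamples.append(nn.Conv2d(d, dims[i + 1], 2, stride=2))

    def forward(self, x):           # x: (B, 3, H, W)
        x = self.patch_embed(x)     # (B, C, H/4, W/4)
        for i, stage in enumerate(self.stages):
            B, C, H, W = x.shape
            t = stage(x.flatten(2).transpose(1, 2))   # run blocks on tokens
            x = t.transpose(1, 2).reshape(B, C, H, W)
            if i < 3:
                x = self.downsamples[i](x)            # halve resolution
        return x.mean(dim=(2, 3))   # global average pool for the classification head
```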
Deformable State‑Space Model
The DSSM replaces the 1-D convolution in vanilla Mamba with a depthwise convolution and adds a deformable branch. Given input \(X\in\mathbb{R}^{C\times H\times W}\), a sub-network applies a depthwise convolution, GELU, LayerNorm, and a 1-D convolution to produce a three-channel offset field \(\Delta=[\Delta_{x},\Delta_{y},\Delta_{t}]\), clamped with a tanh and normalized by \(W\) and \(H\). The first two channels shift each normalized reference point \(p\in[-1,1]^2\) to a deformable point \(p' = p + \Delta_{x,y}\), where bilinear interpolation extracts the feature. The third channel offsets the token index, and sorting the adjusted indices yields a new content-dependent 1-D scan order. A learnable relative-position bias matrix (down-sampled to reduce parameters) compensates for the shifted positions. Because sorting is non-differentiable, gradients are averaged across the sequence dimension and duplicated to approximate the gradient of the scan-order shift.
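A minimal PyTorch sketch of the deformable-scan idea follows, assuming a simplified offset head and omitting the selective-scan core and the relative-position bias; all names (e.g., DeformableScan, offset_net) and the token-index scaling are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableScan(nn.Module):
    """Predicts per-token offsets (dx, dy, dt), samples features at the shifted
    points, and sorts tokens by shifted indices for a content-aware scan order."""
    def __init__(self, dim):
        super().__init__()
        # Offset head (simplified): depthwise conv -> GELU -> pointwise conv to 3 channels.
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise
            nn.GELU(),
            nn.Conv2d(dim, 3, 1),                           # (dx, dy, dt)
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        offsets = torch.tanh(self.offset_net(x))            # clamp to [-1, 1]
        d_xy = offsets[:, :2]                               # point offsets
        d_t = offsets[:, 2].flatten(1)                      # token-index offsets, (B, N)

        # Reference points on a regular grid in [-1, 1]^2.
        ys = torch.linspace(-1, 1, H, device=x.device)
        xs = torch.linspace(-1, 1, W, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).expand(B, H, W, 2)

        # p' = p + Δxy, with offsets normalized by the feature-map size.
        scale = torch.tensor([2.0 / W, 2.0 / H], device=x.device)
        grid = base + d_xy.permute(0, 2, 3, 1) * scale
        sampled = F.grid_sample(x, grid, mode="bilinear", align_corners=True)

        # Deformable token order: shift each flat index, then sort.
        N = H * W
        idx = torch.arange(N, device=x.device, dtype=torch.float).expand(B, N)
        order = torch.argsort(idx + d_t * N, dim=1)         # scaling by N is illustrative
        tokens = sampled.flatten(2).transpose(1, 2)         # (B, N, C)
        tokens = torch.gather(tokens, 1, order.unsqueeze(-1).expand(-1, -1, C))
        return tokens, order  # `tokens` would then feed the selective-scan (S6) core
```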
Experiments
Image Classification
Training follows ImageNet‑1K (1.28 M images, 1 000 classes) protocols with AdamW, cosine LR schedule, 20‑epoch warm‑up, batch size 1024, weight decay 0.05, and EMA. DefMamba‑T achieves 78.6 % top‑1 accuracy, surpassing RegNetY‑800M (+2.3 %) and DeiT‑Ti (+6.4 %). Compared with recent SSMs, DefMamba‑T improves over ViM‑T (+2.5 %), LocalViM‑T (+2.4 %), and MSVMamba‑N (+1.3 %) while reducing FLOPs by 60 % versus PlainMamba‑L1. DefMamba‑S reaches 83.5 % and DefMamba‑B 84.2 % top‑1, the latter beating VMamba‑S by 0.6 %.
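For concreteness, a minimal PyTorch sketch of this recipe follows; only the stated hyper-parameters (AdamW, weight decay 0.05, 20-epoch warm-up, cosine schedule, EMA) come from the text, while the base learning rate, total epoch count, and EMA decay are assumptions:

```python
import torch

# Placeholder model; substitute the DefMamba backbone + classification head.
model = torch.nn.Linear(3 * 224 * 224, 1000)

# AdamW with weight decay 0.05 (from the text); the base LR is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# Cosine schedule with a 20-epoch linear warm-up; 300 total epochs is an
# assumption based on common ImageNet-1K recipes, not stated above.
warmup_epochs, total_epochs = 20, 300
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

# Exponential moving average (EMA) of weights; the 0.9999 decay is assumed.
ema = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda avg, p, n: 0.9999 * avg + (1 - 0.9999) * p)
```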
Object Detection
Using Mask R‑CNN on COCO 2017, DefMamba‑S attains 47.5 % box mAP and 42.8 % mask mAP, outperforming ResNet‑50, Swin‑T, ConvNeXt‑T and improving over LocalVMamba‑T and QuadMamba‑S by 0.6 points. Performance matches VMamba‑T while cutting compute by 4 %.
Semantic Segmentation
With UperNet on ADE20K, DefMamba‑S records 48.8 % single‑scale mIoU and 49.6 % multi‑scale mIoU, surpassing ResNet‑50, Swin‑T, ConvNeXt‑T and recent SSMs (GrootV‑T, QuadMamba‑S, MSVMamba‑T) by 0.3‑1.6 points.
Ablation Studies
Deformable branch impact – Adding the deformable branch (DB) improves ImageNet top-1 accuracy by 1.7 % at a cost of only 0.1 G extra FLOPs, and combining forward/backward scanning (FB-BB) with DB yields a further 1.4 % gain. Removing DB alone degrades performance, which the authors attribute to increased token jumps.
Component impact – Isolating the deformable point (DP), deformable token (DT), offset bias (OB), and channel attention (CA) shows that each adds 0.2–0.4 % accuracy for roughly 0.1 G FLOPs. DP and DT together give +1 % over the baseline, and adding OB and CA on top further validates their contributions.
Visualization
Activation maps show DefMamba focusing on object structures in crowded scenes. Deformable points migrate toward objects, and token order shifts prioritize salient tokens, demonstrating enhanced sensitivity to fine‑grained details.
Limitations
When images contain incomplete objects or multiple regularly arranged objects, the deformable scan may revert to a fixed pattern, reducing effectiveness.