DefMamba: How Deformable Scanning Boosts Vision State‑Space Models

DefMamba introduces a deformable visual state‑space model that dynamically adjusts scanning paths and reference points, preserving spatial structure and improving feature capture, achieving state‑of‑the‑art results on ImageNet classification, COCO detection, and ADE20K segmentation while reducing computational cost.

AI Frontier Lectures

Introduction

State‑space models (SSMs) such as S4 and Mamba have drawn attention for their linear computational complexity in sequence length, which makes them attractive alternatives to CNNs and Transformers for visual tasks. However, most visual Mamba variants flatten images using fixed scanning orders, which discards spatial structure and limits feature extraction.

To address this, the authors propose DefMamba, a deformable visual state‑space model that incorporates a multi‑scale backbone and a Deformable Mamba (DM) module capable of dynamically adjusting scanning paths to prioritize important information.

Related Work

Previous works (ViM, VMamba, PlainMamba, QuadMamba, GrootV, etc.) explore various scanning strategies—raster, local, continuous, or tree‑based—but either rely on fixed scan orders or only partially adapt to input content, leading to loss of spatial cues or insufficient sensitivity to object detail changes.

Method

Preliminaries

SSMs model the evolution of hidden states via continuous‑time ODEs and are discretized for sequence‑to‑sequence mapping. Mamba introduces a selective scanning mechanism that conditions parameters on the input, reducing computational cost while expanding the effective receptive field.
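As a hedged illustration of the discretization step, here is a minimal zero‑order‑hold SSM recurrence in NumPy. This is a toy diagonal‑A system for intuition only; none of the parameter values, and not the selective (input‑conditioned) parameterization, come from the paper:

```python
import numpy as np

# Zero-order-hold discretization of a 1-D state-space model:
#   h'(t) = A h(t) + B u(t),  y(t) = C h(t)
# With step size `delta` and diagonal A (stored as a vector):
#   h_k = A_bar * h_{k-1} + B_bar * u_k,  A_bar = exp(delta * A)

def ssm_scan(A, B, C, u, delta):
    """Run a discretized SSM over an input sequence u of length L."""
    N = A.shape[0]
    A_bar = np.exp(delta * A)              # discrete state matrix (diagonal)
    B_bar = (A_bar - 1.0) / A * B          # ZOH-discretized input matrix
    h = np.zeros(N)
    ys = []
    for u_k in u:                          # one step per token: linear in L
        h = A_bar * h + B_bar * u_k        # state update
        ys.append(C @ h)                   # readout
    return np.array(ys)

y = ssm_scan(A=np.array([-1.0, -2.0]),
             B=np.array([1.0, 1.0]),
             C=np.array([0.5, 0.5]),
             u=[1.0, 0.0, 0.0], delta=0.1)
```

With a stable (negative) A and an impulse input, the output decays over the sequence, matching the exponential-memory behavior the recurrence encodes.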

Overall Architecture

DefMamba adopts a generic multi‑scale backbone similar to many CNN and Transformer designs. An input image is first split into patches, producing a 2‑D feature map of spatial dimensions H×W and channel dimension C. Four hierarchical stages follow, each consisting of a stack of Deformable Mamba (DM) blocks and a down‑sampling layer (except the last stage). The final feature map is average‑pooled and fed to a classification head.
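The stage layout above can be sketched as a shape-flow computation. The patch size of 4 and the 2× spatial down-sampling with channel doubling per stage are assumptions borrowed from typical multi-scale backbones, not values quoted from the paper:

```python
# Shape-flow sketch of a four-stage hierarchical backbone. Patch size and
# down-sampling factors are illustrative assumptions, not the paper's specs.

def backbone_shapes(H, W, C, patch=4, stages=4):
    h, w, c = H // patch, W // patch, C    # patch embedding: (H/4, W/4, C)
    shapes = []
    for s in range(stages):
        shapes.append((h, w, c))           # DM blocks preserve the shape
        if s < stages - 1:                 # down-sample between stages only
            h, w, c = h // 2, w // 2, c * 2
    return shapes

shapes = backbone_shapes(224, 224, 96)
# the final map is average-pooled to a (c,) vector for the classification head
pooled_dim = shapes[-1][2]
```

Under these assumptions, a 224×224 input flows through feature maps of 56×56, 28×28, 14×14, and 7×7, with the channel width doubling at each transition.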

The DM block follows a Transformer‑style pre‑norm layout: the Deformable State‑Space Model (DSSM) and a feed‑forward network are each preceded by LayerNorm and wrapped in a residual connection.
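A minimal sketch of this wiring, assuming the standard pre-norm residual pattern; the `dssm` and `ffn` callables here are identity placeholders for the paper's actual sub-modules:

```python
import numpy as np

# Illustrative wiring of a Transformer-style block: each sub-module is
# applied to a LayerNorm-ed input and added back via a residual connection.

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dm_block(x, dssm, ffn):
    """x: (L, C) token sequence; returns a sequence of the same shape."""
    x = x + dssm(layer_norm(x))   # deformable state-space model branch
    x = x + ffn(layer_norm(x))    # feed-forward branch
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
identity = lambda t: t            # placeholder sub-modules
out = dm_block(x, dssm=identity, ffn=identity)
```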

Deformable State‑Space Model (DSSM)

The DSSM introduces two key components: deformable scanning (DS) and a deformable state‑space core. A lightweight sub‑network predicts offset vectors for reference points and token indices. Offsets are constrained to a limited range to preserve relative geometry and are applied in parallel to reduce overhead.

Reference points are generated, normalized to the range [-1, 1], and added to the learned offsets, yielding deformable points. Bilinear interpolation extracts features at these points. To compensate for the loss of positional encoding caused by point movement, a learnable relative‑offset bias matrix (down‑sampled to limit parameters) is added to the interpolated features.
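Here is a self-contained sketch of that sampling step, assuming a (y, x) coordinate convention and random offsets in place of the predicted ones; the learnable offset-bias matrix is omitted for brevity:

```python
import numpy as np

# Deformable-point sampling sketch: a regular reference grid normalized to
# [-1, 1] is shifted by offsets, and features are read off the map with
# bilinear interpolation. Offsets here are random stand-ins for the ones a
# sub-network would predict.

def bilinear_sample(feat, points):
    """feat: (H, W, C) feature map; points: (N, 2) in [-1, 1], (y, x) order."""
    H, W, _ = feat.shape
    ys = (points[:, 0] + 1) / 2 * (H - 1)    # normalized -> pixel coordinates
    xs = (points[:, 1] + 1) / 2 * (W - 1)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    wy = (ys - y0)[:, None]                  # interpolation weights
    wx = (xs - x0)[:, None]
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x0 + 1]
            + wy * (1 - wx) * feat[y0 + 1, x0]
            + wy * wx * feat[y0 + 1, x0 + 1])

H = W = 4
ref = np.stack(np.meshgrid(np.linspace(-1, 1, H),
                           np.linspace(-1, 1, W), indexing="ij"), -1).reshape(-1, 2)
offsets = 0.05 * np.random.default_rng(0).standard_normal(ref.shape)
deform = np.clip(ref + offsets, -1, 1)       # offsets kept in a limited range
feat = np.arange(H * W * 3, dtype=float).reshape(H, W, 3)
sampled = bilinear_sample(feat, deform)
```

With zero offsets the sampler returns the grid features exactly, which is a useful sanity check when wiring this into a model.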

Token indices are similarly offset. Since the offsets are fractional, a sorting algorithm ranks tokens by their offset positions to produce a new 1‑D sequence. Because sorting is non‑differentiable and would truncate gradients, gradients are averaged across the sequence dimension and copied back to the original token positions.
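The reordering step can be sketched with an argsort over shifted positions; the offsets are random stand-ins for learned values, and the gradient-averaging trick is omitted since plain NumPy has no autograd:

```python
import numpy as np

# Deformable token reordering sketch: each token index is shifted by a
# fractional offset, and argsort over the shifted positions yields the new
# 1-D scan order (a permutation of the original sequence).

rng = np.random.default_rng(1)
L = 8
positions = np.arange(L, dtype=float)      # original scan order 0..L-1
offsets = rng.uniform(-1.5, 1.5, size=L)   # fractional; learned in DefMamba
order = np.argsort(positions + offsets)    # new scan order

tokens = rng.standard_normal((L, 4))       # (L, C) token features
reordered = tokens[order]                  # sequence fed to the SSM scan
restored = np.empty_like(reordered)
restored[order] = reordered                # inverse permutation after the scan
```

Because the result is a pure permutation, the original order is recoverable exactly after the scan, which keeps downstream spatial operations well-defined.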

Experiments

Image Classification

DefMamba variants (T, S, B) are trained on ImageNet‑1K using AdamW, cosine‑annealing learning rate, and standard data augmentations. DefMamba‑T achieves 78.6% top‑1 accuracy, surpassing RegNetY‑800M and DeiT‑Ti by 2.3% and 6.4% respectively, while reducing FLOPs by 60% compared to PlainMamba‑L1.

Object Detection

Using Mask R‑CNN on COCO 2017, DefMamba‑S reaches 47.5 box mAP and 42.8 mask mAP, outperforming ResNet‑50, Swin‑T, ConvNeXt‑T and narrowing the gap with VMamba‑T while lowering computational load by 4%.

Semantic Segmentation

Integrated into UperNet on ADE20K, DefMamba‑S attains 48.8 mIoU (single‑scale) and 49.6 mIoU (multi‑scale), exceeding ResNet‑50, Swin‑T, ConvNeXt‑T and recent SSM‑based methods.

Ablation Studies

Removing the deformable branch reduces ImageNet accuracy by 1.7%, while adding it increases cost by only 0.1 G FLOPs. Component‑wise ablations (deformable points, deformable token ordering, offset bias, channel attention) each contribute 0.2‑0.4% gains; their combination yields up to a 1% boost.

Limitations

DefMamba struggles when images contain partially visible objects or multiple objects arranged in regular patterns, as the deformable scanning may revert to a fixed pattern or exhibit lazy learning due to minimal token‑wise information change.

Conclusion

DefMamba addresses the spatial‑information loss of fixed‑scan visual models by introducing a deformable scanning strategy that moves reference points and adapts scan order, resulting in superior performance across classification, detection, and segmentation benchmarks while maintaining competitive efficiency.


Tags: computer vision, State Space Model, DefMamba, Deformable Scanning