DefMamba: How Deformable Scanning Boosts Vision State‑Space Models
DefMamba introduces a deformable visual state‑space model that dynamically adjusts scanning paths and reference points. By preserving spatial structure and capturing features more effectively, it achieves state‑of‑the‑art results on ImageNet classification, COCO detection, and ADE20K segmentation while reducing computational cost.
Introduction
State‑space models (SSMs) such as S4 and Mamba have drawn attention for their linear computational complexity in sequence length, making them attractive alternatives to CNNs and Transformers for visual tasks. However, most visual Mamba variants flatten images with fixed scanning orders, which discards spatial structure and limits feature extraction.
To address this, the authors propose DefMamba, a deformable visual state‑space model that incorporates a multi‑scale backbone and a Deformable Mamba (DM) module capable of dynamically adjusting scanning paths to prioritize important information.
Related Work
Previous works (ViM, VMamba, PlainMamba, QuadMamba, GrootV, etc.) explore various scanning strategies, including raster, local, continuous, and tree‑based scans. However, they either rely on fixed scan orders or adapt only partially to input content, losing spatial cues or remaining insensitive to changes in object detail.
Method
Preliminaries
SSMs model the evolution of hidden states via continuous‑time ODEs and are discretized for sequence‑to‑sequence mapping. Mamba introduces a selective scanning mechanism that conditions parameters on the input, reducing computational cost while expanding the effective receptive field.
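To make this concrete, here is a minimal sequential sketch of the discretized, input-conditioned recurrence behind Mamba-style selective scanning. Shapes and names are illustrative assumptions, and real implementations fuse this loop into a parallel kernel:

    import torch

    def selective_scan(x, A, B, C, delta):
        """Minimal sequential form of a Mamba-style selective scan (illustrative shapes).
        x: (L, D) inputs; A: (D, N) state decay (typically negative); B, C: (L, N)
        input-dependent projections; delta: (L, D) input-dependent step sizes."""
        L, D = x.shape
        N = A.shape[1]
        h = torch.zeros(D, N, dtype=x.dtype, device=x.device)
        ys = []
        for t in range(L):
            dA = torch.exp(delta[t].unsqueeze(-1) * A)   # zero-order-hold discretization of A
            dB = delta[t].unsqueeze(-1) * B[t]           # simplified (Euler) discretization of B
            h = dA * h + dB * x[t].unsqueeze(-1)         # hidden-state update, (D, N)
            ys.append((h * C[t]).sum(-1))                # readout y_t = C h_t, (D,)
        return torch.stack(ys)                           # (L, D) output sequence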
Overall Architecture
DefMamba adopts a generic multi‑scale backbone similar to many CNN and Transformer designs. An input image is first split into patches, producing a 2‑D feature map with spatial dimensions H×W and channel dimension C. Four hierarchical stages follow; each stacks Deformable Mamba (DM) blocks, and all but the last stage end with a down‑sampling layer. The final feature map is average‑pooled and fed to a classification head.
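A schematic of this hierarchy might look like the following sketch. The depths, widths, and down-sampling choices here are placeholders rather than the paper's exact configurations, and the DM blocks are stubbed out (they are sketched next):

    import torch.nn as nn

    class DefMambaBackbone(nn.Module):
        """Schematic four-stage hierarchy; dims/depths are illustrative placeholders.
        DM blocks are represented by nn.Identity and would wrap flatten/reshape logic."""
        def __init__(self, dims=(96, 192, 384, 768), depths=(2, 2, 9, 2), num_classes=1000):
            super().__init__()
            self.patch_embed = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)  # (B, C, H/4, W/4)
            self.stages = nn.ModuleList()
            for i in range(4):
                blocks = nn.Sequential(*[nn.Identity() for _ in range(depths[i])])  # DM blocks here
                down = (nn.Conv2d(dims[i], dims[i + 1], 2, stride=2)
                        if i < 3 else nn.Identity())                # no down-sampling after last stage
                self.stages.append(nn.ModuleDict({"blocks": blocks, "down": down}))
            self.head = nn.Linear(dims[-1], num_classes)

        def forward(self, x):
            x = self.patch_embed(x)
            for stage in self.stages:
                x = stage["down"](stage["blocks"](x))
            return self.head(x.mean(dim=(2, 3)))                    # global average pool -> classifier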
The DM block follows a Transformer‑style layout: LayerNorm → feed‑forward network → Deformable State‑Space Model (DSSM) → residual connection.
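Taken literally, that layout corresponds to a block like the sketch below, with the DSSM left as a placeholder; the paper's internal wiring may differ in detail:

    import torch.nn as nn

    class DMBlock(nn.Module):
        """Literal sketch of the stated layout (LN -> FFN -> DSSM -> residual).
        dssm is a placeholder module; the real DSSM is sketched in later snippets."""
        def __init__(self, dim, mlp_ratio=4, dssm=None):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim),
                nn.GELU(),
                nn.Linear(mlp_ratio * dim, dim),
            )
            self.dssm = dssm if dssm is not None else nn.Identity()

        def forward(self, x):                              # x: (B, L, C) flattened patch tokens
            return x + self.dssm(self.ffn(self.norm(x)))   # single trailing residual, per the layout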
Deformable State‑Space Model (DSSM)
The DSSM introduces two key components: deformable scanning (DS) and a deformable state‑space core. A lightweight sub‑network predicts offset vectors for reference points and token indices. Offsets are constrained to a limited range to preserve relative geometry and are applied in parallel to reduce overhead.
Reference points are generated, normalized to the range [-1, 1], and added to the learned offsets, yielding deformable points. Bilinear interpolation extracts features at these points. To compensate for the loss of positional encoding caused by point movement, a learnable relative‑offset bias matrix (down‑sampled to limit parameters) is added to the interpolated features.
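A hedged sketch of this point-deformation pipeline follows, combining the lightweight offset sub-network from the previous paragraph with grid-sample bilinear interpolation and a down-sampled positional bias. The head design, offset range, and 8×8 bias resolution are illustrative assumptions, not the paper's exact choices:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DeformableSampler(nn.Module):
        """Sketch of deformable point sampling: a lightweight head predicts bounded
        offsets for a normalized reference grid, features are bilinearly interpolated
        at the shifted points, and a learnable bias restores positional information."""
        def __init__(self, dim, offset_range=0.25):
            super().__init__()
            self.offset_head = nn.Conv2d(dim, 2, kernel_size=3, padding=1)  # lightweight sub-network
            self.offset_range = offset_range                 # bounded shifts preserve relative geometry
            self.pos_bias = nn.Parameter(torch.zeros(1, dim, 8, 8))  # down-sampled learnable bias

        def forward(self, feat):                             # feat: (B, C, H, W)
            B, C, H, W = feat.shape
            # Reference grid normalized to [-1, 1], the convention grid_sample expects.
            ys = torch.linspace(-1, 1, H, device=feat.device)
            xs = torch.linspace(-1, 1, W, device=feat.device)
            ref = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (H, W, 2) as (y, x)
            ref = ref.flip(-1).unsqueeze(0).expand(B, -1, -1, -1)             # (B, H, W, 2) as (x, y)
            # Bounded offsets via tanh, so points stay near their references.
            offsets = torch.tanh(self.offset_head(feat)).permute(0, 2, 3, 1) * self.offset_range
            points = (ref + offsets).clamp(-1, 1)                             # deformable points
            out = F.grid_sample(feat, points, mode="bilinear", align_corners=True)
            # Up-sample the bias to full resolution and add it to the interpolated features.
            bias = F.interpolate(self.pos_bias, size=(H, W), mode="bilinear", align_corners=True)
            return out + bias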
Token indices are offset in the same way. Since the offsets are fractional, a sorting algorithm ranks tokens by offset magnitude to produce a new 1‑D sequence. Because sorting truncates gradients, the gradients are averaged across the sequence dimension and copied back to the original token positions.
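Under one plausible reading, the forward re-sequencing looks like the sketch below, sorting tokens by their offset-shifted indices; the gradient-averaging trick described above would live in a custom autograd function and is only noted in comments:

    import torch

    def deformable_token_order(tokens, index_offsets):
        """Sketch of deformable token ordering: each token's 1-D index is shifted by
        its predicted fractional offset, and tokens are re-sequenced by sorting the
        shifted indices. argsort is non-differentiable, so a faithful implementation
        would wrap this in a torch.autograd.Function that averages the incoming
        gradient over the sequence dimension and copies it back to the original
        token positions. tokens: (B, L, C); index_offsets: (B, L)."""
        B, L, C = tokens.shape
        base = torch.arange(L, device=tokens.device, dtype=index_offsets.dtype)
        order = torch.argsort(base.unsqueeze(0) + index_offsets, dim=1)       # (B, L) new order
        return torch.gather(tokens, 1, order.unsqueeze(-1).expand(-1, -1, C))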
Experiments
Image Classification
DefMamba variants (T, S, B) are trained on ImageNet‑1K with AdamW, a cosine‑annealing learning‑rate schedule, and standard data augmentations. DefMamba‑T achieves 78.6% top‑1 accuracy, surpassing RegNetY‑800M by 2.3% and DeiT‑Ti by 6.4%, while using 60% fewer FLOPs than PlainMamba‑L1.
Object Detection
Using Mask R‑CNN on COCO 2017, DefMamba‑S reaches 47.5 box mAP and 42.8 mask mAP, outperforming ResNet‑50, Swin‑T, and ConvNeXt‑T, and narrowing the gap with VMamba‑T while lowering computational load by 4%.
Semantic Segmentation
Integrated into UperNet on ADE20K, DefMamba‑S attains 48.8 mIoU (single‑scale) and 49.6 mIoU (multi‑scale), exceeding ResNet‑50, Swin‑T, ConvNeXt‑T and recent SSM‑based methods.
Ablation Studies
Removing the deformable branch reduces ImageNet accuracy by 1.7%, while including it adds only 0.1 GFLOPs of compute. Component‑wise ablations (deformable points, deformable token ordering, offset bias, channel attention) show each component contributes a 0.2‑0.4% gain, and their combination yields up to a 1% boost.
Limitations
DefMamba struggles when images contain partially visible objects or multiple objects arranged in regular patterns: in these cases the deformable scanning may revert to a fixed pattern or exhibit lazy learning, since the token‑wise information changes little.
Conclusion
DefMamba addresses the spatial‑information loss of fixed‑scan visual models by introducing a deformable scanning strategy that moves reference points and adapts the scan order, resulting in superior performance across classification, detection, and segmentation benchmarks while maintaining competitive efficiency.
References
[1] DefMamba: Deformable Visual State Space Model