How Mamba-Adaptor Revives State‑Space Models for Vision Tasks
The Mamba-Adaptor introduces a dual‑module adapter that overcomes causal computation limits, long‑range memory decay, and spatial structure loss in state‑space models, delivering state‑of‑the‑art results on ImageNet, COCO, and various downstream visual tasks with minimal overhead.
Background
State‑space models (SSMs) such as Mamba offer linear computational complexity for long sequences, but three limitations hinder their use in vision:
Causal computation prevents global context exchange.
Long‑range memory decays, causing early information to be forgotten.
Flattening 2‑D images into 1‑D sequences destroys spatial dependencies.
Mamba‑Adaptor Overview
The CVPR 2025 paper Mamba‑Adaptor: State Space Model Adaptor for Visual Recognition introduces a plug‑and‑play dual‑module adapter that directly addresses the three issues while adding negligible overhead. The adapter consists of:
Adapter‑T (temporal) – a learnable memory‑selection mechanism.
Adapter‑S (spatial) – multi‑scale dilated convolutions that restore spatial structure.
Adapter‑T: Mitigating Memory Decay
Adapter‑T predicts the K hidden states most likely to be forgotten using lightweight linear layers, generates softmax weights to aggregate the selected memories, and processes them in parallel (similar to multi‑head attention). This dynamic selection improves performance by +0.3 % on ImageNet classification and +0.6 % on COCO detection compared with static selection.
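A minimal PyTorch sketch of this selection‑and‑aggregation idea may help; it is our reconstruction from the description above, not the authors' code, and the module name AdapterT and the parameters k and num_heads are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdapterT(nn.Module):
    """Illustrative memory-selection adapter: score hidden states with a
    lightweight linear layer, pick the top-K most decay-prone ones, and
    aggregate them with softmax weights across parallel heads."""
    def __init__(self, dim: int, k: int = 4, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.k, self.num_heads, self.head_dim = k, num_heads, dim // num_heads
        self.score = nn.Linear(dim, num_heads)   # one relevance score per head
        self.proj = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, L, D) sequence of SSM hidden states
        B, L, D = h.shape
        scores = self.score(h)                            # (B, L, H)
        topv, topi = scores.topk(self.k, dim=1)           # (B, K, H)
        w = F.softmax(topv, dim=1)                        # weights over the K picks
        hh = h.view(B, L, self.num_heads, self.head_dim)  # (B, L, H, Dh)
        idx = topi.unsqueeze(-1).expand(-1, -1, -1, self.head_dim)
        picked = hh.gather(1, idx)                        # (B, K, H, Dh)
        mem = (w.unsqueeze(-1) * picked).sum(dim=1)       # (B, H, Dh)
        mem = mem.reshape(B, 1, D)                        # broadcast back over L
        return h + self.proj(mem)                         # residual memory injection
```

The learned top‑K gather is what distinguishes this from the static selection that the ablation above compares against.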
Adapter‑S: Restoring Spatial Structure
Adapter‑S reshapes the Mamba output sequence back to a 2‑D feature map and applies depthwise convolutions with varying dilation rates to capture multi‑scale dependencies. Channel‑wise aggregation keeps the output dimensions unchanged, preserving efficiency. Adding more convolution kernels raises instance‑segmentation AP^m by +0.7 %.
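A corresponding sketch of the spatial module, again a reconstruction under stated assumptions (the dilation rates and the 1×1 fusion layer are our choices, not necessarily the paper's):

```python
import torch
import torch.nn as nn

class AdapterS(nn.Module):
    """Illustrative spatial adapter: reshape the token sequence back to a
    2-D map, apply depthwise convolutions at several dilation rates, and
    aggregate the branches so the channel count is unchanged."""
    def __init__(self, dim: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=3, padding=d, dilation=d, groups=dim)
            for d in dilations
        )
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)  # channel-wise aggregation

    def forward(self, x: torch.Tensor, hw: tuple[int, int]) -> torch.Tensor:
        # x: (B, L, D) flattened Mamba output; hw: original (H, W) with L == H*W
        B, L, D = x.shape
        H, W = hw
        feat = x.transpose(1, 2).reshape(B, D, H, W)      # back to 2-D layout
        out = self.fuse(sum(branch(feat) for branch in self.branches))
        return (feat + out).flatten(2).transpose(1, 2)    # residual, reflattened
```

Because each branch is depthwise and padding matches the dilation rate, the spatial size and channel width are preserved, which keeps the added cost small.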
Flexible Application Scenarios
General visual backbone – The architecture includes a patch‑embedding layer, four adapter‑enhanced Mamba stages, and task heads. Results on ImageNet:
b1 (48 channels) achieves 78.4 % top‑1 accuracy, surpassing LocalVim‑T (+2.2 %) and Vim‑T (+2.3 %).
b2 (96 channels) reaches 82.9 % top‑1, beating Swin‑T (+2.6 %) and VMamba‑T (+0.2 %).
Pre‑training enhancement – Inserting the adapter into a pretrained VMamba and fine‑tuning for 10 epochs adds only 3.2 % more parameters and 6.1 % more FLOPs, improving VMamba‑T by 0.1 % and VMamba‑B by 0.2 %.
Efficient transfer‑learning fine‑tuning – Parallel insertion preserves original features. On CIFAR‑100, a 5.56 % parameter increase attains 90 % of full‑fine‑tuning performance; on SVHN and Food‑101 it outperforms linear probing and VPT, while a zero‑initialization trick boosts accuracy by over 13 %.
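One common way to realize the zero‑initialization trick mentioned above is to scale the parallel adapter branch by a weight that starts at zero, so the pretrained model's behavior is untouched at the start of fine‑tuning. A hedged sketch, assuming a single scalar gate (the gate's placement is our assumption, not the paper's exact formulation):

```python
import torch
import torch.nn as nn

class ParallelAdapterBlock(nn.Module):
    """Wrap a frozen pretrained block with a parallel adapter branch.
    The zero-initialized gate makes the wrapped block exactly reproduce
    the original output at step 0, which stabilizes fine-tuning."""
    def __init__(self, pretrained_block: nn.Module, adapter: nn.Module):
        super().__init__()
        self.block = pretrained_block
        self.adapter = adapter
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: adapter starts silent
        for p in self.block.parameters():         # only the adapter is trained
            p.requires_grad = False

    def forward(self, x):
        return self.block(x) + self.gate * self.adapter(x)
```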
Dense Prediction Performance
When used as the backbone of Mask‑RCNN on COCO:
Under the 1× training schedule, the b1 version achieves 43.2 % mAP, 3.9 % higher than EffVMamba‑S.
Under the 3× training schedule, the b2 version improves AP^b by 1.8 % and AP^m by 2.3 % over VMamba‑T, demonstrating superior spatial modeling.
Implementation Details
Decompose the SSM solver into an optimized Mamba operator and a matrix‑multiplication component.
Replace the identity matrix with a learnable mask to insert the adapter into hidden‑state computation (sketched after this list).
Provide two insertion modes: sequential (for training from scratch) and parallel (for transfer learning).
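The learnable‑mask idea can be pictured as blending the raw hidden states with adapter‑augmented ones through a mask that starts as the identity; the blending form below is our assumption, not the paper's formulation:

```python
import torch
import torch.nn as nn

class MaskedStateMix(nn.Module):
    """Hedged sketch of the learnable-mask insertion: where the vanilla
    solver would pass hidden states through an identity, a learnable mask
    blends the raw state with the adapter's memory-augmented state."""
    def __init__(self, dim: int, adapter: nn.Module):
        super().__init__()
        self.adapter = adapter
        self.mask = nn.Parameter(torch.ones(dim))  # behaves as identity when mask == 1

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, L, D) hidden states from the optimized Mamba operator
        return self.mask * h + (1.0 - self.mask) * self.adapter(h)
```

In the sequential mode this mixing sits inside the block's forward path; in the parallel mode the adapter output is added as a side branch, as in the zero‑initialized wrapper shown earlier.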
Conclusion
Mamba‑Adaptor eliminates the temporal decay and spatial modeling deficiencies of vanilla SSMs while remaining lightweight and easy to integrate. Its plug‑and‑play design demonstrates broad applicability across classification, detection, and segmentation, suggesting that state‑space models can become competitive alternatives to Transformers in computer‑vision research and practice.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
