How Mamba-Adaptor Revives State‑Space Models for Vision Tasks

The Mamba-Adaptor introduces a dual‑module adapter that overcomes causal computation limits, long‑range memory decay, and spatial structure loss in state‑space models, delivering state‑of‑the‑art results on ImageNet, COCO, and various downstream visual tasks with minimal overhead.

Data Party THU

Background

State‑space models (SSMs) such as Mamba offer linear computational complexity for long sequences, but three limitations hinder their use in vision:

Causal computation prevents global context exchange.

Long‑range memory decays, causing early information to be forgotten.

Flattening 2‑D images into 1‑D sequences destroys spatial dependencies.
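The second limitation is easiest to see in a toy scalar version of the discretized SSM recurrence (a minimal sketch for intuition only, not Mamba's actual selective parameterization):

```python
# Toy discretized SSM recurrence: h_t = a * h_{t-1} + b * x_t.
# With |a| < 1, the influence of x_0 on h_t shrinks like a**t --
# the long-range memory decay described above.
def ssm_scan(xs, a=0.9, b=0.1):
    h, states = 0.0, []
    for x in xs:
        h = a * h + b * x  # earlier inputs are geometrically damped
        states.append(h)
    return states

# An impulse at t=0 fades from the hidden state over time.
states = ssm_scan([1.0] + [0.0] * 9)
```

After ten steps the impulse's contribution has shrunk to a**9 of its initial value, which is why tokens early in a flattened image sequence are effectively forgotten.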

Mamba‑Adaptor Overview

The CVPR 2025 paper Mamba‑Adaptor: State Space Model Adaptor for Visual Recognition introduces a plug‑and‑play dual‑module adapter that directly addresses the three issues while adding negligible overhead. The adapter consists of:

Adapter‑T (temporal) – a learnable memory‑selection mechanism.

Adapter‑S (spatial) – multi‑scale dilated convolutions that restore spatial structure.

Adapter‑T: Mitigating Memory Decay

Adapter‑T predicts the K hidden states most likely to be forgotten using lightweight linear layers, generates softmax weights to aggregate the selected memories, and processes them in parallel (similar to multi‑head attention). This dynamic selection improves performance by +0.3 % on ImageNet classification and +0.6 % on COCO detection compared with static selection.
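A minimal PyTorch sketch of this kind of top‑K memory selection follows; all module and parameter names are ours, not the paper's, and the exact scoring network may differ from the authors' implementation:

```python
import torch
import torch.nn as nn

class AdapterT(nn.Module):
    """Sketch of a temporal memory-selection adapter (hypothetical names).

    Scores all T cached hidden states with a lightweight linear layer,
    keeps the top-K per head, and aggregates them with softmax weights,
    loosely analogous to multi-head attention.
    """
    def __init__(self, dim, k=4, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.k, self.heads = k, heads
        self.score = nn.Linear(dim, heads)   # one relevance score per head
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states):        # (B, T, C)
        B, T, C = hidden_states.shape
        ch = C // self.heads
        s = self.score(hidden_states)        # (B, T, H) relevance scores
        topk = s.topk(self.k, dim=1)         # select K states per head
        w = topk.values.softmax(dim=1)       # (B, K, H) aggregation weights
        h = hidden_states.view(B, T, self.heads, ch)
        idx = topk.indices.unsqueeze(-1).expand(B, self.k, self.heads, ch)
        sel = h.gather(1, idx)               # (B, K, H, C/H) chosen memories
        agg = (w.unsqueeze(-1) * sel).sum(1) # weighted memory aggregation
        return self.proj(agg.reshape(B, C))  # (B, C) summary memory
```

The per-head selection lets different heads recover different forgotten states in parallel, which is the dynamic behavior the ablation credits for the gains over static selection.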

Adapter‑S: Restoring Spatial Structure

Adapter‑S reshapes the Mamba output sequence back into a 2‑D feature map and applies depthwise convolutions with varying dilation rates to capture multi‑scale dependencies. Channel‑wise aggregation keeps the output dimensions unchanged, preserving efficiency. Adding more convolution kernels raises instance‑segmentation AP^m by 0.7 %.
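The reshape-and-convolve step above can be sketched as follows; module names and the specific dilation rates are our assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class AdapterS(nn.Module):
    """Sketch of a spatial adapter (hypothetical names).

    Reshapes the flattened token sequence back to a 2-D map and applies
    depthwise 3x3 convolutions at several dilation rates; summing the
    branches keeps the channel dimension unchanged.
    """
    def __init__(self, dim, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            # padding == dilation keeps the spatial size fixed for 3x3 kernels
            nn.Conv2d(dim, dim, kernel_size=3, padding=d, dilation=d, groups=dim)
            for d in dilations
        )

    def forward(self, tokens, h, w):          # tokens: (B, L, C), L = h * w
        x = tokens.transpose(1, 2).reshape(tokens.shape[0], -1, h, w)
        out = sum(branch(x) for branch in self.branches)  # multi-scale merge
        return out.flatten(2).transpose(1, 2)             # back to (B, L, C)
```

Because the convolutions are depthwise and the branches are summed channel-wise, input and output shapes match, so the module can be dropped between Mamba blocks without touching the rest of the network.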

Adapter architecture diagram

Flexible Application Scenarios

General visual backbone – The architecture includes a patch‑embedding layer, four adapter‑enhanced Mamba stages, and task heads. Results on ImageNet:

b1 (48 channels) achieves 78.4 % top‑1 accuracy, surpassing LocalViM‑T (+2.2 %) and Vim‑T (+2.3 %).

b2 (96 channels) reaches 82.9 % top‑1 accuracy, beating Swin‑T (+2.6 %) and VMamba‑T (+0.2 %).

Pre‑training enhancement – Inserting the adapter into a pretrained VMamba and fine‑tuning for 10 epochs adds only 3.2 % parameters and 6.1 % FLOPs, improving VMamba‑T by 0.1 % and VMamba‑B by 0.2 %.

Efficient transfer‑learning fine‑tuning – Parallel insertion preserves original features. On CIFAR‑100, a 5.56 % parameter increase attains 90 % of full‑fine‑tuning performance; on SVHN and Food‑101 it outperforms linear probing and VPT, while a zero‑initialization trick boosts accuracy by over 13 %.
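A minimal sketch of a parallel adapter with the zero‑initialization trick follows; the bottleneck structure and all names are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class ZeroInitParallelAdapter(nn.Module):
    """Sketch of parallel insertion with zero-init (hypothetical names)."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # branch outputs exactly zero at
        nn.init.zeros_(self.up.bias)     # step 0, preserving pretrained features

    def forward(self, frozen_out, x):
        # Added in parallel: the original features pass through untouched
        # until the adapter learns a non-zero residual.
        return frozen_out + self.up(torch.relu(self.down(x)))
```

At initialization the adapter branch contributes nothing, so fine-tuning starts exactly from the pretrained model's behavior; this is the property the zero-initialization trick exploits.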

Dense Prediction Performance

When used as the backbone of Mask R‑CNN on COCO:

Under the 1× training schedule, the b1 version achieves 43.2 % mAP, 3.9 % higher than EffVMamba‑S.

Under the 3× training schedule, the b2 version improves AP^b by 1.8 % and AP^m by 2.3 % over VMamba‑T, demonstrating superior spatial modeling.

COCO detection results

Implementation Details

Decompose the SSM solver into an optimized Mamba operator and a matrix‑multiplication component.

Replace the identity matrix with a learnable mask to insert the adapter into hidden‑state computation.

Provide two insertion modes: sequential (for training from scratch) and parallel (for transfer learning).
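The mask replacement and the two insertion modes can be sketched together; this is a hedged approximation under our own naming, not the paper's code:

```python
import torch
import torch.nn as nn

class MaskedInsertion(nn.Module):
    """Sketch of adapter insertion via a learnable mask (hypothetical names).

    The mask starts as all-ones, i.e. the identity it replaces, so the
    vanilla hidden-state computation is recovered at initialization.
    Sequential mode routes the block output through the adapter (training
    from scratch); parallel mode adds the adapter as a residual branch
    (transfer learning).
    """
    def __init__(self, block, adapter, dim, mode="sequential"):
        super().__init__()
        self.block, self.adapter, self.mode = block, adapter, mode
        self.mask = nn.Parameter(torch.ones(dim))  # learnable identity

    def forward(self, x):
        h = self.block(x) * self.mask    # mask replaces the identity matrix
        if self.mode == "sequential":
            return self.adapter(h)       # adapter transforms the output
        return h + self.adapter(x)       # adapter adds to the output
```

Splitting the solver this way keeps the optimized Mamba operator untouched; only the cheap matrix-multiplication side is modified, which is consistent with the small parameter and FLOP overhead reported above.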

Integration diagram

Conclusion

Mamba‑Adaptor eliminates the temporal decay and spatial modeling deficiencies of vanilla SSMs while remaining lightweight and easy to integrate. Its plug‑and‑play design demonstrates broad applicability across classification, detection, and segmentation, suggesting that state‑space models can become competitive alternatives to Transformers in computer‑vision research and practice.


Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.