Mamba-Adaptor Merges Adaptor‑T and Adaptor‑S to Revolutionize Vision Tasks with State‑of‑the‑Art Benchmarks

The paper introduces Mamba-Adaptor, a plug‑and‑play module combining Adaptor‑T and Adaptor‑S to overcome causal computation, long‑range forgetting, and spatial modeling limits of visual Mamba models, delivering top‑ranked results on ImageNet and COCO across multiple downstream tasks.


1. Introduction

State-space models (SSMs), especially the Mamba variant, have shown strong performance in visual modeling but suffer from three core constraints: (1) causal computation cannot access global context, (2) hidden states are forgotten over long ranges, and (3) spatial modeling is weakened by converting 2-D images into 1-D sequences. To address these, the authors propose a lightweight visual-task Adaptor composed of two modules, Adaptor-T and Adaptor-S.

Adaptor‑T introduces a learnable memory‑selection layer that picks a set of easily‑forgotten positions, mitigating long‑range forgetting. Adaptor‑S employs multi‑scale dilated convolution kernels to inject image‑level inductive bias and strengthen spatial modeling. Both modules expand the context range accessible to causal computation.

2. Related Work

SSMs originated in linear control systems and have been adapted for long-sequence language tasks. Recent works such as Mamba [13] added input-dependent parameterization (S6) to improve context capture. Visual adaptations (e.g., ViM, VMamba) convert 2-D images into 1-D sequences and partially preserve spatial locality via bidirectional or windowed scans, yet they still suffer from long-range forgetting and spatial degradation. Adapter mechanisms, first popularized in NLP, have been applied to vision (e.g., AdaptFormer, VPT, ViT-Adapter) to inject task-specific capacity without full fine-tuning. The proposed Mamba-Adaptor builds on these ideas by directly addressing the three constraints of visual Mamba.

3. Method

3.1 Preliminaries

SSMs model an input sequence x through hidden states h using a linear state matrix A, input matrix B, output matrix C, and skip term D. Discretization via the zero-order hold (ZOH) converts the continuous dynamics into a recurrence suitable for deep-learning frameworks.
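Written out, the continuous-time system and its ZOH discretization take the standard S4/Mamba form (textbook notation, not anything specific to this paper):

h'(t) = A h(t) + B x(t), \quad y(t) = C h(t) + D x(t)

\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1} (\exp(\Delta A) - I) \cdot \Delta B

h_i = \bar{A} h_{i-1} + \bar{B} x_i, \quad y_i = C h_i + D x_i

Here \Delta is the step size; Mamba's S6 variant additionally makes B, C, and \Delta input-dependent functions of x_i.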

3.2 Mamba‑Adaptor Formulation

The Adaptor operates on the hidden-state equation (3) and the output equation (4). For a given index i, the hidden state h_i is the sum of the transformed previous hidden state and the transformed current input (h_i = \bar{A} h_{i-1} + \bar{B} x_i in the notation above). Because causal computation cannot see distant states, the authors augment h_i with a weighted aggregation over a learned set of accessible hidden states S_i:

h_i = \sum_{j \in S_i} w_{ij} \cdot h_j + f(x_i)

where f(x_i) denotes the transformed current input and the w_{ij} are learned weights. This aggregation is the core of Adaptor-T and counteracts the temporal decay of hidden states.

For the output, the hidden state is first reshaped back into a 2-D feature map, and a convolutional aggregation then injects spatial bias: y = \mathrm{Conv}_k(h_{\text{reshape}}), where Conv_k denotes a set of multi-scale dilated kernels. This forms Adaptor-S.

3.3 Adaptor‑T

Adaptor-T replaces hand-crafted selection of neighboring states with a lightweight predictor: one linear layer predicts the coordinates of the m most easily forgotten states, and a second linear layer followed by a SoftMax produces the corresponding aggregation weights. This dynamic selection removes the need for fixed patterns and echoes the intuition behind multi-head attention. A minimal sketch follows.
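The PyTorch sketch below illustrates the selection-and-aggregation idea under stated assumptions: the class and layer names (`AdaptorT`, `coord_pred`, `weight_pred`), the number of selected positions m, and the rounded gather are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptorT(nn.Module):
    """Memory-selection sketch: for each position, pick m other hidden
    states and mix them back in with learned weights."""

    def __init__(self, dim: int, m: int = 4):
        super().__init__()
        self.m = m
        self.coord_pred = nn.Linear(dim, m)   # predicts m positions per token
        self.weight_pred = nn.Linear(dim, m)  # predicts their mixing weights

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, L, D) hidden states of the Mamba layer
        B, L, D = h.shape
        # Map raw predictions to valid sequence indices in [0, L-1].
        idx = (torch.sigmoid(self.coord_pred(h)) * (L - 1)).round().long()  # (B, L, m)
        w = F.softmax(self.weight_pred(h), dim=-1)                          # (B, L, m)
        # Gather the m selected hidden states for every query position.
        gathered = torch.gather(
            h.unsqueeze(1).expand(B, L, L, D),             # (B, L, L, D)
            dim=2,
            index=idx.unsqueeze(-1).expand(B, L, self.m, D),
        )                                                  # (B, L, m, D)
        # Weighted aggregation added back onto the original states.
        return h + (w.unsqueeze(-1) * gathered).sum(dim=2)
```

Because the hard round() blocks gradients to the predicted coordinates, a real implementation would likely use a differentiable sampling scheme; the sketch only conveys the data flow.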

3.4 Adaptor‑S

Adaptor-S applies depth-wise dilated convolutions (one filter per channel), preserving the feature shape while expanding the receptive field. The outputs of several kernels with different dilation rates are summed, providing multi-scale spatial aggregation and a stronger image-level inductive bias, as sketched below.
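A minimal PyTorch sketch of this multi-scale depth-wise branch, assuming a 3x3 kernel and dilation rates (1, 2, 3); these hyperparameters and the reshape convention are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AdaptorS(nn.Module):
    """Multi-scale depth-wise dilated convolutions over the hidden state
    reshaped back to a 2-D feature map (sketch, not the paper's exact code)."""

    def __init__(self, dim: int, dilations=(1, 2, 3), kernel_size: int = 3):
        super().__init__()
        # One depth-wise conv per dilation; "same" padding keeps the spatial
        # shape so the branches can simply be summed.
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size,
                      padding=d * (kernel_size - 1) // 2,
                      dilation=d, groups=dim)
            for d in dilations
        )

    def forward(self, h: torch.Tensor, hw: tuple) -> torch.Tensor:
        # h: (B, L, D) flattened sequence; hw: original (H, W) with H * W == L
        B, L, D = h.shape
        H, W = hw
        x = h.transpose(1, 2).reshape(B, D, H, W)       # back to a 2-D map
        y = sum(branch(x) for branch in self.branches)  # multi-scale sum
        return y.flatten(2).transpose(1, 2)             # back to (B, L, D)
```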

3.5 Implementation

The Adaptor is inserted into a highly optimized Mamba solver that separates the original Mamba operator from an efficient matrix-multiplication path. Two insertion patterns are offered: (1) sequential insertion for training from scratch, and (2) parallel insertion for fine-tuning a pretrained backbone, which preserves the original features (see the sketch below).
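Conceptually, the two patterns differ only in where the Adaptor sits relative to the Mamba block; `mamba_block` and `adaptor` below are hypothetical stand-ins for the real modules.

```python
import torch.nn as nn

class SequentialInsert(nn.Module):
    # Training from scratch: the Adaptor transforms the block output in-line.
    def __init__(self, mamba_block: nn.Module, adaptor: nn.Module):
        super().__init__()
        self.mamba_block, self.adaptor = mamba_block, adaptor

    def forward(self, x):
        return self.adaptor(self.mamba_block(x))

class ParallelInsert(nn.Module):
    # Fine-tuning: the Adaptor output is added as a residual, so pretrained
    # features pass through untouched. With zero-initialized Adaptor weights
    # this is an exact identity at the start of fine-tuning.
    def __init__(self, mamba_block: nn.Module, adaptor: nn.Module):
        super().__init__()
        self.mamba_block, self.adaptor = mamba_block, adaptor

    def forward(self, x):
        return self.mamba_block(x) + self.adaptor(x)
```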

To keep parameter growth low, Adaptor-S shares its weighting coefficients across hidden states: each state receives a fixed-shape coefficient initialized to zero, which reduces linear-layer cost and lets fine-tuning start from the unmodified pretrained behavior.

4. Experiments

4.1 Vision Backbone

Two backbone variants are evaluated: Mamba-Adaptor-b1 (48 base channels) and Mamba-Adaptor-b2 (96 base channels). On ImageNet-1K, b1 achieves 78.4% Top-1 accuracy, surpassing LocalVim-T by 2.2% and Vim-T by 2.3%; b2 reaches 82.9% Top-1, beating Swin-T by 2.6% and VMamba-T by 0.2% at comparable FLOPs.

On COCO, used as the backbone of Mask R-CNN, b1 attains the highest mAP among comparable models, outperforming EffVMamba-S and PVT-T at identical FLOPs.

4.2 Booster Network for Image Recognition

Using pretrained VMamba‑T/S/B as baselines, adding Mamba‑Adaptor and fine‑tuning for an extra 10 epochs improves Top‑1 accuracy by up to 0.2% while increasing parameters by only 3.2% and FLOPs by 6.1%.

4.3 Transfer‑Learning Adapter

Compared against linear probing, full fine‑tuning, and Visual Prompt Tuning (VPT) on CIFAR‑100, SVHN, and Food‑101, Mamba‑Adaptor‑T/S/B consistently yields higher Top‑1 accuracy with 5–9% fewer additional parameters. The parallel insertion form preserves pretrained features, explaining its advantage in transfer scenarios.

4.4 Ablation Studies

Removing Adaptor-T or replacing its learnable selection with a static pattern drops ImageNet Top-1 accuracy by 0.3% and COCO mAP by 0.6%, confirming the benefit of dynamic memory selection. Adding extra depth-wise kernels to Adaptor-S improves dense-prediction performance by 0.7%. Zero-initialization of the Adaptor and parallel insertion together contribute large gains (≈14% Top-1 improvement) in transfer learning.

5. Conclusion and Limitations

The plug-and-play Mamba-Adaptor, comprising Adaptor-T and Adaptor-S, effectively tackles the time-decay (long-range forgetting) and weakened spatial modeling of visual Mamba models. It serves as a universal backbone, a performance-boosting module, and an efficient fine-tuning adapter. Extensive experiments on ImageNet, COCO, and several transfer-learning benchmarks validate its superiority. However, extending the Adaptor to other visual Mamba variants remains an open research direction.

Tags: Vision Transformers, Mamba, State Space Model, Adaptor