MobileMamba: Lightweight Multi‑Receptive‑Field Backbone Beats Existing Mamba Models

MobileMamba introduces a three‑stage, lightweight backbone with a multi‑receptive‑field feature‑interaction module that combines wavelet‑enhanced Mamba, multi‑kernel depthwise convolutions, and redundant‑mapping reduction, delivering up to 83.6% ImageNet Top‑1 accuracy while running 21× faster than LocalVim and 3.3× faster than EfficientVMamba.


Background

Lightweight vision models originally relied on CNN designs such as MobileNet and GhostNet, which cut computation with depthwise separable convolutions. However, CNNs have limited effective receptive fields (ERFs), especially on high‑resolution inputs, and therefore model long‑range dependencies poorly.

Vision Transformers (ViT) provide global receptive fields and strong long‑distance modeling, but their quadratic computational complexity makes them unsuitable for high‑resolution, low‑power scenarios. Hybrid CNN‑Transformer architectures combine local and global cues, yet the quadratic cost remains a bottleneck.

State‑space models (e.g., Mamba) capture long‑range dependencies with linear complexity. Lightweight Mamba variants such as LocalMamba and EfficientVMamba report low FLOPs, but empirical measurements show poor inference throughput.
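The scaling gap between attention and a state-space scan can be made concrete with a back-of-the-envelope FLOP count (an illustrative sketch, not figures from the paper; the `state` size is an assumed typical value):

```python
# Illustrative FLOP scaling: self-attention grows quadratically with the
# token count N, while a state-space scan grows linearly in N. This is
# why Mamba-style blocks suit high-resolution, low-power scenarios.

def attention_flops(n_tokens: int, dim: int) -> int:
    # Q @ K^T and the attention-weighted V product: two N x N x d matmuls.
    return 2 * n_tokens * n_tokens * dim

def ssm_flops(n_tokens: int, dim: int, state: int = 16) -> int:
    # One recurrent scan step per token, each O(dim * state).
    return n_tokens * dim * state

# Doubling the input resolution quadruples the token count:
# attention cost grows 16x, while the SSM scan grows only 4x.
n_small, n_large, d = 196, 784, 256   # 14x14 vs. 28x28 patch grids
print(attention_flops(n_large, d) / attention_flops(n_small, d))  # 16.0
print(ssm_flops(n_large, d) / ssm_flops(n_small, d))              # 4.0
```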

Design of MobileMamba

Coarse‑grained architecture (Section 3.1): Four‑stage and three‑stage backbones were compared. At equal throughput, the three‑stage network achieved higher Top‑1 accuracy, so MobileMamba adopts a three‑stage backbone as its coarse‑grained framework.

Fine‑grained MRFFI module (Section 3.2): The Multi‑Receptive‑Field Feature Interaction (MRFFI) module splits input channels into three branches:

Wavelet‑enhanced Mamba (WTE‑Mamba) extracts global features while strengthening edge‑detail extraction.

Multi‑kernel depthwise convolution (MK‑DeConv) provides multi‑scale receptive fields efficiently.

Redundant‑mapping identity branch removes channel redundancy in high‑dimensional space, reducing computation.

The fused output combines global and multi‑scale local information, improving high‑frequency detail capture.
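The three-way channel split can be sketched in a few lines of numpy. This is a structural illustration only: the Mamba scan inside WTE-Mamba is stood in for by a lossless one-level Haar round-trip, the learned multi-kernel depthwise convolutions by fixed box filters, and the split ratios and kernel sizes are assumed values, not the paper's.

```python
import numpy as np

def haar_branch(x):
    # x: (c, H, W), H and W even. One-level Haar analysis + synthesis;
    # a lossless round-trip standing in for the WTE-Mamba branch.
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    ll, lh = (a + b + c + d) / 2, (a - b + c - d) / 2
    hl, hh = (a + b - c - d) / 2, (a - b - c + d) / 2
    out = np.empty_like(x)
    out[:, 0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[:, 0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[:, 1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[:, 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

def depthwise_conv(x, k):
    # Same-padding depthwise box filter of odd size k, standing in for a
    # learned multi-kernel depthwise convolution.
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += xp[:, i:i + x.shape[1], j:j + x.shape[2]]
    return out / (k * k)

def mrffi(x, ratios=(0.5, 0.25, 0.25), kernels=(3, 5)):
    c = x.shape[0]
    c1, c2 = int(c * ratios[0]), int(c * ratios[1])
    g_global, g_local, g_id = x[:c1], x[c1:c1 + c2], x[c1 + c2:]
    out1 = haar_branch(g_global)                # wavelet-enhanced branch
    out2 = np.concatenate([depthwise_conv(ch, k)  # multi-kernel DW branch
                           for ch, k in zip(np.array_split(g_local,
                                                           len(kernels)),
                                            kernels)])
    # Identity branch: leave redundant high-dimensional channels untouched.
    return np.concatenate([out1, out2, g_id])

x = np.random.rand(8, 16, 16)
y = mrffi(x)
print(y.shape)  # (8, 16, 16): channel count and resolution are preserved
```

Because only a fraction of the channels pass through the expensive global branch, the module's cost grows with the split ratio rather than the full width, which is the efficiency argument behind the design.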

Training and inference strategies (Section 3.3): Two training strategies, knowledge distillation and extended training epochs, enhance model learning. A test‑time normalization‑layer fusion further accelerates inference.
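Normalization fusion at test time typically means folding a normalization layer's affine statistics into the preceding convolution or linear weights, removing one op per layer. A minimal sketch of this standard trick (the paper's exact fusion details may differ), shown on a linear layer for clarity:

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # Fold y = gamma * ((w @ x + b) - mean) / sqrt(var + eps) + beta
    # into a single affine layer y = w_f @ x + b_f.
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None], beta + (b - mean) * scale

rng = np.random.default_rng(0)
w, b = rng.normal(size=(4, 3)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)

x = rng.normal(size=3)
unfused = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
wf, bf = fold_bn(w, b, gamma, beta, mean, var)
fused = wf @ x + bf
print(np.allclose(unfused, fused))  # True: one matmul replaces two ops
```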

Experimental Results

ImageNet‑1K experiments show MobileMamba achieving up to 83.6% Top‑1 accuracy across FLOPs budgets ranging from 200M to 4G, surpassing state‑of‑the‑art CNN, ViT, and Mamba baselines. Speed comparisons show MobileMamba is 21× faster than LocalVim and 3.3× faster than EfficientVMamba while delivering higher accuracy (+0.7% Top‑1 over LocalVim, +2.0% over EfficientVMamba).

Downstream tasks validate the design:

Mask R‑CNN: +1.3% mAP with 56% higher throughput.

RetinaNet: +2.1% mAP with 4.3× higher throughput.

SSDLite (higher resolution): 24.0% / 29.5% mAP.

Segmentation (DeepLabv3, Semantic FPN, PSPNet): up to 37.4% / 42.7% / 36.9% mIoU with markedly fewer FLOPs.

Compared with CNN‑based MobileNetV2 and ViT‑based MobileViT‑V2 on high‑resolution inputs, MobileMamba gains 7.2% and 0.4% Top‑1 respectively while using only 8.5% and 11.2% of their FLOPs.

Contributions

Three‑stage MobileMamba framework that balances performance and efficiency.

MRFFI module integrating wavelet‑enhanced Mamba, multi‑kernel depthwise convolutions, and redundant‑mapping reduction to enlarge ERF and improve high‑frequency detail extraction.

Training and test‑time strategies that boost both accuracy and speed across a wide range of FLOPs.

Source code and model weights are fully open‑source at https://github.com/lewandofskee/MobileMamba. The paper is available at https://arxiv.org/pdf/2411.15941.

[Figure: Effective receptive field visualizations and FLOPs comparison]
[Figure: MobileMamba architecture overview]
[Figure: Performance vs. throughput of existing Mamba‑based lightweight models]
[Figure: Additional experimental results]
[Figure: Further benchmark charts]
[Figure: MobileMamba structural overview]
Tags: CNN, Transformer, benchmark, Mamba, lightweight vision, MobileMamba, multi‑receptive field, state‑space model
Written by

AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
