Dynamic Vision Mamba: Re‑ordering Pruning and Adaptive Block Selection Cut FLOPs by 35.2%

This article presents Dynamic Vision Mamba (DyVM), a method that tackles token and block redundancy in Mamba‑based visual models through a novel re‑ordering pruning strategy and dynamic block selection, achieving a 35.2% FLOPs reduction with only a 1.7% accuracy loss while demonstrating strong generalization across tasks and architectures.

AIWalker

1. Introduction

Vision Mamba models achieve high computational efficiency compared to attention‑based Vision Transformers, yet they suffer from spatial redundancy at both token and block levels. Token redundancy arises from excessive visual tokens, while block redundancy stems from multiple SSM modules that create throughput bottlenecks.

2. Problem Analysis

Statistical analysis on ImageNet‑1K shows that 94.6% of pixel‑wise attention scores fall below 70%, indicating that many tokens contribute little to the final prediction. Simple mask‑based token pruning, which works well in ViTs, breaks training‑inference consistency in Mamba models because it disrupts the recurrent‑like state propagation.

Similarly, experiments on Vim demonstrate that removing SSM blocks yields up to a 2.83× speed‑up while barely changing FLOPs, confirming that the SSM blocks are a throughput bottleneck rather than a compute bottleneck, i.e., block redundancy.

3. Core Innovations

Re‑ordering Pruning Strategy – During training, tokens are pruned and then reordered so that retained tokens form a contiguous block before entering the next Mamba layer, preserving the sequence order required for inference without extra computation.

Dynamic Block Selection – For each image, a predictor decides how many forward and backward SSM blocks to activate, disabling unnecessary blocks on a per‑sample basis.

End‑to‑End Joint Loss – A combined loss comprising classification, two supervision losses for the token‑pruning and block‑activation ratios, and two distillation losses (KL divergence and token‑level MSE) ensures high performance after pruning and block selection.

4. Methodology

4.1 Token Pruning

DyVM performs multi‑stage pruning. At each stage, a Gumbel‑Softmax‑based predictor produces a binary mask that decides whether each token is kept (probability p) or dropped (probability 1−p). The mask is applied to the token embeddings, and the retained tokens are reordered into a contiguous sequence while preserving their relative order; the class token is re‑inserted at its original position.
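To make the mechanics concrete, here is a minimal PyTorch sketch of one pruning stage under DynamicViT‑style assumptions; the names (`MaskPredictor`, `reorder_prune`) and the predictor's layer sizes are illustrative, not taken from the official DyVM code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskPredictor(nn.Module):
    """Predicts per-token keep/drop logits from token embeddings (class token excluded)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 2),  # logits for [drop, keep]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        logits = self.mlp(x)  # (B, N, 2)
        # Straight-through Gumbel-Softmax: hard 0/1 decisions in the forward
        # pass, differentiable soft probabilities in the backward pass.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 1]  # (B, N), 1 = keep

def reorder_prune(x: torch.Tensor, mask: torch.Tensor):
    """Move kept tokens to the front of the sequence, preserving relative order."""
    # A stable sort on (1 - mask) places kept tokens first without reshuffling
    # them, yielding the contiguous prefix the next Mamba layer expects.
    _, order = torch.sort(1.0 - mask, dim=1, stable=True)  # (B, N)
    x = torch.gather(x, 1, order.unsqueeze(-1).expand(-1, -1, x.size(-1)))
    mask = torch.gather(mask, 1, order)
    return x, mask
```

During training the sequence keeps its full length (dropped tokens stay masked so tensor shapes remain static); at inference the kept prefix can simply be sliced off, which is what restores training‑inference consistency.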

4.2 Dynamic Block Selection

Each Vim layer contains a block selector that takes the class token as input and outputs scores for the forward and backward SSM blocks. The scores are binarized via a Gumbel‑sigmoid to produce block masks, which are multiplied with the block outputs so that inactive blocks are effectively skipped.
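PyTorch has no built‑in Gumbel‑sigmoid, so the sketch below uses a common binary‑concrete formulation; `BlockSelector` and the gating arithmetic are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

def gumbel_sigmoid(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Binary relaxation: sigmoid((logits + logistic noise) / tau), straight-through."""
    u = torch.rand_like(logits).clamp_(1e-6, 1.0 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)  # Logistic(0,1) = difference of two Gumbels
    y_soft = torch.sigmoid((logits + noise) / tau)
    y_hard = (y_soft > 0.5).float()
    return y_hard + (y_soft - y_soft.detach())  # hard forward pass, soft gradients

class BlockSelector(nn.Module):
    """Scores a layer's forward and backward SSM blocks from the class token."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 2)  # logits for [forward block, backward block]

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:  # (B, D)
        return gumbel_sigmoid(self.head(cls_token))  # (B, 2) binary gates

# Inside a layer, the gates multiply the block outputs, e.g.:
#   gates = selector(cls)                                 # (B, 2)
#   out = gates[:, 0, None, None] * fwd_ssm(x) \
#       + gates[:, 1, None, None] * bwd_ssm(x)
# A gated-off block contributes zero during training; at inference its
# computation can be skipped outright for that sample.
```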

4.3 Training Objective

The total loss L = λ₁L_cls + λ₂L_token_sup + λ₃L_block_sup + λ₄L_KL + λ₅L_token_MSE combines:

- Cross‑entropy classification loss (L_cls).
- MSE supervising the token‑pruning ratio at each stage (L_token_sup).
- MSE supervising the average active‑block ratio across all layers (L_block_sup).
- Kullback‑Leibler divergence between student and teacher (original backbone) outputs (L_KL).
- Token‑level MSE aligning retained tokens with the corresponding teacher tokens (L_token_MSE).
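Assembled in code, the objective might look like the sketch below; the argument shapes, ratio bookkeeping, and uniform λ defaults are assumptions for illustration, not the paper's released settings.

```python
import torch
import torch.nn.functional as F

def dyvm_loss(student_logits, teacher_logits, labels,
              keep_ratios, target_keep,        # per-stage kept-token ratios vs. targets
              block_ratio, target_block,       # mean active-block ratio vs. target
              student_tokens, teacher_tokens,  # (B, N, D) final-layer tokens
              keep_mask,                       # (B, N), 1 for retained tokens
              lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    l1, l2, l3, l4, l5 = lambdas
    loss_cls = F.cross_entropy(student_logits, labels)
    # Push each stage's realized keep ratio toward its target.
    loss_token_sup = sum((r - t) ** 2 for r, t in zip(keep_ratios, target_keep)) / len(keep_ratios)
    # Push the average active-block ratio toward its target.
    loss_block_sup = (block_ratio - target_block) ** 2
    # Distill class predictions from the unpruned teacher backbone.
    loss_kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    # Align retained student tokens with the corresponding teacher tokens.
    per_token = ((student_tokens - teacher_tokens) ** 2).mean(dim=-1)  # (B, N)
    loss_token_mse = (per_token * keep_mask).sum() / keep_mask.sum().clamp(min=1)
    return (l1 * loss_cls + l2 * loss_token_sup + l3 * loss_block_sup
            + l4 * loss_kl + l5 * loss_token_mse)
```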

5. Experiments

5.1 Models and Datasets

DyVM is integrated into Vim‑T, Vim‑S, and Vim‑B, and compared against HiddenAlign (HA). Generalization is evaluated with MambaReg, VideoMamba, and UperNet on image classification (ImageNet‑1K), video understanding (Kinetics‑400), and semantic segmentation (ADE20K), respectively.

5.2 Settings

Training runs for 30 epochs with a cosine learning‑rate schedule and a 5‑epoch warm‑up, using batch sizes of 128/64/32 for the Tiny/Small/Base models. Token pruning uses three stages with a target token keep ratio of 0.7, and block selection targets a uniform active‑block ratio across layers.
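For a sense of scale, if the 0.7 keep ratio compounds across the three stages (a DynamicViT‑style reading that is an assumption here, not stated in the article), the token budget shrinks as follows for a standard 196‑token (14×14 patches at 224×224) input:

```python
n_tokens, ratio = 196, 0.7  # 14x14 patches; per-stage keep ratio
for stage in range(1, 4):
    kept = round(n_tokens * ratio ** stage)
    print(f"stage {stage}: ~{kept} tokens kept ({ratio ** stage:.3f} of the original)")
# stage 1: ~137 tokens kept (0.700 of the original)
# stage 2: ~96 tokens kept (0.490 of the original)
# stage 3: ~67 tokens kept (0.343 of the original)
```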

5.3 Main Results

On Vim‑S, DyVM reduces FLOPs by 35.2% while incurring only 1.7% top‑1 accuracy loss, outperforming HA. Similar FLOPs reductions and modest accuracy drops are observed on VideoMamba and MambaReg, confirming strong cross‑architecture generalization. Ablation studies show that increasing the number of pruning stages improves accuracy, and that using raw token inputs for the mask predictor yields the best performance.

5.4 Throughput and Visualization

DyVM accelerates inference on various devices, with larger models (e.g., Vim‑B) benefiting the most. Visualizations of token attention heatmaps and block‑selection masks illustrate how redundant tokens are pruned and how block paths are customized per image.

6. Limitations

- Minor performance degradation (≈1.7% top‑1 accuracy loss) despite the large FLOPs savings.
- The method is tailored to Mamba‑based visual backbones and may not transfer directly to other architectures.
- The learnable predictors and multi‑stage pruning add training complexity.

7. Conclusion

DyVM introduces a re‑ordering pruning mechanism and dynamic block selection that together resolve training‑inference inconsistency and substantially cut computational cost in Vision Mamba models. The approach maintains competitive accuracy, generalizes across tasks and model families, and opens avenues for designing more efficient Mamba‑based vision architectures.

Tags: computer vision, model efficiency, token pruning, dynamic block selection, FLOPs reduction, Vision Mamba