How Dynamic Computation Cuts Redundancy in Decoder-Only Multimodal LLMs
This article examines visual token redundancy in decoder-only multimodal large language models and introduces a training-free dynamic computation reduction framework, built on Probe-Activated Dynamic FFN, Hollow Attention, and a layer ranking algorithm, that substantially lowers inference cost while preserving performance.
Two Main Architectures of Multimodal Large Models
Multimodal large language models (MLLMs) mainly adopt either a decoder‑only architecture or a cross‑attention architecture. The decoder‑only design processes visual and textual tokens jointly with full self‑attention and feed‑forward network (FFN) layers, yielding strong performance but high computational cost, especially when high‑resolution images produce many visual tokens. The cross‑attention design inserts a dedicated cross‑modal interaction layer, shortening the token sequences that each self‑attention block must handle and thus lowering computation, with a modest performance trade‑off.
Root Causes of Visual Token Redundancy
Five factors explain why decoder‑only models waste computation on visual tokens:
Modal nature difference: Visual tokens carry fine‑grained image details that become redundant after early layers, while textual tokens require deep processing.
Repeated processing across layers: Similar self‑attention and FFN transformations are applied to visual tokens in many layers without adding new information.
Long‑sequence burden: High‑resolution inputs generate far more visual tokens than textual tokens.
Structured/clustered redundancy: Certain layers contribute little to overall performance and can be omitted.
Attention locality redundancy: Global self‑attention over visual tokens performs many ineffective operations.
Training‑Free Dynamic Computation Reduction
The proposed method reduces inference cost without token compression or model retraining. It consists of two dynamic modules and a layer‑ranking algorithm.
Probe‑Activated Dynamic FFN
Standard FFNs apply two linear transformations (W1 → activation → W2) to every visual token. The dynamic variant first samples a small set of visual tokens as probes, computes their intermediate activations, and ranks hidden dimensions by the absolute mean of those activations. Only the top‑K dimensions stay active for the current inference pass; the rest are skipped.
Advantage: No additional training is required; the model selects parameters on‑the‑fly, substantially lowering visual‑token processing cost.
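The probe-then-prune idea can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the ReLU activation, and the parameters `num_probes` and `top_k_ratio` are illustrative assumptions.

```python
import numpy as np

def probe_activated_ffn(visual_tokens, W1, W2, num_probes=8, top_k_ratio=0.5, rng=None):
    """Sketch of a probe-activated dynamic FFN (hypothetical API).

    visual_tokens: (n_tokens, d_model)
    W1: (d_model, d_hidden), W2: (d_hidden, d_model)
    """
    rng = rng or np.random.default_rng(0)
    n_tokens, _ = visual_tokens.shape
    d_hidden = W1.shape[1]

    # 1. Sample a small set of visual tokens as probes.
    probe_idx = rng.choice(n_tokens, size=min(num_probes, n_tokens), replace=False)
    probes = visual_tokens[probe_idx]

    # 2. Compute the probes' intermediate activations and score each hidden
    #    dimension by the absolute mean of those activations.
    probe_hidden = np.maximum(probes @ W1, 0.0)  # ReLU as a stand-in activation
    scores = np.abs(probe_hidden).mean(axis=0)   # shape: (d_hidden,)

    # 3. Keep only the top-K hidden dimensions for this inference pass.
    k = max(1, int(top_k_ratio * d_hidden))
    active = np.argsort(scores)[-k:]

    # 4. Run all visual tokens through the pruned FFN: only the active
    #    columns of W1 and rows of W2 are ever touched.
    hidden = np.maximum(visual_tokens @ W1[:, active], 0.0)
    return hidden @ W2[active, :]
```

Because slicing `W1[:, active]` and `W2[active, :]` shrinks both matrix multiplies, the FFN cost for visual tokens falls roughly in proportion to `top_k_ratio`, with no retraining.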
Hollow Attention
Instead of full‑global self‑attention, Hollow Attention restricts attention to a local window around each token, introducing structural sparsity and cutting the quadratic cost of global attention.
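One way to realize this structural sparsity is as a boolean attention mask. The sketch below is an assumption-laden illustration: the article only specifies a local window for visual tokens, so keeping full attention for textual tokens (and between modalities) is a design choice made here for the example, and `window` is a hypothetical parameter.

```python
import numpy as np

def hollow_attention_mask(n_visual, n_text, window=4):
    """Build a boolean attention mask (True = attention allowed).

    Assumptions for this sketch: the sequence is [visual tokens | text tokens];
    each visual token attends only to visual tokens within +/- `window`
    positions, while text tokens retain full (global) attention.
    """
    n = n_visual + n_text
    mask = np.zeros((n, n), dtype=bool)

    # Visual tokens: local window over visual positions only.
    for i in range(n_visual):
        lo, hi = max(0, i - window), min(n_visual, i + window + 1)
        mask[i, lo:hi] = True

    # Text tokens attend globally, and all tokens may attend to text tokens.
    mask[n_visual:, :] = True
    mask[:, n_visual:] = True
    return mask
```

With many visual tokens, the visual-visual block of the mask is mostly False, so the quadratic cost of global attention over visual tokens drops to roughly linear in the number of visual tokens.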
Layer Ranking Algorithm
Different layers contribute unevenly to model performance. The algorithm computes a ranking score for each layer using metrics such as activation magnitude, gradient norm, or output change. Layers with low scores are prioritized for computation reduction, while performance‑critical layers retain full computation.
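The selection step amounts to sorting layers by importance and reducing the least important fraction first. A minimal sketch, assuming per-layer scores have already been computed by one of the metrics above (the function name and `reduce_fraction` parameter are hypothetical):

```python
def rank_layers_for_reduction(layer_scores, reduce_fraction=0.5):
    """Return the indices of layers to apply computation reduction to.

    layer_scores: one importance score per layer (e.g. activation magnitude
    or output change); lower score = more redundant = reduced first.
    reduce_fraction: fraction of layers to reduce (0.5 matches the paper's
    half-layer result).
    """
    n_reduce = int(len(layer_scores) * reduce_fraction)
    # Sort layer indices by ascending importance, take the least important.
    order = sorted(range(len(layer_scores)), key=lambda i: layer_scores[i])
    return sorted(order[:n_reduce])
```

For example, with scores `[0.9, 0.1, 0.5, 0.2]` and a 50% budget, layers 1 and 3 (the two lowest-scoring) would be selected for reduction while layers 0 and 2 keep full computation.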
Experimental Validation
Half‑Layer Reduction Preserves Performance
Reducing visual‑token computation in roughly 50% of layers keeps benchmark performance stable and can even improve it on some tasks. Pushing the reduction further causes sharp degradation, especially when FFN pruning is aggressive.
Redundancy Concentrated in Visual Tokens
Comparing "visual‑only computation reduction" with "full‑token reduction" shows that limiting computation to visual tokens retains performance, whereas reducing all tokens causes a steep drop, confirming that redundancy is primarily in visual‑token handling.
Compatibility with Existing Token‑Compression Techniques
The method reduces FLOPs to about 50% of the original while maintaining accuracy, and it can be combined seamlessly with other token‑compression approaches.
Conclusion
Visual‑token redundancy in decoder‑only MLLMs stems from modal differences, repeated processing across layers, long‑sequence burden, structured/clustered inefficiency, and attention locality; it can be precisely identified and exploited.
The dynamic computation reduction framework—combining Probe‑Activated Dynamic FFN, Hollow Attention, and the Layer Ranking Algorithm—cuts inference cost by roughly half with negligible performance loss, and it is compatible with other acceleration techniques.
Because the approach is training‑free and low‑intrusion, it enables deployment of high‑performance decoder‑only multimodal models on compute‑constrained devices such as edge servers or mobile platforms.
Paper: RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder‑Only MLLMs
URL: https://arxiv.org/abs/2501.19036v3