How Dynamic Computation Cuts Redundancy in Decoder-Only Multimodal LLMs
This article examines visual token redundancy in decoder-only multimodal large language models and introduces a training-free dynamic computation reduction framework, built on Probe-Activated Dynamic FFN, Hollow Attention, and a layer ranking algorithm, that substantially lowers inference cost while preserving performance.
Two Main Architectures of Multimodal Large Models
Multimodal large language models (MLLMs) mainly adopt either a decoder‑only architecture or a cross‑attention architecture. The decoder‑only design processes visual and textual tokens jointly with full self‑attention and feed‑forward network (FFN) layers, yielding strong performance but high computational cost, especially when high‑resolution images produce many visual tokens. The cross‑attention design inserts a dedicated cross‑modal interaction layer, shortening the token sequences that each self‑attention block must handle and thus lowering computation, with a modest performance trade‑off.
Root Causes of Visual Token Redundancy
Five factors explain why decoder‑only models waste computation on visual tokens:
Modal nature difference: Visual tokens carry fine‑grained image details that become redundant after early layers, while textual tokens require deep processing.
Repeated processing across layers: Similar self‑attention and FFN transformations are applied to visual tokens in many layers without adding new information.
Long‑sequence burden: High‑resolution inputs generate far more visual tokens than textual tokens.
Structured/clustered redundancy: Certain layers contribute little to overall performance and can be omitted.
Attention locality redundancy: Global self‑attention over visual tokens performs many ineffective operations.
Training‑Free Dynamic Computation Reduction
The proposed method reduces inference cost without token compression or model retraining. It consists of two dynamic modules and a layer‑ranking algorithm.
Probe‑Activated Dynamic FFN
Standard FFNs apply two linear transformations (W1 → activation → W2) to every visual token. The dynamic variant first samples a small set of visual tokens as probes, computes their intermediate activations, and ranks hidden dimensions by the absolute mean of those activations. Only the top‑K dimensions stay active for the current inference pass; the rest are skipped.
Advantage: No additional training is required; the model selects parameters on‑the‑fly, substantially lowering visual‑token processing cost.
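The probe-then-prune idea can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the ReLU activation, and the parameters `num_probes` and `top_k_ratio` are illustrative assumptions.

```python
import numpy as np

def probe_activated_ffn(visual_tokens, W1, W2, num_probes=8, top_k_ratio=0.5, rng=None):
    """Sketch of a probe-activated dynamic FFN (hypothetical API).

    visual_tokens: (n_tokens, d_model)
    W1: (d_model, d_hidden), W2: (d_hidden, d_model)
    """
    rng = rng or np.random.default_rng(0)
    n_tokens, _ = visual_tokens.shape
    d_hidden = W1.shape[1]

    # 1. Sample a small set of visual tokens as probes.
    probe_idx = rng.choice(n_tokens, size=min(num_probes, n_tokens), replace=False)
    probes = visual_tokens[probe_idx]

    # 2. Compute the probes' intermediate activations and score each hidden
    #    dimension by the absolute mean of those activations.
    probe_hidden = np.maximum(probes @ W1, 0.0)  # ReLU as a stand-in activation
    scores = np.abs(probe_hidden).mean(axis=0)   # shape: (d_hidden,)

    # 3. Keep only the top-K hidden dimensions for this inference pass.
    k = max(1, int(top_k_ratio * d_hidden))
    active = np.argsort(scores)[-k:]

    # 4. Run all visual tokens through the pruned FFN: only the active
    #    columns of W1 and rows of W2 are ever touched.
    hidden = np.maximum(visual_tokens @ W1[:, active], 0.0)
    return hidden @ W2[active, :]
```

Because slicing `W1[:, active]` and `W2[active, :]` shrinks both matrix multiplies, the FFN cost for visual tokens falls roughly in proportion to `top_k_ratio`, with no retraining.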
Hollow Attention
Instead of full‑global self‑attention, Hollow Attention restricts attention to a local window around each token, introducing structural sparsity and cutting the quadratic cost of global attention.
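One way to realize this structural sparsity is as a boolean attention mask. The sketch below is an assumption-laden illustration: the article only specifies a local window for visual tokens, so keeping full attention for textual tokens (and between modalities) is a design choice made here for the example, and `window` is a hypothetical parameter.

```python
import numpy as np

def hollow_attention_mask(n_visual, n_text, window=4):
    """Build a boolean attention mask (True = attention allowed).

    Assumptions for this sketch: the sequence is [visual tokens | text tokens];
    each visual token attends only to visual tokens within +/- `window`
    positions, while text tokens retain full (global) attention.
    """
    n = n_visual + n_text
    mask = np.zeros((n, n), dtype=bool)

    # Visual tokens: local window over visual positions only.
    for i in range(n_visual):
        lo, hi = max(0, i - window), min(n_visual, i + window + 1)
        mask[i, lo:hi] = True

    # Text tokens attend globally, and all tokens may attend to text tokens.
    mask[n_visual:, :] = True
    mask[:, n_visual:] = True
    return mask
```

With many visual tokens, the visual-visual block of the mask is mostly False, so the quadratic cost of global attention over visual tokens drops to roughly linear in the number of visual tokens.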
Layer Ranking Algorithm
Different layers contribute unevenly to model performance. The algorithm computes a ranking score for each layer using metrics such as activation magnitude, gradient norm, or output change. Layers with low scores are prioritized for computation reduction, while performance‑critical layers retain full computation.
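The selection step amounts to sorting layers by importance and reducing the least important fraction first. A minimal sketch, assuming per-layer scores have already been computed by one of the metrics above (the function name and `reduce_fraction` parameter are hypothetical):

```python
def rank_layers_for_reduction(layer_scores, reduce_fraction=0.5):
    """Return the indices of layers to apply computation reduction to.

    layer_scores: one importance score per layer (e.g. activation magnitude
    or output change); lower score = more redundant = reduced first.
    reduce_fraction: fraction of layers to reduce (0.5 matches the paper's
    half-layer result).
    """
    n_reduce = int(len(layer_scores) * reduce_fraction)
    # Sort layer indices by ascending importance, take the least important.
    order = sorted(range(len(layer_scores)), key=lambda i: layer_scores[i])
    return sorted(order[:n_reduce])
```

For example, with scores `[0.9, 0.1, 0.5, 0.2]` and a 50% budget, layers 1 and 3 (the two lowest-scoring) would be selected for reduction while layers 0 and 2 keep full computation.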
Experimental Validation
Half‑Layer Reduction Preserves Performance
Reducing visual‑token computation in roughly 50% of layers keeps benchmark performance stable and can even improve it on some tasks. Pushing the reduction further causes sharp degradation, especially when FFN pruning is aggressive.
Redundancy Concentrated in Visual Tokens
Comparing "visual‑only computation reduction" with "full‑token reduction" shows that limiting computation to visual tokens retains performance, whereas reducing all tokens causes a steep drop, confirming that redundancy is primarily in visual‑token handling.
Compatibility with Existing Token‑Compression Techniques
The method reduces FLOPs to about 50% of the original while maintaining accuracy, and it can be combined seamlessly with other token‑compression approaches.
Conclusion
Visual‑token redundancy in decoder‑only MLLMs stems from modal differences, repeated processing across layers, long‑sequence burden, structured/clustered inefficiency, and attention locality; it can be precisely identified and exploited.
The dynamic computation reduction framework—combining Probe‑Activated Dynamic FFN, Hollow Attention, and the Layer Ranking Algorithm—cuts inference cost by roughly half with negligible performance loss, and it is compatible with other acceleration techniques.
Because the approach is training‑free and low‑intrusion, it enables deployment of high‑performance decoder‑only multimodal models on compute‑constrained devices such as edge servers or mobile platforms.
Paper: RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder‑Only MLLMs
URL: https://arxiv.org/abs/2501.19036v3