Can Dynamic Computation Reduction Slash Redundancy in Decoder‑Only Multimodal LLMs?
This article analyzes the visual token redundancy in decoder‑only multimodal large language models and presents a training‑free dynamic computation reduction framework—including Probe‑Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm—that dramatically speeds up inference while preserving or even improving model performance.
Multimodal large language models (MLLMs) mainly follow two architectural paradigms:
Decoder‑only (e.g., LLaVA, InternVL2): visual tokens are concatenated with text tokens and processed jointly by self‑attention and feed‑forward networks (FFN). This yields strong performance, but self‑attention and FFN computation over the visual tokens dominates the cost, often exceeding 90% of total FLOPs.
Cross‑attention (e.g., Flamingo): visual information is injected via a cross‑attention module, bypassing visual self‑attention and FFN, which reduces computation at the expense of a modest performance drop.
Core problem: Is the heavy visual‑token computation redundant, and can this redundancy be exploited to accelerate inference without retraining?
1. Research Goals and Technical Contributions
The study proposes a completely training‑free inference acceleration framework that dynamically reduces or skips computation for selected visual tokens.
Analyze visual‑token redundancy in pretrained decoder‑only MLLMs using a lightweight probing procedure.
Identify structured redundancy that can guide layer‑wise reduction.
Introduce three complementary mechanisms: Probe‑Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm.
1.1 Probe‑Activated Dynamic FFN
In a standard Transformer layer, each visual token passes through a full FFN (two linear projections W1 → activation → W2). The dynamic variant proceeds as follows:
Sample a small subset of visual tokens (M << N) from the full sequence.
Pass the sampled tokens through the first projection and activation, then take the mean absolute value of each intermediate dimension as an importance score for the corresponding column of W1 and row of W2.
Select the top‑K dimensions with highest scores, yielding index sets S1 and S2.
During inference, activate only the sub‑matrices W1[:, S1] and W2[S2, :] for all visual tokens; the remaining parameters are skipped.
Advantage: No additional training is required; the model decides on‑the‑fly which FFN parameters to evaluate, dramatically cutting visual‑token computation.
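To make the steps above concrete, here is a minimal PyTorch sketch of the idea, not the authors' exact implementation. It assumes a standard two‑layer FFN with weights w1 of shape (d_model, d_ff) and w2 of shape (d_ff, d_model), a GELU activation, and illustrative hyperparameters probe_ratio and keep_ratio.

```python
import torch
import torch.nn.functional as F

def dynamic_ffn(x_visual, w1, b1, w2, b2, probe_ratio=0.1, keep_ratio=0.5):
    """Probe-activated dynamic FFN (sketch).

    x_visual: (N, d_model) visual-token hidden states
    w1: (d_model, d_ff), w2: (d_ff, d_model)
    """
    n_tokens = x_visual.shape[0]
    n_probe = max(1, int(n_tokens * probe_ratio))

    # 1) Sample a small probe subset of visual tokens (M << N).
    probe_idx = torch.randperm(n_tokens)[:n_probe]
    probe = x_visual[probe_idx]

    # 2) Score each intermediate dimension by its mean absolute activation
    #    on the probe tokens.
    probe_hidden = F.gelu(probe @ w1 + b1)           # (M, d_ff)
    scores = probe_hidden.abs().mean(dim=0)          # (d_ff,)

    # 3) Keep only the top-K intermediate dimensions (index set S).
    k = max(1, int(w1.shape[1] * keep_ratio))
    keep = scores.topk(k).indices

    # 4) Run all visual tokens through the reduced sub-FFN:
    #    only columns keep of w1 and rows keep of w2 are evaluated.
    hidden = F.gelu(x_visual @ w1[:, keep] + b1[keep])
    return hidden @ w2[keep, :] + b2
```

Text tokens would still pass through the full FFN; only the visual-token path is reduced.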
1.2 Hollow Attention
Global self‑attention over all visual tokens is expensive and often redundant. Hollow Attention replaces it with a sparse, local attention pattern:
Each visual token attends only to a fixed‑size window (e.g., 256 neighboring tokens) instead of the entire visual sequence.
Cross‑modal attention between text and visual tokens is preserved unchanged, ensuring multimodal information flow.
Result: Unnecessary visual‑visual interactions are eliminated, yielding substantial resource savings.
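A rough sketch of how such a sparse pattern could be expressed as an attention mask, assuming the sequence is ordered as [visual tokens, text tokens] and that any causal mask is applied separately; the helper name and window size are illustrative.

```python
import torch

def hollow_attention_mask(num_visual, num_text, window=256):
    """Boolean attention mask (True = attention allowed).

    Visual-to-visual attention is restricted to a local window,
    while any pair involving a text token is left unrestricted,
    preserving cross-modal information flow.
    """
    total = num_visual + num_text
    mask = torch.ones(total, total, dtype=torch.bool)

    # Restrict the visual-visual block to a banded local window.
    idx = torch.arange(num_visual)
    local = (idx[:, None] - idx[None, :]).abs() <= window // 2
    mask[:num_visual, :num_visual] = local
    return mask
```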
1.3 Layer Ranking Algorithm
Not all layers tolerate computation reduction equally. The algorithm ranks layers by visual‑token importance using inexpensive statistics such as:
Mean absolute activation magnitude per layer.
Gradient‑based sensitivity (if available).
Output variance when a small probe set is processed.
Procedure:
Run a few forward passes on a validation subset to collect the chosen metrics.
Sort layers from most to least critical.
Prioritize reduction (dynamic FFN + Hollow Attention) on the lowest‑ranked (least critical) layers first, stopping when a target FLOPs budget is reached.
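A hedged sketch of this procedure using only the first statistic (mean absolute visual-token activation, collected via forward hooks). The attribute model.layers, the calib_batches list, and visual_slice are assumptions about how the model and data are exposed, not part of the original method description.

```python
import torch

@torch.no_grad()
def rank_layers_by_visual_activation(model, calib_batches, visual_slice):
    """Rank decoder layers by mean absolute visual-token activation.

    Returns layer indices ordered from most reduction-tolerant to least,
    i.e. the order in which to apply dynamic FFN + Hollow Attention.
    """
    stats = [0.0 for _ in model.layers]
    counts = [0 for _ in model.layers]

    def make_hook(i):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            stats[i] += hidden[:, visual_slice].abs().mean().item()
            counts[i] += 1
        return hook

    hooks = [layer.register_forward_hook(make_hook(i))
             for i, layer in enumerate(model.layers)]

    # A few forward passes on a small calibration/validation subset.
    for batch in calib_batches:
        model(**batch)

    for h in hooks:
        h.remove()

    means = [s / max(c, 1) for s, c in zip(stats, counts)]
    # Assumption: lower mean visual activation -> layer is more tolerant
    # to computation reduction, so it is reduced earlier.
    return sorted(range(len(means)), key=lambda i: means[i])
```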
2. Experimental Validation
Benchmarks were conducted on two state‑of‑the‑art decoder‑only MLLMs: InternVL2‑8B and Qwen2‑VL‑7B.
Applying the dynamic reduction to roughly 50% of the layers (selected by the ranking algorithm) preserves or slightly improves accuracy while cutting inference time by ~30%.
Reducing computation on text tokens leads to a pronounced performance drop, confirming that redundancy is primarily in the visual stream.
Combining the proposed methods with existing visual‑token compression techniques (e.g., token pruning) yields additive gains, achieving up to 50% FLOPs reduction with negligible accuracy loss.
Key quantitative observations (illustrated in the figures):
When self‑attention or FFN computation is pruned layer by layer, performance remains stable until roughly half of the layers are affected; beyond that point, FFN pruning degrades results more sharply.
Selective reduction on visual tokens alone maintains task performance, whereas reducing all tokens causes rapid degradation (ChartQA evaluation).
Table 1 (see image) compares three training‑free acceleration strategies—VTW, FastV (token‑count reduction), and the proposed method (per‑token computation reduction). At a FLOPs ratio of ~0.5, the proposed approach keeps accuracy within 1% of the baseline and can be combined with token‑count methods.
3. Main Conclusions
Visual processing in decoder‑only multimodal LLMs contains substantial redundancy. By dynamically deactivating low‑impact FFN parameters, restricting visual self‑attention to local windows, and targeting layers with low visual‑token importance, inference can be accelerated dramatically without retraining. The framework is compatible with existing token‑pruning methods, reduces energy consumption, and enables deployment of large multimodal models on resource‑constrained hardware.
Paper: RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs