Can Dynamic Computation Reduction Slash Redundancy in Decoder‑Only Multimodal LLMs?
This article analyzes visual token redundancy in decoder‑only multimodal large language models and presents a training‑free dynamic computation reduction framework. Its three components, Probe‑Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm, substantially speed up inference while preserving or even improving model performance.
