How Dynamic Computation Cuts Redundancy in Decoder-Only Multimodal LLMs
This article examines visual token redundancy in decoder-only multimodal large language models and introduces a training-free framework for dynamic computation reduction. Its three components (Probe-Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm) significantly lower inference cost while preserving performance.
