How Skip-Vision Cuts Multimodal Model Costs by Up to 75% Without Losing Accuracy

Skip‑Vision introduces a token‑skipping framework for vision‑language models that dramatically reduces training and inference FLOPs—saving 22%–40% of training time and 40%–75% of inference cost—while preserving performance on benchmarks such as MMBench, MMVet, and MMStar.

AI Frontier Lectures

Technical Background

Vision‑language models (VLMs) split an image into hundreds to thousands of visual tokens that are processed by every Transformer layer, incurring heavy compute and memory costs.

Training stage: every token passes through the Self‑Attention (SA) and Feed‑Forward Network (FFN) sub‑layers of every layer, driving up GPU hours.

Inference stage: all tokens are stored in the KV‑Cache, inflating memory usage and latency; for example, LLaVA processes a single image with ~2×10¹² FLOPs and >150 ms latency.

Skip‑Vision Core Method

Skip‑Vision reduces redundancy by classifying visual tokens into retained and skipped. Retained tokens traverse the full decoder; skipped tokens are merged and processed only in the self‑attention sub‑layer, bypassing the FFN.
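A minimal PyTorch sketch of this split is below; the layer name, the boolean `keep` mask, and the masking shortcut are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SkipVisionLayer(nn.Module):
    """Illustrative decoder layer: skipped tokens bypass the FFN."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); keep: (seq,) bool, True = retained token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # every token gets the SA update
        x = x + attn_out
        ffn_out = self.ffn(self.norm2(x))
        # Apply the FFN residual only to retained tokens; skipped tokens
        # bypass it. (A real implementation would gather just the retained
        # rows into the FFN so the FLOPs are actually saved.)
        return x + ffn_out * keep.view(1, -1, 1)

layer = SkipVisionLayer()
x = torch.randn(2, 10, 256)
keep = torch.tensor([True] * 4 + [False] * 6)  # last 6 visual tokens skipped
print(layer(x, keep).shape)                    # torch.Size([2, 10, 256])
```

Masking after computing the FFN keeps the sketch short; the compute savings come from running the FFN only over the retained rows.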

1. Training Phase – Skip‑FFN

Empirically, many visual tokens change little after the FFN, so Skip‑Vision skips the FFN for those tokens. The mechanism:

- Compute the magnitude ratio of each token's features before and after the FFN; tokens with a low ratio of change are marked as skipped (see the selection sketch below).
- Skipped tokens are merged (e.g., via token merging or pooling) and participate only in the self‑attention computation.
- Retained tokens continue through all layers as usual.

This yields a 22%–40% reduction in training FLOPs and memory with negligible performance loss on models such as LLaVA.
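A hedged sketch of the selection step; the exact ratio definition and the 0.1 threshold are assumptions for illustration, not the paper's formula.

```python
import torch

def mark_skipped(x: torch.Tensor, ffn_out: torch.Tensor,
                 threshold: float = 0.1) -> torch.Tensor:
    """Flag tokens whose FFN update is small relative to their input norm.

    x, ffn_out: (seq, d_model) features entering the FFN and the FFN's
    residual output for the same tokens.
    """
    ratio = ffn_out.norm(dim=-1) / (x.norm(dim=-1) + 1e-6)
    return ratio < threshold  # True = skip the FFN for this token

# Toy check: tokens 5..9 receive a tiny FFN update and get marked as skipped.
x = torch.randn(10, 64)
ffn_out = torch.randn(10, 64)
ffn_out[5:] *= 0.01
print(mark_skipped(x, ffn_out))
```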

2. Inference Phase – Skip KV‑Cache

During multimodal decoding, early attention layers already aggregate most visual information into a small set of summary tokens. Skip‑Vision removes the redundant visual tokens from the KV‑Cache, keeping only retained and summary tokens. The result is a 40%–75% reduction in inference FLOPs and an 18%–45% latency decrease.
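A minimal sketch of the cache pruning; the (batch, heads, seq, head_dim) layout and the `keep_idx` bookkeeping are assumptions for illustration.

```python
import torch

def prune_kv_cache(k: torch.Tensor, v: torch.Tensor,
                   keep_idx: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Keep only retained and summary entries in one layer's KV-Cache.

    k, v: (batch, n_heads, seq, head_dim); keep_idx: sequence positions of
    the retained and summary tokens.
    """
    return k[:, :, keep_idx], v[:, :, keep_idx]

# Toy cache over 576 visual tokens: keep 64 retained + 8 summary positions.
k = torch.randn(1, 8, 576, 64)
v = torch.randn(1, 8, 576, 64)
keep_idx = torch.cat([torch.arange(64), torch.arange(568, 576)])
k, v = prune_kv_cache(k, v, keep_idx)
print(k.shape)  # torch.Size([1, 8, 72, 64]): later steps attend over 72 keys
```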

3. Summary Token Mechanism

Before discarding skipped tokens, their information is projected onto a few summary tokens via attention. These summary tokens continue to participate in subsequent layers, ensuring essential visual knowledge is preserved.
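One way to realize this projection is cross-attention from a few learned summary queries over the skipped tokens; the sketch below assumes that design and is not necessarily the paper's exact operator.

```python
import torch
import torch.nn.functional as F

def summarize(skipped: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
    """Pool skipped-token features into summary tokens via attention.

    skipped: (n_skipped, d) features of tokens about to be discarded;
    queries: (n_summary, d) learned summary queries. Returns (n_summary, d).
    """
    scores = queries @ skipped.T / skipped.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ skipped

skipped = torch.randn(512, 64)            # skipped visual tokens
queries = torch.randn(8, 64)              # 8 summary queries (count is a guess)
print(summarize(skipped, queries).shape)  # torch.Size([8, 64])
```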

4. Theoretical Guarantee

The authors provide an upper bound on the approximation error of Skip‑FFN under spectral‑norm assumptions on the Transformer's layers, showing that the error is tightly bounded and consistent with empirical measurements.
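The paper's exact statement is not reproduced here; purely as an illustration of the shape such a bound takes, a standard perturbation argument under spectral-norm assumptions gives:

```latex
% Illustrative shape of a Skip-FFN error bound (not the paper's exact result).
% Assume every sub-layer Jacobian has spectral norm at most \sigma and the
% skipped FFN update at layer \ell has norm \epsilon_\ell. Propagating each
% per-layer perturbation through the remaining layers yields
\[
  \bigl\| h^{(L)}_{\mathrm{skip}} - h^{(L)} \bigr\|
  \;\le\; \sum_{\ell=1}^{L} \sigma^{\,L-\ell}\,\epsilon_\ell ,
\]
% so when the skipped updates are small (\epsilon_\ell \approx 0, as measured
% for visual tokens), the final-layer error stays tightly controlled.
```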

Experimental Results

Evaluations on multimodal benchmarks (MMBench, MMVet, MMStar) demonstrate that Skip‑Vision maintains accuracy comparable to the full‑token baseline while achieving the reported efficiency gains.

Performance‑efficiency trade‑off curves for LLaMA‑3 8B and Vicuna‑1.5 7B illustrate the same trend. Extended evaluations on additional models confirm the generality of the approach.

[Figure: Feature magnitude ratio before and after the FFN shows visual tokens receive far smaller updates than text tokens]
[Figure: Performance and efficiency trade‑off curves for LLaMA3‑8B]
[Figure: Performance and efficiency trade‑off curves for Vicuna‑1.5‑7B]
[Figure: Extended evaluation of Skip‑Vision on various models]

Paper: https://arxiv.org/abs/2503.21817. Project page with code and resources: https://zwl666666.github.io/Skip-Vision/.

Code example
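The listing below is a compact, self-contained sketch tying the steps above together: self-attention for all tokens, the FFN applied only to tokens whose update is large, and attention pooling of the skipped tokens into summary tokens. All sizes, thresholds, and names are illustrative assumptions, not the released Skip-Vision API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, seq = 64, 96                        # toy sizes (illustrative)
x = torch.randn(1, seq, d)             # fused text + visual token features

attn = nn.MultiheadAttention(d, 4, batch_first=True)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

# 1) Self-attention update for every token.
h, _ = attn(x, x, x)
x = x + h

# 2) Probe the FFN update; retain only tokens with a large relative change.
upd = ffn(x)
ratio = upd.norm(dim=-1) / (x.norm(dim=-1) + 1e-6)
keep = ratio > ratio.quantile(0.75)    # keep top 25% (budget is a guess)
x = x + upd * keep.unsqueeze(-1)       # retained tokens get the FFN residual

# 3) Pool the skipped tokens into a few summary tokens for later layers.
skipped = x[0, ~keep[0]]
queries = torch.randn(4, d)            # 4 learned summary queries (a guess)
summary = torch.softmax(queries @ skipped.T / d ** 0.5, dim=-1) @ skipped

# Later layers and the KV-Cache only ever see the compact sequence.
compact = torch.cat([x[0, keep[0]], summary])
print(f"{seq} tokens -> {compact.shape[0]} retained + summary tokens")
```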

Tags: Transformer Optimization · Multimodal Efficiency · Skip-Vision · Token Skipping