TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models

TwigVLM introduces a lightweight “twig” module that prunes visual tokens early and enables self‑speculative decoding, achieving up to 154% speedup on long‑text generation while preserving 96% of the original LVLM's accuracy, as demonstrated on LLaVA‑1.5‑7B and other models.

Data Party THU

Background

Large vision‑language models (LVLMs) such as GPT‑4V achieve strong multimodal understanding but incur high computational cost and latency, especially due to the large number of visual tokens.

Limitations of Existing Acceleration Methods

Accuracy bottleneck

Token‑pruning methods (e.g., FastV, SparseVLM) rely on shallow‑layer attention to score visual tokens. Experiments show that using deeper‑layer attention yields higher‑quality importance scores and improves accuracy at the same pruning ratio.
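
To make the scoring idea concrete, the sketch below ranks visual tokens by the attention they receive from the subsequent text tokens in one layer's attention map. The tensor shapes and the `vis_start`/`vis_len` arguments are illustrative assumptions, not the exact formulation of any cited method; the key point is that which layer the map is taken from determines the ranking quality.

```python
import torch

def visual_token_importance(attn: torch.Tensor, vis_start: int, vis_len: int) -> torch.Tensor:
    # `attn` is assumed to be one layer's attention map from the prefilling
    # pass, shape (heads, seq_len, seq_len); shapes here are illustrative.
    attn_mean = attn.mean(dim=0)                       # average over heads
    text_queries = attn_mean[vis_start + vis_len:, :]  # query rows after the image span
    scores = text_queries[:, vis_start:vis_start + vis_len].mean(dim=0)
    return scores                                      # one score per visual token

# The pruning ratio fixes k; the layer `attn` comes from decides how good the
# ranking is (deeper layers give better scores in the paper's analysis).
# keep_idx = visual_token_importance(attn, vis_start, vis_len).topk(64).indices
```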

Speed bottleneck

LVLM inference consists of a prefilling stage, which builds the KV cache for the multimodal prompt, and a decoding stage, which generates output tokens autoregressively. Decoding time grows linearly with output length and dominates total latency for long responses. Existing methods mainly accelerate prefilling, leaving decoding speed largely unchanged.
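
A back-of-the-envelope latency model makes the bottleneck concrete; the per-stage costs below are made-up numbers, purely for illustration.

```python
# Prefilling processes the whole prompt in one parallel pass, while decoding
# costs one serial forward pass per generated token.
def total_latency_ms(prefill_ms: float, per_token_ms: float, n_out: int) -> float:
    return prefill_ms + per_token_ms * n_out

print(total_latency_ms(300, 30, 10))    # short answer: ~600 ms, prefill matters
print(total_latency_ms(300, 30, 300))   # long answer: ~9300 ms, decoding dominates
```

For long outputs the decoding term dominates, so a method that only shrinks the prefill cost leaves most of the latency untouched.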

Proposed Method: TwigVLM

TwigVLM introduces a lightweight “twig” module attached to a frozen backbone VLM and a dual‑stage inference scheme.

The twig block replicates the transformer architecture of the backbone (T transformer layers plus a classification head). It is initialized by copying backbone weights, and only the twig parameters are fine‑tuned while the backbone stays frozen.
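
A minimal construction sketch follows; the attribute names (`backbone.layers`, `backbone.lm_head`) and the choice of which layers to copy are assumptions for illustration, not the released implementation.

```python
import copy
import torch.nn as nn

def build_twig(backbone, K: int, T: int):
    # Copy T consecutive transformer layers (here, the ones right after the
    # split point at layer K) plus the LM head, so that backbone layers
    # [0..K) followed by the twig form a small, self-contained draft model.
    twig_layers = nn.ModuleList(
        copy.deepcopy(backbone.layers[K + i]) for i in range(T)
    )
    twig_head = copy.deepcopy(backbone.lm_head)

    # The backbone stays frozen; only the copied twig layers and head train.
    for p in backbone.parameters():
        p.requires_grad_(False)
    return twig_layers, twig_head
```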

Twig‑guided Token Pruning (TTP): The final‑layer attention of the twig, being closer to the training loss, provides high‑quality importance scores for visual tokens. These scores guide pruning at an early backbone layer, allowing up to 88.9% of visual tokens to be removed while preserving accuracy.
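
The sketch below shows the pruning step itself, reusing the attention-scoring idea from the earlier sketch but taking the map from the twig's final layer; all shapes and argument names are illustrative assumptions.

```python
import torch

def twig_guided_prune(hidden, twig_attn, vis_start, vis_len, keep=64):
    # `hidden` are activations at the early backbone layer where pruning
    # happens (seq_len, dim); `twig_attn` is the twig's final-layer attention
    # map (heads, seq_len, seq_len). Both shapes are assumptions.
    scores = twig_attn.mean(dim=0)[vis_start + vis_len:,
                                   vis_start:vis_start + vis_len].mean(dim=0)
    keep_vis = scores.topk(keep).indices + vis_start

    # Keep every non-visual token and only the top-k visual tokens, in order.
    idx = torch.arange(hidden.size(0))
    non_vis = (idx < vis_start) | (idx >= vis_start + vis_len)
    keep_idx = torch.cat([idx[non_vis], keep_vis]).sort().values
    return hidden[keep_idx], keep_idx
```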

Self‑Speculative Decoding (SSD): The shallow model (backbone up to layer K plus the twig) cheaply drafts several candidate tokens (draft). The full model then verifies the candidates in a single batched pass (verify), turning the inherently serial decoding into a largely parallel operation and increasing GPU utilization.
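
A conceptual draft-then-verify loop is sketched below; `draft_step` stands in for the shallow path (backbone up to layer K plus the twig) and `verify_step` for one batched pass of the full backbone. Both are placeholders, not the repository's API.

```python
import torch

@torch.no_grad()
def self_speculative_decode(draft_step, verify_step, prompt_ids, gamma=4, max_new=128):
    out = []
    while len(out) < max_new:
        # 1) Draft: the shallow model proposes gamma tokens, cheaply but serially.
        draft = []
        for _ in range(gamma):
            draft.append(draft_step(prompt_ids + out + draft))

        # 2) Verify: the full model checks all gamma draft positions in one
        #    parallel pass and returns the accepted prefix plus its own next
        #    token, so at least one token is committed per iteration.
        out += verify_step(prompt_ids + out, draft)
    return out
```

Because rejected drafts fall back to the full model's own prediction, the output matches what the backbone alone would generate; the speedup comes from committing several tokens per full-model pass.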

Figure: TwigVLM architecture and effect

Experimental Evaluation

Models evaluated: LLaVA‑1.5‑7B, LLaVA‑NeXT‑7B, Video‑LLaVA‑7B.

Accuracy preservation: With 88.9% visual token pruning (≈64 tokens retained), TwigVLM achieves 96.0% relative accuracy, outperforming FastV (77.0%) and SparseVLM (89.9%). It also exceeds VisionZip by 0.8 percentage points.

Speedup: On the long‑text generation benchmark MM‑Vet, TwigVLM attains a 154% relative speedup over the baseline, whereas FastV and VisionZip achieve only ~104%–106%.

Qualitative attention visualizations demonstrate that TTP focuses on task‑relevant tokens (e.g., jersey numbers) while competing methods miss critical details.

Figure: Attention visualization comparison
Figure: Speed comparison on short and long generation tasks

Implementation Details

Training copies backbone weights into the twig and fine‑tunes only the twig parameters; together with the first K backbone layers, the T twig layers form the shallow draft model. The twig is inserted after layer K (e.g., K = 2). After training, the model supports both TTP and SSD without modifying the original backbone.
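
A training-loop sketch, continuing the build_twig sketch above; the data loader, the placeholder `forward_first_k` call, and the optimizer settings are illustrative and not taken from the released code.

```python
import torch
import torch.nn.functional as F

twig_layers, twig_head = build_twig(backbone, K=2, T=3)   # K, T: illustrative values
optimizer = torch.optim.AdamW(
    list(twig_layers.parameters()) + list(twig_head.parameters()), lr=2e-5
)

for batch in loader:
    with torch.no_grad():                                 # frozen backbone prefix
        h = backbone.forward_first_k(batch["inputs"], k=2)   # placeholder call
    for layer in twig_layers:                             # trainable twig layers
        h = layer(h)
    logits = twig_head(h)
    loss = F.cross_entropy(logits.flatten(0, 1), batch["labels"].flatten())
    optimizer.zero_grad()
    loss.backward()                                       # gradients reach only the twig
    optimizer.step()
```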

Code and pretrained weights are released at https://github.com/MILVLG/twigvlm. The paper is available at https://arxiv.org/abs/2503.14075.

References

Chen et al., “An Image is Worth 1/2 Tokens After Layer 2: Plug‑and‑Play Inference Acceleration for Large Vision‑Language Models.”

Zhang et al., “SparseVLM: Visual Token Sparsification for Efficient Vision‑Language Model Inference.”

Leviathan et al., “Fast inference from transformers via speculative decoding.”

Liu et al., “Improved Baselines with Visual Instruction Tuning.”

Li et al., “LLaVA‑NeXT: Improved reasoning, OCR, and world knowledge.”

Lin et al., “Video‑LLaVA: Learning United Visual Representation by Alignment Before Projection.”

Yang et al., “VisionZip: Longer is Better but Not Necessary in Vision Language Models.”

Code example
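
The sketch below shows how the two stages fit together at inference time. Every name is an illustrative placeholder and does not reflect the API of the released repository at https://github.com/MILVLG/twigvlm.

```python
import torch

@torch.no_grad()
def generate(model, twig, image, prompt, keep=64, gamma=4, max_new=256):
    # Prefilling with twig-guided token pruning (TTP).
    vis = model.encode_image(image)                 # visual tokens (576 for LLaVA-1.5)
    txt = model.tokenize(prompt)
    hidden = model.prefill_to_layer_k(vis, txt)     # shallow backbone pass (placeholder)
    scores = twig.score_visual_tokens(hidden)       # twig's final-layer attention (placeholder)
    hidden = prune(hidden, scores, keep=keep)       # keep top-k visual tokens (placeholder)
    cache = model.prefill_remaining(hidden)         # finish the KV cache (placeholder)

    # Decoding with self-speculative decoding (SSD).
    out = []
    while len(out) < max_new:
        draft = twig.draft(cache, out, n=gamma)     # cheap shallow proposals (placeholder)
        out += model.verify(cache, out, draft)      # one parallel verification pass (placeholder)
    return model.detokenize(out)
```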

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
