What Do Recent Multimodal LLM Papers Reveal About Vision‑Language Models?
This article surveys ten recent multimodal large language model papers, covering vision representation laws, a stricter instruction‑following benchmark, safety impacts of visual adaptation, the Mini‑Gemini architecture, automatic pruning, vision capability boosting, long‑context transfer, efficient token sparsification, math reasoning, and hallucination mitigation.
Introduction
With the rapid rise of large language models, applying them to visual tasks has become a hot research direction. This review selects and summarizes ten papers that explore training, safety analysis, and efficient deployment of multimodal large language models (MLLMs), illustrating the current state of the field.
Law of Vision Representation in MLLMs
The authors propose a “Vision Representation Law” linking model performance to cross‑modal alignment (A) and visual‑representation correspondence (C). They define an AC score as a polynomial combination of A and C, compute A via cosine similarity between CLIP embeddings and target visual embeddings, and obtain C by predicting keypoints from extracted image features. Linear regression on four vision benchmarks shows a 95.72% correlation between AC score and performance, and the AC‑based training strategy can predict and efficiently achieve optimal model performance.
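The fitting step is easy to picture. Below is a minimal sketch, assuming precomputed A and C values for a handful of vision representations, of regressing benchmark accuracy on polynomial terms of (A, C) in the spirit of the AC score described above; the numbers are hypothetical, not the paper's data.

```python
# Hedged sketch of an AC-score-style fit (not the authors' code).
# `A`, `C`, and `benchmark_acc` are hypothetical values for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

A = np.array([0.61, 0.74, 0.82, 0.55, 0.69, 0.78])             # alignment scores
C = np.array([0.48, 0.66, 0.71, 0.40, 0.58, 0.64])             # correspondence scores
benchmark_acc = np.array([52.1, 60.3, 63.8, 49.5, 57.0, 61.9]) # benchmark accuracy

# AC score = linear regression over polynomial terms of (A, C)
X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(np.stack([A, C], axis=1))
model = LinearRegression().fit(X, benchmark_acc)
print("R^2 on fitted data:", model.score(X, benchmark_acc))
```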
MIA‑Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
This benchmark introduces 400 image‑prompt pairs covering diverse scenes (animals, food, landmarks, etc.) and a hierarchy of complex, compositional instructions across five categories. Using GPT‑4o for automated evaluation, the authors reveal persistent shortcomings of current models in adhering to intricate instructions.
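To make the evaluation setup concrete, here is a hedged sketch of an LLM-as-judge call in the spirit of MIA‑Bench, using the OpenAI Python client; the prompt wording and rubric are illustrative assumptions, not the benchmark's actual grading prompt.

```python
# Hedged sketch of automated instruction-following grading with GPT-4o.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(instruction: str, model_answer: str) -> str:
    prompt = (
        "You are grading how faithfully an answer follows every sub-instruction.\n"
        f"Instruction: {instruction}\n"
        f"Answer: {model_answer}\n"
        "For each sub-instruction, state whether it was satisfied, then give a 0-10 overall score."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```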
How Does Vision‑Language Adaptation Impact Model Safety?
Experiments with LLaMA‑2‑Chat 7B and Tulu‑2 7B show that visual instruction fine‑tuning degrades safety even when training data are carefully filtered. Safety‑focused fine‑tuning (SFT) and RLHF improve safety but cannot fully eliminate degradation. Layer‑wise cosine similarity analysis reveals that early layers remain similar (≈1.0) while deeper safety‑critical layers (6‑14) drop to ~0.5 after visual adaptation, indicating substantial internal drift.
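A drift analysis of this kind can be reproduced in a few lines. The sketch below compares per-layer hidden states of the base chat model and a visually adapted counterpart on the same text prompt; the adapted-model path and the probe prompt are assumptions for illustration, not the paper's setup.

```python
# Hedged sketch of a layer-wise drift check between a base and an adapted model.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-chat-hf"
adapted_id = "path/to/visually-adapted-model"  # placeholder path (assumption)

tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)
adapted = AutoModelForCausalLM.from_pretrained(adapted_id)

inputs = tok("How do I safely dispose of old medication?", return_tensors="pt")
with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states
    h_adapt = adapted(**inputs, output_hidden_states=True).hidden_states

# Cosine similarity of flattened hidden states, one value per layer
for layer, (a, b) in enumerate(zip(h_base, h_adapt)):
    sim = F.cosine_similarity(a.flatten(1), b.flatten(1)).item()
    print(f"layer {layer}: cosine similarity {sim:.3f}")
```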
To mitigate this, the authors merge parameters of a safety‑fine‑tuned multimodal model with those of the original model, preserving visual abilities while restoring safety performance.
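A minimal sketch of the merging idea, assuming a simple linear interpolation of matching parameters (the paper's exact merging recipe may differ):

```python
# Hedged sketch: linearly merge shared weights of a safety-fine-tuned model
# and the visually adapted model, keeping multimodal-only parameters intact.
import torch

def merge_state_dicts(safety_sd, multimodal_sd, alpha=0.5):
    """Interpolate parameters present in both models; leave the rest untouched."""
    merged = dict(multimodal_sd)
    for name, w_safe in safety_sd.items():
        if name in merged and merged[name].shape == w_safe.shape:
            merged[name] = alpha * w_safe + (1 - alpha) * merged[name]
    return merged
```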
Mini‑Gemini: Mining the Potential of Multimodality Vision‑Language Models
Mini‑Gemini introduces dual visual encoders: a CLIP‑ViT for low‑resolution inputs and a CNN‑based encoder for high‑resolution images. A block‑information mining module retrieves high‑quality high‑resolution tokens based on low‑resolution tokens, keeping total token count constant while enriching visual information.
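The mining step amounts to cross-attention in which each low-resolution token queries the high-resolution features of its own region, so the output token count never grows. A hedged sketch, with shapes and names as assumptions rather than Mini‑Gemini's implementation:

```python
# Hedged sketch of region-wise mining: low-res tokens act as queries over
# their corresponding high-res features; output count equals the low-res count.
import torch

def patch_info_mining(lr_tokens, hr_tokens):
    """lr_tokens: [B, N, D] low-res queries; hr_tokens: [B, N, K, D] per-region high-res features."""
    q = lr_tokens.unsqueeze(2)                                            # [B, N, 1, D]
    scores = q @ hr_tokens.transpose(-1, -2) / lr_tokens.shape[-1] ** 0.5  # [B, N, 1, K]
    mined = (torch.softmax(scores, dim=-1) @ hr_tokens).squeeze(2)         # [B, N, D]
    return lr_tokens + mined  # enriched tokens, count unchanged

lr = torch.randn(1, 576, 1024)      # e.g. CLIP-ViT tokens (shapes are illustrative)
hr = torch.randn(1, 576, 16, 1024)  # e.g. 16 high-res features per region
print(patch_info_mining(lr, hr).shape)  # torch.Size([1, 576, 1024])
```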
On the data side, the model leverages large‑scale visual‑language instruction data and 13K GPT‑4‑generated instruction‑following examples, supplemented by Stable Diffusion‑generated images. Models ranging from 2B to 34B (including MoE variants) achieve state‑of‑the‑art zero‑shot results, surpassing some closed‑source systems.
SLIMLLAVA: Automatic Pruning for Large Vision‑Language Models
SLIMLLAVA tackles the high resource consumption of MLLMs during deployment. It searches pruning strategies using a small sample set and maximizes generalization via Structural Risk Minimization (SRM). By treating Projector‑layer weights as the search space and optimizing based on Euclidean distance, the method yields a pruning policy that preserves performance on unseen downstream tasks while reducing compute.
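A rough sketch of the search loop implied by this description, with placeholder error and capacity terms standing in for the paper's SRM objective:

```python
# Hedged sketch of an SRM-style pruning search: score each candidate ratio by
# calibration error plus a capacity penalty, then keep the best. The error and
# penalty functions are placeholders, not SLIMLLAVA's formulation.
def srm_prune_search(candidate_ratios, eval_error, complexity_penalty, lam=0.1):
    """eval_error(r): loss on a small calibration set; complexity_penalty(r): capacity proxy."""
    scored = [(r, eval_error(r) + lam * complexity_penalty(r)) for r in candidate_ratios]
    return min(scored, key=lambda x: x[1])[0]

best = srm_prune_search(
    candidate_ratios=[0.1, 0.3, 0.5, 0.7],
    eval_error=lambda r: 0.2 + 0.5 * r**2,   # hypothetical calibration error curve
    complexity_penalty=lambda r: 1.0 - r,    # hypothetical capacity proxy
)
print("selected pruning ratio:", best)
```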
Improving Multimodal LLM Through Boosting Vision Capabilities
The Arcana model incorporates two innovations: (1) Multimodal LoRA (MM‑LoRA) with separate LoRA branches for vision and language, enabling specialized learning; (2) a ladder‑shaped Query Adapter (QLadder) that aggregates intermediate CLIP representations into richer visual features. Together they markedly improve visual perception and downstream task performance.
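A hedged sketch of the MM‑LoRA idea (not Arcana's code): one frozen base projection plus separate low-rank branches, routed by a mask that marks which token positions are visual.

```python
# Hedged sketch of modality-specific LoRA branches on a single linear layer.
import torch
import torch.nn as nn

class MMLoRALinear(nn.Module):
    def __init__(self, dim, rank=16):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad_(False)  # base projection stays frozen
        self.vis_a, self.vis_b = nn.Linear(dim, rank, bias=False), nn.Linear(rank, dim, bias=False)
        self.txt_a, self.txt_b = nn.Linear(dim, rank, bias=False), nn.Linear(rank, dim, bias=False)

    def forward(self, x, is_visual):
        """x: [B, T, D]; is_visual: [B, T] boolean mask marking visual tokens."""
        vis_delta = self.vis_b(self.vis_a(x))
        txt_delta = self.txt_b(self.txt_a(x))
        return self.base(x) + torch.where(is_visual.unsqueeze(-1), vis_delta, txt_delta)

layer = MMLoRALinear(dim=512)
x = torch.randn(2, 10, 512)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, :4] = True  # assume the first 4 positions hold visual tokens
print(layer(x, mask).shape)  # torch.Size([2, 10, 512])
```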
Long Context Transfer from Language to Vision
Instead of visual token reduction, the authors extend the language backbone’s context length, allowing the model to ingest up to 2000 frames (~200K visual tokens) without additional video training. They introduce V‑NIAH, a “needle‑in‑a‑haystack” benchmark for long‑video understanding, and LongVA, which achieves SOTA results on Video‑MME and MLVU with a 7B model.
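The arithmetic behind the approach is simple: with a fixed visual-token budget per frame, frame capacity scales linearly with context length. Using the summary's own figures (2000 frames ≈ 200K tokens, i.e. roughly 100 tokens per frame; the text budget below is an assumption):

```python
# Back-of-the-envelope sketch: frames that fit in a given LLM context window.
def max_frames(context_len, tokens_per_frame=100, text_budget=2_000):
    return (context_len - text_budget) // tokens_per_frame

for ctx in (4_096, 32_768, 224_000):
    print(f"context {ctx:>7}: ~{max_frames(ctx)} frames")
```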
ZIPVL: Efficient Large Vision‑Language Models with Dynamic Token Sparsification and KV‑Cache Compression
ZIPVL addresses both pre‑fill attention bottlenecks and KV‑cache memory limits by dynamically selecting important tokens per layer based on normalized attention scores. Important tokens receive full‑precision KV storage, while others are quantized to low‑bit. Experiments show 2.6× pre‑fill speedup, 50% GPU memory reduction, and only a 0.2% accuracy drop on Video‑MME for LongVA‑7B.
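A hedged sketch of the per-layer split: rank tokens by attention mass, keep a top fraction in full precision, and quantize the remainder. The int8 cast below is a crude stand-in, not ZIPVL's actual low-bit KV format.

```python
# Hedged sketch of importance-based mixed-precision KV storage.
import torch

def split_kv_by_importance(importance, keys, values, keep_ratio=0.3):
    """importance: [T] normalized attention mass per token; keys/values: [T, D]."""
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.zeros_like(importance, dtype=torch.bool)
    keep[torch.topk(importance, k).indices] = True

    full_kv = (keys[keep], values[keep])  # important tokens: full precision

    def quantize_int8(x):                 # crude per-tensor int8 stand-in
        scale = x.abs().max().clamp_min(1e-8) / 127.0
        return (x / scale).round().to(torch.int8), scale

    low_bit_kv = (quantize_int8(keys[~keep]), quantize_int8(values[~keep]))
    return full_kv, low_bit_kv
```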
MathGLM‑Vision: Solving Mathematical Problems with Multimodal LLMs
MathGLM‑Vision is built on a new MathVL fine‑tuning dataset that adds diverse visual information beyond geometry. After supervised fine‑tuning on MathVL, models of various scales outperform existing multimodal math solvers on public benchmarks and a 2000‑question MathVL‑test set, demonstrating the value of richer visual data.
Self‑Introspective Decoding: Alleviating Hallucinations for Large Vision‑Language Models
SID leverages the observation that pretrained LVLMs can assess visual token importance using earlier language and visual context. The Context‑and‑Text‑aware Token Selection (CT2S) strategy discards the least important visual tokens after early decoding layers, reducing hallucinations without extra computation. Experiments confirm lower hallucination rates and comparable overall capability.
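A hedged sketch of the selection step as described above: score visual tokens by the attention they receive in an early decoding layer and keep only the top fraction for later layers. The keep ratio and shapes are assumptions, not SID's exact configuration.

```python
# Hedged sketch of attention-based visual token selection after an early layer.
import torch

def prune_visual_tokens(attn, visual_idx, keep_ratio=0.5):
    """attn: [heads, T_query, T_key] attention from one early decoding layer;
    visual_idx: LongTensor of key positions holding visual tokens."""
    importance = attn.mean(dim=0)[:, visual_idx].sum(dim=0)  # attention mass per visual token
    k = max(1, int(keep_ratio * visual_idx.numel()))
    kept = visual_idx[torch.topk(importance, k).indices]
    return kept  # positions of visual tokens retained in subsequent layers
```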