Can Multimodal Models Ditch Frame Sampling? LLaVA‑OneVision‑2.0’s Codec‑Stream

LLaVA‑OneVision‑2.0 replaces uniform frame sampling with a codec‑stream visual unit, integrates a OneVision‑Encoder that tokenizes video as state‑plus‑incremental evidence, and demonstrates consistent gains on 18 video, 11 spatial‑reasoning and 4 tracking benchmarks while open‑sourcing its model, data and code.

Machine Learning Algorithms & Natural Language Processing

Jun 3, 2026

Foundation: Codec as a Visual Prior

Video codecs were designed under strict transmission and storage constraints. They keep reference frames as context and, for P/B frames, encode only motion vectors and residuals, which effectively separates a video into context plus incremental evidence. This predictive structure can be treated as an external prior that tells a model to spend tokens on real changes and skip predictable background.

Core belief: only the incremental evidence that forces a model to revise its prediction should consume token budget.

Why a Codec‑Native Input?

Natural video contains long periods of static background or slow drift. Uniform frame sampling wastes tokens on predictable content, while the codec provides an event radar: I‑frames establish context and P/B frames record motion vectors and residuals, which are the true evidence of change. The OneVision‑Encoder aligns tokenization with this structure, shifting the modeling goal from “average pixel viewing” to “explain change on top of state”.
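To make "explain change on top of state" concrete, here is a minimal sketch. It assumes plain pixel differences as a stand-in for a codec's real motion vectors and residuals, and the helper names (`residual_energy`, `changed_patches`) are hypothetical, not part of the released code. It only shows that a mostly static frame yields very little evidence worth tokenizing.

```python
def residual_energy(reference, frame, patch=2):
    """Sum of absolute pixel differences per patch (a crude residual proxy)."""
    rows, cols = len(frame), len(frame[0])
    energy = {}
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            total = 0
            for dr in range(patch):
                for dc in range(patch):
                    if r + dr < rows and c + dc < cols:
                        total += abs(frame[r + dr][c + dc] - reference[r + dr][c + dc])
            energy[(r // patch, c // patch)] = total
    return energy


def changed_patches(reference, frame, patch=2, threshold=0):
    """Patch coordinates whose residual energy exceeds the threshold."""
    energy = residual_energy(reference, frame, patch)
    return sorted(pos for pos, value in energy.items() if value > threshold)


# A 4x4 static background in which a single pixel changes.
reference = [[10] * 4 for _ in range(4)]
frame = [row[:] for row in reference]
frame[3][2] = 50

evidence = changed_patches(reference, frame)  # only the bottom-right patch
```

Under this toy setup, three of the four patches carry no new information, so only one would need to be spent on tokens.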

Architecture and Unified Visual Interface

The OneVision‑Encoder accepts three evidence forms—uniformly sampled frames, codec streams, and native‑resolution images—and encodes them into 3‑D RoPE visual tokens. A lightweight MLP projects these tokens into the Qwen3‑8B autoregressive decoder, enabling a single model to process static images and to follow temporal change cues in video.
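One way to picture the unified interface is that every visual token carries a (t, h, w) index, which is the kind of coordinate a 3‑D rotary embedding would consume. The sketch below is an assumption about the indexing only; it does not implement RoPE or the projector, and the function names are hypothetical. A still image is treated as a video with one time step, and sparse codec tokens are assumed to keep their original coordinates.

```python
from itertools import product


def dense_position_ids(num_frames, grid_h, grid_w):
    """One (t, h, w) index per patch token, in raster order."""
    return list(product(range(num_frames), range(grid_h), range(grid_w)))


def sparse_position_ids(selected):
    """selected maps a time step to the (h, w) patches kept at that step."""
    return [(t, h, w) for t in sorted(selected) for (h, w) in selected[t]]


image_ids = dense_position_ids(1, 2, 3)   # still image: a single time step
video_ids = dense_position_ids(4, 2, 3)   # uniformly sampled frames
codec_ids = sparse_position_ids({0: [(0, 0), (0, 1)], 3: [(1, 2)]})
```

Because all three inputs reduce to a list of indexed tokens, the same downstream projection and decoder can, in principle, consume them without format-specific branches.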

Codec‑Stream Tokenization Pipeline

GOP Partition: use P/B‑frame packet byte size (energy) to locate event peaks; dense regions receive short groups, stable regions receive long groups.

Scoring: fuse motion‑energy, residual‑energy and patch‑level bitrate priors into a per‑patch fused score.

Block Selection: select the smallest unit (2×2 patch) based on the fused score, avoiding merging unrelated areas.

Canvas Packing: output one I‑canvas and several P‑canvases per GOP, forming a compact canvas sequence that serves as the token stream.
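The GOP partition step can be sketched with a simple greedy rule. This is an illustrative assumption, not the authors' algorithm: a group is closed as soon as adding the next frame would push its accumulated packet bytes over a budget, or once it reaches a maximum length. The packet sizes are hard-coded here; in practice they would come from the video container. `partition_gops`, `energy_budget` and `max_len` are hypothetical names.

```python
def partition_gops(packet_bytes, energy_budget, max_len=16):
    """Group frame indices so that high-byte (busy) frames end up in short groups."""
    groups, current, acc = [], [], 0
    for idx, size in enumerate(packet_bytes):
        # Close the running group before it overflows the budget or length cap.
        if current and (acc + size > energy_budget or len(current) >= max_len):
            groups.append(current)
            current, acc = [], 0
        current.append(idx)
        acc += size
    if current:
        groups.append(current)
    return groups


# Quiet start, a burst of activity at frames 6-8, then quiet again.
packet_bytes = [5, 5, 5, 5, 5, 5, 90, 120, 110, 5, 5, 5]
groups = partition_gops(packet_bytes, energy_budget=100, max_len=8)
# groups == [[0, 1, 2, 3, 4, 5], [6], [7], [8], [9, 10, 11]]
```

The quiet stretches collapse into long groups while each busy frame gets its own, which mirrors the "short groups for dense regions" behavior described in the list.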

This stream‑level, bit‑cost‑aware tokenization replaces fixed‑GOP or uniform sampling, allowing token density to follow event intensity.
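The scoring and block selection steps might look like the following sketch. Several things here are assumptions: the priors are max-normalized and combined with equal weights, the selection unit is read as a block of 2×2 patches, and blocks are ranked by summed score. The source does not specify any of these details, and the function names are hypothetical.

```python
def normalize(values):
    """Scale non-negative values to [0, 1]; an all-zero list stays zero."""
    peak = max(values)
    return [v / peak if peak > 0 else 0.0 for v in values]


def fused_scores(motion, residual, bitrate, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three normalized per-patch priors."""
    m, r, b = normalize(motion), normalize(residual), normalize(bitrate)
    wm, wr, wb = weights
    total = wm + wr + wb
    return [(wm * mi + wr * ri + wb * bi) / total for mi, ri, bi in zip(m, r, b)]


def select_blocks(scores, grid_w, keep, block=2):
    """Rank non-overlapping block x block patch groups and keep the top `keep`."""
    grid_h = len(scores) // grid_w
    ranked = []
    for top in range(0, grid_h, block):
        for left in range(0, grid_w, block):
            total = sum(
                scores[(top + dr) * grid_w + left + dc]
                for dr in range(block)
                for dc in range(block)
                if top + dr < grid_h and left + dc < grid_w
            )
            ranked.append((total, (top // block, left // block)))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [pos for _, pos in ranked[:keep]]


# 4x4 patch grid, flattened row-major; the activity sits in the bottom-right block.
motion = [0] * 16
motion[10], motion[11] = 4, 2
residual = [0] * 16
residual[10], residual[0] = 8, 1
bitrate = [1] * 16
bitrate[10] = 10

scores = fused_scores(motion, residual, bitrate)
kept = select_blocks(scores, grid_w=4, keep=2)  # [(1, 1), (0, 0)]
```

Selecting small blocks independently, rather than one large bounding region, is what keeps unrelated areas from being merged into the kept set.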

Training Procedure

Training proceeds in four stages, gradually increasing the frame budget and introducing longer video subtitles:

Stage 1: 85 M image‑text pairs + 4.2 M 30‑second video subtitles (max 30 frames, uniform sampling).

Stage 2: add 22 M instruction data, 24 M FineVision data, 2.7 M 30–60 s subtitles, 0.7 M 60–180 s subtitles (max 90 frames).

Stage 3: add 350 K 10–15 min subtitles (max 384 frames).

Stage 4: enable codec‑stream tokenization for long videos (384 / 768‑frame densities) and incorporate spatial‑reasoning and tracking data.

Each training step uses approximately 50 % codec video, 37.5 % uniformly sampled video, and 12.5 % images, teaching the model to handle multiple visual evidence formats.
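As a small worked example of that mixture, the sketch below splits a batch into per-format counts using largest-remainder rounding. Whether the real pipeline fixes the mix per batch or samples it stochastically is not stated in the source, so the per-batch split, the `MIX` dictionary keys and `split_batch` are all illustrative assumptions.

```python
# Shares taken from the text: 50 % codec video, 37.5 % uniform video, 12.5 % images.
MIX = {"codec_video": 0.5, "uniform_video": 0.375, "image": 0.125}


def split_batch(batch_size, mix=MIX):
    """Allocate batch slots per format; leftover slots go to the largest remainders."""
    exact = {name: batch_size * share for name, share in mix.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(mix, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


counts_16 = split_batch(16)  # {'codec_video': 8, 'uniform_video': 6, 'image': 2}
counts_10 = split_batch(10)  # {'codec_video': 5, 'uniform_video': 4, 'image': 1}
```

With a batch of 16 the shares divide evenly; with other sizes the rounding keeps the total exact while staying as close to the stated proportions as possible.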

Evaluation Results

On 18 video‑understanding tasks the OV‑2‑8B model averages 62.5, on 11 spatial‑reasoning tasks it averages 63.5, and on 4 tracking tasks it averages 48.0 J&F. The gains stem from directing token budget toward motion, occlusion, viewpoint change and event boundaries rather than background.

When the frame budget is tight, codec‑stream input improves temporal localization by up to +9.7 points and can compress 128 k patches to 16 k (an 8× reduction) while preserving key moments. Under low frame budgets, QVHighlights improves by 15.4 points, and the model also shows a large advantage on the high‑frequency JumpScore benchmark.

JumpScore Benchmark

JumpScore requires the model to identify the exact occurrence order of highly repetitive actions (e.g., each rope‑crossing in a jump‑rope video). LLaVA‑OneVision‑2.0 achieves 74.9 mAP, demonstrating that codec‑stream tokenization helps retain fine‑grained temporal memory in near‑identical sequences.

Open Resources

Technical report: https://arxiv.org/abs/2605.25979

GitHub repository: https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2

Model checkpoint: https://huggingface.co/lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct

Dataset: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-2-Data

Project page: https://evolvinglmms-lab.github.io/LLaVA-OneVision-2

Conclusion

Treating the codec stream as a visual prior lets multimodal models focus computation on genuine changes, improving video understanding, spatial reasoning and tracking without increasing token budget. The code, model weights and large‑scale video‑subtitle data are fully open, enabling further research on longer‑context modeling, streaming perception and next‑generation visual‑language intelligence.
