InfoTok: Information-Theoretic Adaptive Video Tokenizer Redefines Efficient Tokenization (ICLR 2026 Oral)

InfoTok, a collaborative effort by Stanford, NVIDIA Cosmos, and NUS, leverages information theory and an ELBO‑based router to allocate tokens adaptively, achieving 2.3× higher compression, 11× faster inference, and superior reconstruction quality on benchmarks such as TokenBench and DAVIS.

Machine Heart

Motivation

Current visual tokenizers use a fixed compression rate, allocating the same number of tokens to static and complex scenes alike; this wastes computation on predictable content and under-allocates capacity where information density is high. A good video tokenizer should provide high compression, high fidelity, and semantic richness.

Theoretical Foundation

InfoTok draws on Shannon's source coding theorem: predictable (low‑information) content requires fewer bits, while rare, surprising content needs more. The optimal adaptive tokenizer therefore assigns each video a token budget proportional to its information content −log p(x), so likely videos receive fewer tokens and unlikely ones receive more. This principle is analogous to Huffman coding, where frequent symbols get shorter codes.
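The −log p(x) principle can be made concrete with a toy source. In this minimal sketch (the symbols and probabilities are illustrative, not from the paper), the optimal code length for a symbol of probability p is −log₂(p) bits, and the expected length equals the source entropy:

```python
import math

# Hypothetical "scene types" with illustrative probabilities (not from the paper).
probs = {"static_sky": 0.5, "slow_pan": 0.25, "fast_motion": 0.125, "scene_cut": 0.125}

# Shannon: the optimal code length for a symbol with probability p is -log2(p) bits.
code_lengths = {s: -math.log2(p) for s, p in probs.items()}

for symbol, bits in code_lengths.items():
    print(f"{symbol}: {bits:.0f} bits")

# The expected code length under the optimal code equals the entropy H(X).
entropy = sum(-p * math.log2(p) for p in probs.values())
print(f"entropy: {entropy} bits/symbol")  # 1.75 bits/symbol
```

The frequent `static_sky` symbol gets a 1‑bit code while the rare `scene_cut` needs 3 bits; adaptive tokenization applies the same logic at the level of token budgets per video.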

Method

InfoTok introduces two plug‑in components built on top of any fixed‑rate tokenizer (e.g., NVIDIA’s Cosmos tokenizer):

ELBO Router: Uses the evidence lower bound (ELBO) of a pretrained tokenizer as a cheap proxy for predictability, deciding the token budget Nₓ for each video.

Adaptive Compressor: A transformer‑based module that packs the fixed‑length embeddings into a variable‑length token sequence of length Nₓ, omitting low‑information positions and concentrating information when the budget is tight.

The router’s formula (shown in the accompanying diagram) includes a β parameter that controls the average compression level. Because the ELBO comes from the pretrained tokenizer itself, it can be computed without any extra models, keeping the approach lightweight.
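Since the paper's exact router formula is not reproduced in this article, the sketch below shows only the general shape of such a router: treat −ELBO as a surprisal score, squash it to (0, 1), and interpolate between a minimum and maximum token budget, with β shifting the average compression level. The function name, bounds, and squashing choice are all hypothetical:

```python
import math

def token_budget(elbo: float, beta: float, n_min: int = 64, n_max: int = 512) -> int:
    """Hypothetical sketch of an ELBO-based router (not the paper's formula).

    A higher ELBO means the pretrained tokenizer finds the video more
    predictable, so it receives a smaller token budget N_x; beta shifts
    the average compression level.
    """
    # Treat -ELBO as a surprisal score and squash it to (0, 1) with a sigmoid.
    score = 1.0 / (1.0 + math.exp(-(-elbo - beta)))
    # Interpolate between the minimum and maximum budget.
    return n_min + round(score * (n_max - n_min))
```

Under this sketch, a highly predictable video (high ELBO) lands near `n_min`, a surprising one near `n_max`, and raising β pushes every budget downward, increasing the average compression rate.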

Experiments

Qualitative visualizations demonstrate that InfoTok allocates more tokens to dynamic, information‑rich regions and compresses static areas. Quantitative evaluation on TokenBench and DAVIS compares two InfoTok variants (fixed ELBO router and flexible router) against a fixed‑rate baseline and the heuristic ElasticTok.

Key results:

InfoTok saves ~20% of tokens while preserving reconstruction quality.

At 2.3× compression it outperforms ElasticTok across all metrics (PSNR↑, LPIPS↓, FVD↓).

Inference speed is up to 11× faster than comparable adaptive schemes.

Conclusion and Outlook

InfoTok shows that classic information‑theoretic principles can dramatically improve AI efficiency. Future directions include extending the framework to continuous latent spaces, integrating adaptive token depth into video generation pipelines, and applying the approach to images, 3D scenes, and multimodal data.

Tags: computer vision, information theory, adaptive compression, ICLR 2026, InfoTok, ELBO, video tokenization
Written by Machine Heart, a professional AI media and industry service platform.