FlexTok: Reconstruct Images with as Few as 8 Tokens – Variable‑Length Tokenizer Beats TiTok

FlexTok is a flexible‑length 1‑D image tokenizer that resamples an image into anywhere from 1 to 256 discrete tokens. It reports better reconstruction quality (FID) and better autoregressive generation quality than TiTok, which is attributed to nested random dropout, causal attention masks and a flow‑based decoder. Experiments are run on ImageNet and DFN.

AIWalker

Overview

FlexTok projects a 2‑D image into an ordered 1‑D token sequence whose length can vary from 1 to 256 tokens. Even with as few as eight tokens it yields usable reconstructions, and at comparable token budgets its reconstruction quality surpasses TiTok.

Method

FlexTok consists of a Vision Transformer (ViT) encoder with register tokens, a finite‑scalar quantizer (FSQ) that discretises the registers, and a flow‑based decoder that reconstructs the image from any subset of tokens.
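To make the quantisation step concrete, here is a minimal sketch of finite‑scalar quantisation applied to a single register vector. The function names and the level configuration (5, 5, 5) are illustrative assumptions; the article does not state which FSQ settings FlexTok uses. Odd level counts are chosen only to keep the rounding symmetric around zero.

```python
import math

def fsq_quantize(z, levels):
    """Quantise one register vector to integer codes, one per dimension.

    z      : list of floats (hypothetical encoder output for one register)
    levels : list of odd ints, number of quantisation levels per dimension
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half  # squash into [-half, half]
        codes.append(round(bounded))       # snap to the nearest integer level
    return codes

def codes_to_index(codes, levels):
    """Flatten per-dimension codes into a single token id (mixed radix)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)  # shift code to 0..n_levels-1
    return index

levels = [5, 5, 5]                       # assumed; implies 5 * 5 * 5 = 125 token ids
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token_id = codes_to_index(codes, levels)
```

The implicit codebook size is the product of the per‑dimension level counts, so no codebook has to be learned or stored.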

1. 1‑D Tokenisation with Flow Decoder

The encoder maps VAE latent blocks to register tokens, which serve as a bottleneck, and FSQ quantises these registers into discrete tokens. The decoder receives the quantised tokens together with a noised VAE latent block and predicts the flow that removes the noise; it is trained by minimising a flow loss. Adding a REPA loss between the decoder’s intermediate layers and DINOv2‑L features (Yu et al., 2023) speeds convergence and improves downstream generation (see Table 3).
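The flow objective can be illustrated with a small sketch. It assumes the common rectified‑flow convention: a straight‑line path from noise to the clean latent, with the constant velocity along that path as the regression target. The article only says the decoder "predicts a flow", so the exact path, the time convention and all names below are assumptions.

```python
def make_flow_training_pair(clean, noise, t):
    """Build the decoder input and regression target for one time step t in [0, 1].

    clean : list of floats, a flattened VAE latent block (hypothetical)
    noise : list of floats, Gaussian noise of the same length
    """
    # Point on the straight line from noise (t = 0) to the clean latent (t = 1).
    noisy = [(1 - t) * n + t * c for c, n in zip(clean, noise)]
    # Velocity along that line; it does not depend on t.
    target = [c - n for c, n in zip(clean, noise)]
    return noisy, target

def flow_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

noisy, target = make_flow_training_pair(clean=[1.0, 2.0], noise=[0.0, 0.0], t=0.5)
loss_if_perfect = flow_loss(target, target)
```

In the real model the prediction would come from the decoder, conditioned on `noisy`, `t` and the retained quantised tokens; here the target is reused as the prediction only to show that a perfect prediction gives zero loss.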

2. Learning Variable‑Length Ordered Tokens

Unlike TiTok, FlexTok does not fix the token count. Nested random dropout (Rippel et al., 2014; Kusupati et al., 2022) randomly discards a suffix of the register tokens during training, forcing the encoder to compress image information hierarchically. Two dropout strategies are explored: uniform sampling of the retained token count, and uniform sampling from an exponentially‑spaced set to avoid starving the last tokens of gradient updates.
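The two sampling strategies can be sketched as follows. Powers of two are used here as one plausible reading of "exponentially‑spaced set"; that choice, the function names and plain truncation as the dropout mechanism are assumptions rather than details confirmed by the article.

```python
import random

def sample_keep_count(max_tokens=256, schedule="pow2", rng=random):
    """Sample how many leading register tokens to keep for one training step."""
    if schedule == "uniform":
        return rng.randint(1, max_tokens)        # any count from 1 to max_tokens
    if schedule == "pow2":
        choices = []
        k = 1
        while k <= max_tokens:                   # 1, 2, 4, ..., max_tokens
            choices.append(k)
            k *= 2
        return rng.choice(choices)
    raise ValueError(f"unknown schedule: {schedule}")

def nested_dropout(tokens, keep):
    """Drop the suffix: only the first `keep` tokens reach the decoder."""
    return tokens[:keep]

rng = random.Random(0)
registers = list(range(256))                     # stand-in for 256 register tokens
kept = nested_dropout(registers, sample_keep_count(256, "pow2", rng))
```

Under these assumptions the last token survives only when all 256 tokens are kept. That happens with probability 1/256 under uniform sampling, but with probability 1/9 when sampling from the nine powers of two up to 256, which is why the exponentially spaced schedule gives the late tokens many more gradient updates.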

3. Causal Attention Mask

A causal mask is optionally applied to the registers, enforcing a strict left‑to‑right dependency so that earlier tokens capture coarse concepts and later tokens add finer details. This mask also enables efficient encoding when the desired token budget is known in advance.
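A small sketch of the register‑to‑register mask shows why a known token budget makes encoding cheaper. It covers only attention among registers; how registers attend to image patches is not described in the article and is left out. The function names are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when register i may attend to register j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def top_left(mask, k):
    """Restrict a mask to the first k registers."""
    return [row[:k] for row in mask[:k]]

full = causal_mask(256)
budget = 8
# The first 8 registers never look at registers 9..256, so the sub-mask
# they see is exactly the mask of an 8-register model.
same_as_small = top_left(full, budget) == causal_mask(budget)
```

Because the first k registers never depend on later ones, an encoder run with only k registers produces the same first k tokens as a full 256‑register run, so the remaining registers need not be computed.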

4. Autoregressive Generation

FlexTok tokens are fed to a GPT‑style autoregressive Transformer (LlamaGen‑inspired) for both class‑conditional (ImageNet‑1k) and text‑conditional (DFN) generation. Learned absolute positional embeddings replace 2‑D RoPE because the token stream is 1‑D. The model scales from 1 B to 3 B parameters; larger models improve alignment for long token sequences but have little impact on the first few tokens.
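Generation under a token budget can be sketched as an ordinary autoregressive loop that stops early. The `toy_logits` function below is a made‑up stand‑in for the Transformer, and greedy decoding is used only to keep the example deterministic; neither reflects the sampling setup of the actual model.

```python
def generate_prefix(next_logits, condition, budget):
    """Greedily generate `budget` tokens, one at a time, left to right."""
    tokens = []
    for _ in range(budget):
        logits = next_logits(condition, tokens)
        # Pick the highest-scoring token id.
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

def toy_logits(condition, tokens, vocab_size=8):
    """Hypothetical model: favours token (condition + position) mod vocab_size."""
    favoured = (condition + len(tokens)) % vocab_size
    return [1.0 if i == favoured else 0.0 for i in range(vocab_size)]

coarse = generate_prefix(toy_logits, condition=3, budget=4)
fine = generate_prefix(toy_logits, condition=3, budget=8)
```

With deterministic decoding, the short sequence is a prefix of the long one. Any such prefix can be handed to the flow decoder, which is what lets one model trade compute for detail.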

Experiments & Results

Reconstruction quality is measured by rFID, MAE and DreamSim on ImageNet‑1k validation. FlexTok achieves reasonable reconstructions with a single token and approaches state‑of‑the‑art quality with 8‑128 tokens, outperforming TiTok at comparable token budgets (see Fig. 5).

Generation experiments show a “visual vocabulary” effect: as more tokens are predicted, the image becomes increasingly specific. Simple prompts (e.g., “a red apple”) need only 4‑16 tokens, while complex prompts (e.g., “a rocket ship graffiti”) require the full 256‑token budget. Alignment metrics (DINOv2‑L top‑1 accuracy for class‑conditional, CLIPScore for text‑conditional) improve with token count, while generative FID remains stable (Fig. 7, Fig. 9).

The system‑level comparison (Table 1) shows FlexTok outperforming the baselines across token budgets, with higher‑fidelity reconstructions and better text‑image alignment, even though a single model serves every budget.

Conclusion

FlexTok shows that a flexible‑length 1‑D tokenizer can compress images into a hierarchical token stream, enabling high‑quality reconstruction with very few tokens and progressive, coarse‑to‑fine autoregressive generation. The approach opens research directions for adaptive compute budgets in image synthesis.

System‑level comparison (Table 1)