FlexTok: Reconstruct Images with as Few as 8 Tokens – Variable‑Length Tokenizer Beats TiTok
FlexTok is a flexible‑length 1‑D image tokenizer that resamples an image into anywhere from 1 to 256 discrete tokens. It achieves better reconstruction quality (rFID) and autoregressive generation quality than TiTok, thanks to nested random dropout, causal masks and a flow‑based decoder. Experiments are run on ImageNet and DFN.
Overview
FlexTok projects a 2‑D image into an ordered 1‑D token sequence whose length can vary from 1 to 256 tokens. With as few as eight tokens, the method is reported to attain reconstruction quality that surpasses TiTok while using far fewer tokens.
Method
FlexTok consists of a Vision Transformer (ViT) encoder with register tokens, a finite‑scalar quantizer (FSQ) that discretises the registers, and a flow‑based decoder that reconstructs the image from any subset of tokens.
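To make the quantisation step concrete, here is a minimal sketch of FSQ‑style discretisation for a single register vector. It is a simplified illustration, not the authors' implementation: the level configuration `[5, 5, 5]`, the function names, and the tanh‑then‑round bounding are assumptions, and the straight‑through gradient used in training is omitted.

```python
import math

def fsq_codes(z, levels):
    """Snap each channel of z to one of `levels[i]` integer codes (0 .. levels[i]-1)."""
    codes = []
    for value, n_levels in zip(z, levels):
        unit = (math.tanh(value) + 1.0) / 2.0             # bound the channel to (0, 1)
        codes.append(int(round(unit * (n_levels - 1))))   # nearest grid point
    return codes

def codes_to_token(codes, levels):
    """Combine per-channel codes into a single token id (mixed-radix index)."""
    token, stride = 0, 1
    for code, n_levels in zip(codes, levels):
        token += code * stride
        stride *= n_levels
    return token

levels = [5, 5, 5]             # hypothetical configuration: 5 * 5 * 5 = 125 possible tokens
register = [0.0, 10.0, -10.0]  # toy register vector, one value per channel
codes = fsq_codes(register, levels)      # [2, 4, 0]
token = codes_to_token(codes, levels)    # 22
```

The vocabulary size is simply the product of the per‑channel levels, so no learned codebook is needed.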
1. 1‑D Tokenisation with Flow Decoder
The encoder maps VAE latent blocks to register tokens, which serve as a bottleneck. FSQ quantises these registers into discrete tokens. The decoder receives the quantised tokens together with a noisy VAE latent block and predicts the flow; it is trained by minimising a flow loss. Adding a REPA loss between the decoder’s intermediate layers and DINOv2‑L features (Yu et al., 2024) speeds convergence and improves downstream generation (see Table 3).
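The decoder objective can be sketched as follows. The summary does not spell out the flow formulation, so this example assumes a rectified‑flow style setup: a straight‑line path between the clean latent and Gaussian noise, with the velocity along that path as the regression target. All names are hypothetical, and the conditioning on the quantised tokens is left out.

```python
import random

def noisy_latent_and_target(clean, noise, t):
    """Assumed path: x_t = (1 - t) * clean + t * noise; target velocity = noise - clean."""
    x_t = [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]
    target = [n - c for c, n in zip(clean, noise)]
    return x_t, target

def flow_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

rng = random.Random(0)
clean = [0.5, -1.0, 2.0]                       # toy stand-in for a VAE latent block
noise = [rng.gauss(0.0, 1.0) for _ in clean]   # Gaussian noise of the same shape
t = rng.random()                               # time step sampled in [0, 1)
x_t, target = noisy_latent_and_target(clean, noise, t)

# A real decoder would predict the velocity from (x_t, t, tokens);
# here the target itself stands in as a perfect prediction.
perfect = flow_loss(target, target)
```

In training, the decoder's prediction would replace the stand‑in and the loss would be back‑propagated through decoder, quantiser and encoder.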
2. Learning Variable‑Length Ordered Tokens
Unlike TiTok, FlexTok does not fix the token count. Nested random dropout (Rippel et al., 2014; Kusupati et al., 2022) randomly discards a suffix of the register tokens during training, forcing the encoder to compress image information hierarchically. Two dropout strategies are explored: sampling the retained token count uniformly over all counts, or sampling it uniformly from an exponentially‑spaced set of counts so that the last tokens are not starved of gradient updates.
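The two sampling strategies can be illustrated with a short sketch. The function names are hypothetical, and the exponentially‑spaced set is assumed here to be the powers of two up to the maximum length; the summary only says the set is exponentially spaced.

```python
import random

def sample_keep_count(max_tokens, strategy, rng):
    """Sample how many leading register tokens to keep for one training example."""
    if strategy == "uniform":
        return rng.randint(1, max_tokens)      # any count from 1 to max_tokens
    if strategy == "exponential":
        choices, k = [], 1
        while k <= max_tokens:                 # assumed set: 1, 2, 4, ..., max_tokens
            choices.append(k)
            k *= 2
        return rng.choice(choices)
    raise ValueError(f"unknown strategy: {strategy}")

def nested_dropout(tokens, keep):
    """Keep the first `keep` tokens and discard the suffix."""
    return tokens[:keep]

rng = random.Random(0)
tokens = list(range(256))                      # stand-in for 256 ordered register tokens
keep = sample_keep_count(len(tokens), "exponential", rng)
kept = nested_dropout(tokens, keep)
```

Under these assumptions, the 256th token survives only when the full length is drawn: with probability 1/256 under uniform sampling, but 1/9 under the power‑of‑two set, which has nine members. That gap is one way to see why the second strategy gives the last tokens more gradient updates.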
3. Causal Attention Mask
A causal mask is optionally applied to the registers, enforcing a strict left‑to‑right dependency so that earlier tokens capture coarse concepts and later tokens add finer details. This mask also enables efficient encoding when the desired token budget is known in advance.
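A small sketch shows why the causal mask makes a known token budget cheap to encode. It covers only register‑to‑register attention; how registers attend to image patches is not described in this summary and is left out. The function name is hypothetical.

```python
def causal_register_mask(num_registers):
    """mask[i][j] is True when register i may attend to register j (only j <= i)."""
    return [[j <= i for j in range(num_registers)] for i in range(num_registers)]

full = causal_register_mask(4)
small = causal_register_mask(2)

# The first two rows and columns of the full mask equal the mask for a
# two-register encoder: early registers never look at later ones.
prefix = [row[:2] for row in full[:2]]
same = prefix == small   # True
```

Because the first k registers never depend on later ones, an encoder that knows it needs only k tokens can process just those k registers and obtain the same prefix it would get from the full sequence.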
4. Autoregressive Generation
FlexTok tokens are fed to a GPT‑style autoregressive Transformer (LlamaGen‑inspired) for both class‑conditional (ImageNet‑1k) and text‑conditional (DFN) generation. Learned absolute positional embeddings replace 2‑D RoPE because the token stream is 1‑D. The model scales from 1 B to 3 B parameters; larger models improve alignment for long token sequences but have little impact on the first few tokens.
Experiments & Results
Reconstruction quality is measured by rFID, MAE and DreamSim on ImageNet‑1k validation. FlexTok achieves reasonable reconstructions with a single token and approaches state‑of‑the‑art quality with 8‑128 tokens, outperforming TiTok at comparable token budgets (see Fig. 5).
Generation experiments show a “visual vocabulary” effect: as more tokens are predicted, the image becomes increasingly specific. Simple prompts (e.g., “a red apple”) need only 4‑16 tokens, while complex prompts (e.g., “a rocket ship graffiti”) require the full 256‑token budget. Alignment metrics (DINOv2‑L top‑1 accuracy for class‑conditional, CLIPScore for text‑conditional) improve with token count, while generative FID remains stable (Fig. 7, Fig. 9).
System‑level comparison (Table 1) shows that FlexTok outperforms the baselines across token budgets, delivering higher‑fidelity reconstructions and better text‑image alignment even though a single model serves every budget.
Conclusion
FlexTok proves that a flexible‑length 1‑D tokenizer can compress images into a hierarchical token stream, enabling high‑quality reconstruction with very few tokens and progressive, fine‑grained autoregressive generation. The approach opens research directions for adaptive compute budgets in image synthesis.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
