FlexTok Achieves High‑Quality Visual Reconstruction with as Few as 8 Tokens, Outperforming TiTok

FlexTok introduces a variable‑length 1‑D image tokenizer that can reconstruct images with as few as eight tokens, surpasses TiTok in FID and MAE across multiple token budgets, and serves as a hierarchical visual vocabulary for autoregressive image generation.

AIWalker

Feb 22, 2025

Introduction

Recent advances in image generation have shown that autoregressive (AR) models can scale to billions of parameters, but tokenization remains a bottleneck. Traditional 2‑D grid tokenizers assign the same number of tokens to every image region and therefore encode a lot of redundant information. 1‑D approaches such as TiTok reduce this overhead but use a fixed token count, which limits their ability to adapt to image complexity. FlexTok, introduced in the paper "FlexTok: Resampling Images into 1D Token Sequences of Flexible Length", addresses this by projecting a 2‑D image into an ordered, variable‑length sequence of discrete tokens.

Method and Model

FlexTok is an autoencoder with a discrete 1‑D bottleneck that operates on the latents of a pretrained VAE. A Vision Transformer (ViT) encoder maps image patches onto a set of learnable register tokens. These registers are quantized by a Finite Scalar Quantizer (FSQ) and serve as conditioning for a rectified flow decoder. The decoder receives the quantized registers concatenated with noised VAE latents and predicts a flow that reconstructs the image. Nested random dropout (Rippel et al., 2014) is applied to the register sequence during training, which forces the encoder to learn an ordered compression: early tokens capture high‑level semantics and later tokens add finer details.
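The flow‑based decoder objective can be illustrated with a minimal sketch. It assumes a rectified‑flow style formulation with linear interpolation between a clean latent and Gaussian noise; the article does not spell out the exact parameterization, so the time convention, the function names, and the toy 4‑element latent are all assumptions made for illustration.

```python
import random


def noised_latent(x0, eps, t):
    """Linear interpolation between clean latent x0 (t=0) and noise eps (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def velocity_target(x0, eps):
    """Constant velocity along the straight path from x0 to eps."""
    return [b - a for a, b in zip(x0, eps)]


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # stand-in for a clean VAE latent
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]    # Gaussian noise sample
t = 0.3

x_t = noised_latent(x0, eps, t)   # what the decoder would see, next to the registers
v = velocity_target(x0, eps)      # what the decoder would be trained to predict

# With a perfect velocity prediction, stepping back along the path recovers x0.
recovered = [xt - t * vi for xt, vi in zip(x_t, v)]
```

In this toy setting the quantized registers would enter only as conditioning for the network that predicts `v`; that network is omitted here.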

Variable‑Length Ordered Tokenization

Unlike TiTok and ALIT, which require a separate tokenizer for each token budget, FlexTok can operate with any number of tokens between 1 and 256. Simple images may be represented with as few as 32 tokens, while complex scenes need more. Two nested‑dropout strategies are explored: (1) uniformly sampling the number of tokens to keep, and (2) sampling from an exponentially growing set (e.g., {1,2,4,8,…,256}) to avoid starving the decoder of later‑stage tokens.
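The two sampling strategies can be sketched as follows. This is an illustrative sketch rather than the authors' training code: the function names are hypothetical, and the `mask_value` placeholder stands in for however the real model represents dropped registers.

```python
import random

MAX_TOKENS = 256  # maximum register count stated in the article


def sample_keep_uniform(rng, max_tokens=MAX_TOKENS):
    """Strategy 1: draw the number of kept tokens uniformly from 1..max_tokens."""
    return rng.randint(1, max_tokens)


def sample_keep_pow2(rng, max_tokens=MAX_TOKENS):
    """Strategy 2: draw the number of kept tokens from {1, 2, 4, ..., max_tokens}."""
    choices = []
    k = 1
    while k <= max_tokens:
        choices.append(k)
        k *= 2
    return rng.choice(choices)


def nested_dropout(tokens, keep, mask_value=None):
    """Keep the first `keep` tokens and mask out every token after them."""
    return tokens[:keep] + [mask_value] * (len(tokens) - keep)


rng = random.Random(0)
registers = list(range(8))          # toy register sequence
truncated = nested_dropout(registers, keep=3)
```

Because only a suffix is ever dropped, the first token is present in every training step while the last one is seen only when the full budget is sampled, which is what pushes the most important information toward the front of the sequence.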

Causal Attention Mask

A causal mask can be added to the registers so that token i may attend only to tokens j ≤ i. This enforces a strict left‑to‑right dependency, matching the AR generation order and allowing users to pre‑specify a maximum token budget.
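A minimal sketch of such a mask is shown below. It covers only register‑to‑register attention; how registers attend to image patches is not described in this summary and is left out. The helper name `causal_mask` is hypothetical.

```python
def causal_mask(n):
    """Return an n x n boolean matrix; mask[i][j] is True when register i
    may attend to register j, which is the case exactly when j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def top_left(mask, k):
    """Restrict a mask to its first k rows and columns."""
    return [row[:k] for row in mask[:k]]


full = causal_mask(6)
prefix = top_left(full, 3)
```

The top‑left k × k block of the full mask is itself the causal mask for k registers, so the first k tokens never depend on tokens that are later truncated.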

Autoregressive Image Generation

FlexTok token sequences are fed to a GPT‑style Transformer trained on ImageNet‑1k (category‑conditional) and DFN‑2B (text‑conditional). The model predicts tokens sequentially, gradually refining the image from coarse concepts (e.g., “car present”) to fine details (e.g., “red sedan, chrome rims”). Experiments show that with only 8–16 tokens the model already produces recognizable images, and quality improves steadily up to 256 tokens.
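The token‑budget idea can be sketched with a toy decoding loop. Here `toy_next_token` is a hypothetical, deterministic stand‑in for the AR Transformer (a real model would sample from predicted logits), and `generate_tokens` is an illustrative name, not part of any released API.

```python
VOCAB_SIZE = 64000  # approximate vocabulary size quoted in the article


def toy_next_token(prefix):
    """Hypothetical stand-in for the AR model: a deterministic function of the prefix."""
    return (31 * len(prefix) + 7 * sum(prefix) + 5) % VOCAB_SIZE


def generate_tokens(next_token, budget, max_tokens=256):
    """Generate exactly `budget` tokens, one at a time, left to right."""
    if not 1 <= budget <= max_tokens:
        raise ValueError("budget must be between 1 and max_tokens")
    tokens = []
    for _ in range(budget):
        tokens.append(next_token(tokens))
    return tokens


coarse = generate_tokens(toy_next_token, budget=8)
fine = generate_tokens(toy_next_token, budget=16)
```

With a deterministic model the 8‑token result is a prefix of the 16‑token one, which mirrors the coarse‑to‑fine behavior: stopping early yields a valid, coarser description of the same image, and the decoder can render it at any stopping point.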

Implementation Details

Training proceeds in three stages:

Stage 0 – VAE pre‑training: An SDXL‑style VAE (Rombach et al., 2022) is trained on DFN in 4‑, 8‑, and 16‑channel variants; the 16‑channel model with 8× down‑sampling is used downstream.

Stage 1 – FlexTok tokenizer: A Transformer encoder/decoder with up to 256 registers quantized by a 6‑dimensional FSQ (vocab ≈ 64 k). Nested dropout and causal masks are applied; the decoder uses adaLN‑zero conditioning and REPA loss (Yu et al., 2024b) to accelerate convergence.

Stage 2 – Autoregressive Transformer: A LlamaGen‑inspired AR model (Sun et al., 2024) with RMSNorm and SwiGLU, using learned absolute positional embeddings (no RoPE) predicts the FlexTok token stream. Parameter counts range from 1 B to 3 B for category‑conditional and up to 30 B for text‑conditional experiments.
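The FSQ vocabulary size from Stage 1 can be checked with a short worked example. The article gives only the dimensionality (6) and the approximate vocabulary (≈ 64 k); the per‑dimension level counts `[8, 8, 8, 5, 5, 5]` below are an assumption chosen because their product is exactly 64,000. The quantizer is a simplified equal‑width binning, not the exact rounding scheme of the FSQ paper.

```python
import math

# Assumed per-dimension level counts (not stated in the article).
LEVELS = [8, 8, 8, 5, 5, 5]


def vocab_size(levels):
    """Number of distinct codes: the product of the per-dimension level counts."""
    return math.prod(levels)


def quantize(z, levels):
    """Map each real-valued coordinate to an integer level in 0..L-1."""
    codes = []
    for value, num_levels in zip(z, levels):
        bounded = (math.tanh(value) + 1.0) / 2.0            # squash into [0, 1]
        code = min(int(bounded * num_levels), num_levels - 1)
        codes.append(code)
    return codes


def codes_to_index(codes, levels):
    """Mixed-radix encoding: combine per-dimension codes into a single token id."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + code
    return index


codes = quantize([0.0] * 6, LEVELS)
token_id = codes_to_index(codes, LEVELS)
```

Each register therefore becomes one integer token id in the range 0 to `vocab_size(LEVELS) - 1`, which is the stream the Stage 2 AR model predicts.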

Experiments and Results

FlexTok is evaluated on reconstruction (FID, MAE, DreamSim) and generation (CLIPScore, DINOv2‑L accuracy, gFID) across token budgets.

Variable‑Length Tokenization: With a single token FlexTok can already generate a plausible image; increasing tokens steadily lowers rFID and improves MAE. Figure 5 shows the rate‑distortion trade‑off for three model sizes (d12‑d12, d18‑d18, d24‑d24).

Coarse‑to‑Fine Generation: Category‑conditional models achieve high DINOv2‑L top‑1 accuracy with ~32 tokens, while text‑conditional models need up to 256 tokens for full CLIPScore saturation (Figure 7).

Token Budget vs Prompt Complexity: Simple prompts (“red apple”) require 4–16 tokens; detailed prompts (“rocket ship graffiti”) need the full 256‑token budget (Figure 8).

Scaling the AR Model: Larger AR models improve CLIPScore and gFID for long token sequences (> 64 tokens) but have little impact on early‑token generations (Figure 9).

System‑Level Comparison: Compared against a 2‑D grid tokenizer with a flow‑matching decoder, FlexTok consistently outperforms on FID for token budgets 2–128 and achieves higher CLIPScore at 256 tokens, despite a slight increase in global FID (Table 1, Figures 6‑9).

Conclusion

FlexTok demonstrates that a flexible‑length 1‑D tokenizer can compress images into a hierarchical visual vocabulary, enabling high‑fidelity reconstruction with as few as eight tokens and supporting progressive, coarse‑to‑fine autoregressive generation. The approach opens research directions for adaptive compute budgeting and more efficient image generation pipelines.
