Generating High‑Resolution Images with Only 64 Tokens: How MacTok Overcomes Posterior Collapse
MacTok introduces semantic masking and dual‑space alignment to prevent posterior collapse in continuous image tokenizers, enabling high‑quality generation with just 64‑128 tokens and achieving strong gFID scores on ImageNet at 256×256 and 512×512 resolutions.
In visual generation, improving inference efficiency hinges on compressing images into fewer tokens. Image tokenizers fall into two camps: discrete (e.g., VQ‑VAE, VQ‑GAN) and continuous (e.g., KL‑VAE). Continuous tokenizers retain a smooth latent space and support gradient optimization, but when aggressively compressed (e.g., using only 64 tokens) they suffer severe posterior collapse caused by strong KL regularization, leading to loss of detail in reconstructed images.
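To make the failure mode concrete: a continuous tokenizer is trained with a reconstruction term plus a weighted KL term that pulls each token's posterior toward a standard Gaussian. A minimal PyTorch sketch of that regularizer follows (the function name and tensor shapes are illustrative, not taken from the paper):

```python
import torch

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Per-token KL( N(mu, sigma^2) || N(0, I) ) for a diagonal-Gaussian posterior.

    When the KL weight is large and the token budget is small, the cheapest way
    for the encoder to minimize this term is to push mu and logvar toward zero,
    i.e. posterior collapse: the tokens stop carrying image information.
    """
    # mu, logvar: (batch, num_tokens, dim)
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)

# Typical KL-VAE training objective, schematically:
# loss = recon_loss + beta * kl_to_standard_normal(mu, logvar).mean()
```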
MacTok: Masked Augmenting 1D Continuous Tokenizer
The authors propose MacTok, built on the insight that eliminating posterior collapse requires the latent space to learn genuinely "semantically rich" representations. MacTok adopts a dual-pipeline design that operates simultaneously on the image space and the representation space.
Image‑Space Design
Random Masking: provides a basic reconstruction challenge, encouraging the model to build robust mappings.
Semantic Masking: evaluates the semantic importance of each image region and preferentially masks the most informative areas, forcing the model to infer high-semantic content rather than merely fill in texture.
The optimal image‑mask ratio is found to be around 70%.
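A minimal sketch of both masking strategies, under assumed details (here the importance scores come from a pretrained encoder, e.g. CLS-attention weights; the paper's exact scoring function may differ):

```python
import torch

def semantic_mask(patch_importance: torch.Tensor, mask_ratio: float = 0.7) -> torch.Tensor:
    """Mask the most informative patches first.

    patch_importance: (batch, num_patches) scores, e.g. CLS-attention weights
    from a pretrained ViT (how MacTok scores importance is our assumption).
    Returns a boolean mask of shape (batch, num_patches); True = masked.
    """
    b, n = patch_importance.shape
    num_masked = int(n * mask_ratio)
    # Indices of the top-scoring (most informative) patches per image.
    top_idx = patch_importance.topk(num_masked, dim=-1).indices
    mask = torch.zeros(b, n, dtype=torch.bool)
    mask.scatter_(1, top_idx, True)
    return mask

def random_mask(batch: int, num_patches: int, mask_ratio: float = 0.7) -> torch.Tensor:
    """Uniform-random baseline: mask a random subset of patches."""
    scores = torch.rand(batch, num_patches)
    return semantic_mask(scores, mask_ratio)
```

With semantic masking at a 70% ratio, the visible 30% is biased toward low-information regions, so reconstruction can only succeed if the latent tokens encode the image's semantic content.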
Representation‑Space Design
MacTok introduces global and local alignment: latent tokens are aligned locally with fine-grained patch features and globally with a condensed CLS feature, directly constraining the semantic structure of the latent space.
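One plausible reading of this objective, sketched below; the projection heads, the pooling used to match token and patch counts, and the cosine form of the loss are all our assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def alignment_losses(latent_tokens, patch_feats, cls_feat, proj_local, proj_global):
    """Global + local alignment sketch.

    latent_tokens: (B, K, D)  continuous tokens from the tokenizer
    patch_feats:   (B, N, C)  patch features from a frozen vision encoder
    cls_feat:      (B, C)     its condensed CLS feature
    proj_local / proj_global: learned D -> C projection heads (assumed)
    """
    B, K, D = latent_tokens.shape
    # Local: each projected latent token should match fine-grained patch
    # features. With K != N, pool the N patch features down to K groups.
    local = proj_local(latent_tokens)                               # (B, K, C)
    target = F.adaptive_avg_pool1d(patch_feats.transpose(1, 2), K)  # (B, C, K)
    target = target.transpose(1, 2)                                 # (B, K, C)
    loss_local = 1 - F.cosine_similarity(local, target, dim=-1).mean()

    # Global: the pooled latent sequence should match the CLS feature.
    pooled = proj_global(latent_tokens.mean(dim=1))                 # (B, C)
    loss_global = 1 - F.cosine_similarity(pooled, cls_feat, dim=-1).mean()
    return loss_local + loss_global
```

The local term keeps individual tokens spatially grounded, while the global term anchors the whole sequence to a semantic summary, so neither can drift toward an uninformative prior.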
Experimental Results
On ImageNet 256×256, MacTok with 64 tokens achieves a gFID of 1.44, and with 128 tokens a gFID of 1.52, compressing the token sequence by a factor of 64 compared to traditional methods. Compared to SoftVQ‑VAE, the 64‑token version reduces gFID by 2.21; the 128‑token version matches the baseline REPA (gFID ≈ 1.42).
On ImageNet 512×512, both the 64‑token and 128‑token variants maintain a gFID of 1.52 and an Inception Score above 306, demonstrating that performance does not degrade despite the larger resolution.
Visualization
UMAP dimensionality reduction visualizations show that MacTok preserves distinct semantic clusters in latent space, whereas prior methods exhibit severe overlap due to collapse.
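A quick way to reproduce this kind of sanity check with umap-learn; the random array below is a placeholder standing in for pooled tokenizer outputs, one vector per image:

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

# Placeholder latents: in practice, replace with the tokenizer's pooled
# 64-token outputs (one vector per image) and the images' class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 256))        # (images, pooled latent dim)
labels = rng.integers(0, 10, size=2000)

emb = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
plt.title("UMAP of pooled latent tokens")
plt.show()
```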
Conclusion
By integrating semantic masking in the image domain with global‑local alignment in the representation domain, MacTok creates a robust learning environment that prevents posterior collapse without altering the underlying tokenizer architecture. This dual‑space strategy offers a practical reference for future high‑efficiency, high‑quality image generation models.