Unified Multimodal Modeling: How LongCat-Next Bridges Understanding and Generation
The article analyzes why text models naturally combine understanding and generation, explains the fundamental conflicts that prevent images from sharing the same tokenization, and details LongCat-Next’s discrete autoregressive approach—using SAE visual encoders, residual vector quantization, and a unified LLM backbone—to achieve a single model that can both comprehend and create multimodal content.
In computer‑vision research there are two historically separate model families: the "understanding" stream (e.g., LLaVA, InternVL) that can interpret images but cannot generate them, and the "generation" stream (e.g., Stable Diffusion, DALL‑E) that can create images but cannot understand them. Text models avoid this split because a single next‑token prediction objective handles both tasks at once.
Three intrinsic properties of text make this possible:
The same token is used for reading and writing; tokenization is lossless and reversible (see the round‑trip sketch after this list).
All semantics reside at a single level—"cat" is the minimal unit for both understanding and generation.
Tokens are 1‑D, discrete, and finite, matching the causal attention of Transformers.
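To make the first property concrete, here is a minimal round trip through a real BPE tokenizer. The tiktoken library is used purely as an illustration; any BPE implementation behaves the same way.

```python
# Minimal sketch: BPE tokenization is lossless and reversible.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The cat sat on the mat."
ids = enc.encode(text)       # a 1-D sequence of IDs drawn from a finite vocabulary
restored = enc.decode(ids)   # exact reconstruction of the original string

assert restored == text      # nothing is lost in the round trip
```

The same ID sequence is what the model reads when understanding text and what it emits when generating it; there is no separate representation for either direction.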
Images conflict with these three dimensions:
Tokenization: BPE (lossless) vs. VQ‑VAE/VAE (lossy).
Semantic level: understanding needs high‑level semantics, generation needs low‑level pixels.
Data structure: 1‑D discrete finite vs. 2‑D continuous infinite.
Consequently, many works aim to find a unified representation or an architecture that sidesteps the incompatibility. Current native multimodal models can be examined along two axes: how the LLM draws images (the "generation paradigm") and whether the visual encoder is shared between understanding and generation.
Generation Paradigms
Pure Diffusion (Flow) – models such as UniModel[2] and FUDOKI[3] perform understanding and generation within a diffusion framework.
Pure Autoregressive – models like Lumina‑mGPT[4], Janus‑Pro[5], OneCAT[6] discretize images into tokens and predict them sequentially.
AR + External Diffusion – BLIP3‑o[7], OmniGen2[8], UniWorld‑V1[9] let the LLM reason while a separate diffusion model draws the picture.
AR + Diffusion/Flow Fusion – Mogao[10], BAGEL[11], Show‑o2[12], InternVL‑U[13] combine autoregressive loss for text with diffusion loss for images inside the same Transformer.
Encoding Strategies
Shared Encoder – a single VQ tokenizer is used for both understanding and generation (e.g., Lumina‑mGPT, OneCAT).
Decoupled Encoder – semantic encoders (ViT, SigLIP) handle understanding while VAE/VQ handles generation.
Semantic‑Pixel Fusion – a single encoder extracts both semantic and pixel features (Show‑o2, NEO).
Learnable Query Bridge – LLM outputs a learnable query that is fed to an external diffusion model (BLIP3‑o).
Industry trend: most production‑grade models abandon a single token set, moving to decoupled or fused schemes, and AR + Diffusion/Flow fusion has become the dominant generation paradigm because text naturally fits autoregressive modeling while images fit diffusion.
LongCat‑Next
LongCat‑Next[14] deliberately chooses the discrete autoregressive route. Its core belief is “representational symmetry”: image tokens and text tokens should enjoy equal status—identical form, training paradigm, and loss function.
The authors identify two fundamental bottlenecks when discretizing images:
Representation‑capacity bottleneck: which encoder extracts visual features?
Quantization‑loss bottleneck: how to minimise information loss during discretisation?
Visual Encoder: SAE (Semantic-and-Aligned Encoder)
The paper surveys three existing visual encoder families:
Reconstruction VAE – excels at pixel‑level fidelity but lacks high‑level semantics.
Self‑supervised semantic encoders (DINOv2, SigLIP) – provide strong semantic features but either miss language alignment (DINOv2) or are trained to discriminate image‑text pairs rather than capture fine‑grained generation details (SigLIP).
Raw pixels – extremely redundant and computationally expensive.
The authors propose a fourth class: SAE, trained with large‑scale vision‑language joint pre‑training. SAE offers two key properties:
Semantic Completeness – knows not only that the object is a cat, but also its pose, context, and relationships.
Language Affinity – its feature space aligns naturally with LLM text space, eliminating the need for extra adaptation layers.
Residual connections in models such as QwenViT preserve low‑level visual signals, meaning that even without explicit reconstruction supervision the SAE features retain considerable pixel‑level information.
Residual Vector Quantization (RVQ)
Standard VQ maps a continuous vector to the nearest codebook entry, incurring a hard ceiling because only one codeword can approximate the vector. RVQ overcomes this by quantising the residual error across multiple codebooks, progressively refining the representation.
Level 1: z → quantize against codebook₁ → codeword c₁; residual r₁ = z − c₁
Level 2: r₁ → quantize against codebook₂ → codeword c₂; residual r₂ = r₁ − c₂
Level 3: r₂ → quantize against codebook₃ → codeword c₃; residual r₃ = r₂ − c₃
...
Reconstruction: z ≈ c₁ + c₂ + c₃ + ...
Analogy: VQ says “red”, losing the orange tint; RVQ first says “red”, then adds “orange‑tint”, then “warm”, preserving finer detail.
Each level captures increasingly fine details; the first level provides coarse semantics sufficient for understanding, while later levels add the texture and pixel‑level information needed for high‑quality generation. In principle, adding more levels lets the discrete code approach the continuous representation arbitrarily closely.
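A minimal NumPy sketch of this encode/decode loop; the number of levels, codebook size, and feature dimension below are illustrative choices, not the model's actual configuration.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Quantize vector z with a stack of codebooks; each level encodes the
    residual left over by the previous levels."""
    ids, residual = [], z
    for cb in codebooks:                          # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))               # nearest codeword at this level
        ids.append(idx)
        residual = residual - cb[idx]             # pass the residual to the next level
    return ids

def rvq_decode(ids, codebooks):
    """Reconstruction is simply the sum of the selected codewords."""
    return sum(cb[i] for i, cb in zip(ids, codebooks))

# Illustrative sizes only: 3 levels, 256 codewords each, 8-dim features.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(3)]
z = rng.normal(size=8)

ids = rvq_encode(z, codebooks)
for k in range(1, len(codebooks) + 1):
    z_hat = rvq_decode(ids[:k], codebooks[:k])
    print(k, np.linalg.norm(z - z_hat))           # reconstruction error using 1, 2, then 3 levels
```

With trained codebooks each additional level tightens the reconstruction; the random codebooks here only demonstrate the mechanics.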
dNaViT: Discrete Native‑resolution Vision Transformer
Combining SAE with RVQ yields dNaViT (discrete Native‑resolution Vision Transformer), a unified visual tokenizer that:
Encodes arbitrary‑resolution images into a discrete ID sequence with up to 28× compression.
Supports de‑tokenisation back to the original image.
Preserves native aspect ratio without forced resizing.
Because each image location now has multiple RVQ tokens, the authors introduce Additive Encoding—summing the embeddings of all levels instead of concatenating—so the sequence length stays constant. A lightweight DepthTransformer decodes all levels in a single autoregressive step, keeping inference speed comparable to text token processing.
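A sketch of how additive encoding might look in code: one embedding table per RVQ level, with the level embeddings at each position summed into a single input vector. The module name, level count, and dimensions are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class AdditiveRVQEmbedding(nn.Module):
    """One embedding table per RVQ level; the level embeddings at each
    position are summed, so an image with N positions still yields N inputs."""
    def __init__(self, num_levels=3, codebook_size=256, d_model=1024):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding(codebook_size, d_model) for _ in range(num_levels)]
        )

    def forward(self, level_ids):                    # level_ids: (batch, positions, num_levels)
        embs = [tab(level_ids[..., k]) for k, tab in enumerate(self.tables)]
        return torch.stack(embs, dim=0).sum(dim=0)   # (batch, positions, d_model)

# Toy usage: one image with 64 positions and 3 RVQ levels per position.
ids = torch.randint(0, 256, (1, 64, 3))
emb = AdditiveRVQEmbedding()(ids)
print(emb.shape)                                     # torch.Size([1, 64, 1024])
```

Concatenating the levels instead would multiply the sequence length by the number of levels; summation keeps it fixed, and the DepthTransformer is then responsible for predicting each level's ID from that superposed representation.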
Audio Modality
The same design is extended to audio: Whisper serves as the SAE, RVQ discretises audio at 12.5 Hz (≈12.5 tokens per second), and a paired decoder plus flow‑matching network reconstructs high‑fidelity waveforms.
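Back-of-the-envelope sequence-length arithmetic implied by the 12.5 Hz rate; the RVQ depth used here is an assumed value for illustration.

```python
# 12.5 token positions per second of audio, per the stated 12.5 Hz rate.
seconds = 60
positions = int(12.5 * seconds)   # 750 positions for one minute of audio
rvq_levels = 3                    # assumed depth, for illustration only

# With additive encoding the LLM still sees `positions` inputs;
# the total number of discrete IDs behind them is positions * rvq_levels.
print(positions, positions * rvq_levels)   # 750 2250
```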
DiNA Paradigm: Discrete Native Autoregression
All modality tokenizers (dNaViT for vision, Whisper+RVQ for audio, BPE for text) feed a single LLM backbone—LongCat‑Flash MoE. The backbone sees only a stream of token IDs, is modality‑agnostic, and is trained with a single cross‑entropy loss for next‑token prediction.
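A minimal sketch of what "one loss, one ID stream" means in practice. The vocabulary split, the toy stand-in for the LongCat-Flash MoE backbone, and all sizes below are assumptions; only the training interface is the point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative joint vocabulary: disjoint ID ranges for text, image, and audio tokens.
VOCAB = 32_000 + 4_096 + 1_024

class TinyBackbone(nn.Module):
    """Stand-in with the right input/output shapes; the real backbone is a
    causal MoE Transformer, but the training interface is identical."""
    def __init__(self, d=256):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, ids):                           # ids: (batch, seq)
        return self.head(self.emb(ids))               # (batch, seq, VOCAB)

def next_token_loss(model, ids):
    """One cross-entropy over a mixed-modality ID stream; the model never
    sees which positions are text, image, or audio."""
    logits = model(ids[:, :-1])
    targets = ids[:, 1:]                              # shift by one position
    return F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))

model = TinyBackbone()
mixed_ids = torch.randint(0, VOCAB, (2, 128))         # toy interleaved sequence
print(next_token_loss(model, mixed_ids).item())
```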
Key findings from experiments:
No performance ceiling for discrete representations – scaling data reduces the loss gap between discrete and continuous models to ~1%.
Minimal conflict between understanding and generation – adding generation loss raises understanding loss by only 0.006; adding understanding loss reduces generation loss by 0.02.
MoE experts self‑specialise by modality despite modality‑agnostic routing, mirroring the soft‑division of labor seen in human cognition.
Visual and text tokens intermix in embedding space, supporting the Platonic Representation Hypothesis that different modalities converge to a shared latent space.
Reinforcement Learning Compatibility
Because the latent space is discrete, each token prediction is a natural action, allowing straightforward RL fine‑tuning (e.g., GRPO) without converting deterministic ODE sampling (required by diffusion) into stochastic SDE sampling. The authors design a multi‑dimensional reward model (overall ability, OCR accuracy, semantic alignment, image quality) and a sequence‑level filter to prevent entropy explosion during RL training.
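A hedged sketch of the group-relative advantage computation that GRPO-style RL applies to such discrete actions; the reward dimensions mirror the ones listed above, but the weights and the aggregation scheme are assumptions for illustration.

```python
import numpy as np

# Assumed weights for the four reward dimensions named in the description above.
WEIGHTS = {"overall": 0.4, "ocr": 0.2, "semantic_alignment": 0.2, "image_quality": 0.2}

def scalar_reward(scores):
    """Collapse the multi-dimensional reward into one scalar per sample."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def group_relative_advantages(group_scores, eps=1e-6):
    """GRPO-style advantages: normalize each sample's reward against the
    mean and std of its own group of sampled generations."""
    rewards = np.array([scalar_reward(s) for s in group_scores])
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of four generations for the same prompt.
group = [
    {"overall": 0.7, "ocr": 0.9, "semantic_alignment": 0.8, "image_quality": 0.6},
    {"overall": 0.5, "ocr": 0.4, "semantic_alignment": 0.6, "image_quality": 0.7},
    {"overall": 0.9, "ocr": 0.8, "semantic_alignment": 0.9, "image_quality": 0.8},
    {"overall": 0.3, "ocr": 0.5, "semantic_alignment": 0.4, "image_quality": 0.5},
]
print(group_relative_advantages(group))   # positive for above-average samples
```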
Overall, LongCat‑Next demonstrates that a unified discrete token representation for images (and audio) is feasible, enabling a single autoregressive LLM to both understand and generate multimodal content efficiently and scalably.