How LongCat-Next Redefines Multimodal AI with Discrete Tokens

The LongCat-Next model from Meituan introduces a native multimodal architecture that uses discrete tokenization for vision and audio, achieving unified understanding and generation across modalities while delivering state‑of‑the‑art benchmark performance and simplifying training pipelines.


01 Multimodal Translation Dilemma

Most existing multimodal models translate images and audio into continuous feature vectors and then project them into a language model's embedding space, which inevitably loses information and reduces efficiency. LongCat-Next instead represents every modality as discrete tokens, so that text, vision, and audio share a single token space from the start.
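
To make the contrast concrete, here is a minimal sketch of what "sharing the same token space" means in practice: modality-specific codes are shifted into disjoint ranges of one vocabulary, so a single next-token predictor handles everything. All names and sizes below are illustrative assumptions, not LongCat-Next's actual configuration.

```python
# Illustrative sketch, not the official implementation.
TEXT_VOCAB = 32_000        # ordinary text tokens
IMG_CODEBOOK = 8_192       # discrete visual codes from a vision tokenizer
AUD_CODEBOOK = 4_096       # discrete audio codes from an audio tokenizer

# Adapter-style models: image -> encoder -> float vectors -> projector -> "soft" embeddings.
# The LM never sees these as vocabulary items, so it cannot generate them.

# Discrete-token models: every modality maps to integer IDs in one shared vocabulary,
# so understanding and generation are both next-token prediction.
def to_unified_ids(text_ids, image_codes, audio_codes):
    """Shift modality-specific codes into disjoint ID ranges of one vocabulary."""
    img_offset = TEXT_VOCAB
    aud_offset = TEXT_VOCAB + IMG_CODEBOOK
    return (
        list(text_ids)
        + [img_offset + c for c in image_codes]
        + [aud_offset + c for c in audio_codes]
    )

seq = to_unified_ids(text_ids=[101, 7, 42], image_codes=[3, 880], audio_codes=[17])
print(seq)  # one flat token stream the language model can both read and predict
```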

Figure 2: LongCat-Next architecture overview

02 Discretizing Vision Is Harder Than Expected

Vision is continuous and high-dimensional, making direct tokenization challenging. The LongCat team created dNaViT, which relies on a Semantic-Aligned Encoder (SAE) trained on large-scale image-text pairs to produce visual “words”. They then apply multi-layer Residual Vector Quantization (RVQ) to compress the representation, achieving up to 28× compression while preserving semantics.
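
For readers unfamiliar with RVQ, the sketch below shows the core idea in NumPy: each quantization level encodes the residual left by the previous levels, so a long float vector collapses into a handful of integer codes. The codebook sizes, dimensions, and compression arithmetic are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Quantize vector x with a stack of codebooks; each level encodes the residual."""
    residual = x.astype(np.float64).copy()
    indices, reconstruction = [], np.zeros_like(residual)
    for codebook in codebooks:                      # codebook: (K, D) array
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                 # nearest entry to the current residual
        indices.append(idx)
        reconstruction += codebook[idx]
        residual -= codebook[idx]                   # the next level refines what is left
    return indices, reconstruction

rng = np.random.default_rng(0)
D, K, LEVELS = 64, 256, 4
codebooks = [rng.normal(size=(K, D)) for _ in range(LEVELS)]
x = rng.normal(size=D)

ids, x_hat = residual_vector_quantize(x, codebooks)
print(ids)  # 4 integer codes replace a 64-dim float vector
# Compression intuition: 64 floats * 32 bits = 2048 bits vs 4 codes * 8 bits = 32 bits (64x here);
# the paper's 28x figure depends on its real codebook sizes and patch configuration.
```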

Figure 3: dNaViT design overview
Figure 4: Tokenizer and Detokenizer training flow
Figure 5: Reconstruction ability comparison of different encoders

03 Audio Tokenization and Cross‑Modal Alignment

The audio branch uses a Whisper‑based encoder followed by an 8‑layer RVQ to produce discrete audio tokens. An internal language‑guidance mechanism aligns each audio segment with a corresponding text prompt, enabling both serial (text‑then‑audio) and parallel (simultaneous) generation.
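
A rough sketch of how the two decoding strategies differ at the sequence level is given below; the special tokens and the text-to-audio interleaving ratio are placeholders, not LongCat-Next's actual scheme.

```python
# Illustrative layouts only; token names are hypothetical placeholders.

def serial_layout(text_tokens, audio_tokens):
    """Text-then-audio: the model first emits the full text, then the audio codes."""
    return ["<text>"] + text_tokens + ["</text>", "<audio>"] + audio_tokens + ["</audio>"]

def parallel_layout(text_tokens, audio_tokens, ratio=2):
    """Simultaneous generation: interleave audio codes with text at a fixed ratio,
    so speech can start streaming before the sentence is finished."""
    out, a = [], 0
    for t in text_tokens:
        out.append(t)
        out.extend(audio_tokens[a:a + ratio])
        a += ratio
    out.extend(audio_tokens[a:])          # flush any remaining audio codes
    return out

text = ["Hi", "there", "!"]
audio = [f"a{i}" for i in range(8)]
print(serial_layout(text, audio))
print(parallel_layout(text, audio))
```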

Figure 6: Audio Tokenizer framework
Figure 7: Two speech generation strategies

04 Benchmark Results Challenge Conventional Assumptions

On visual tasks such as STEM reasoning, OCR, and document understanding, LongCat‑Next matches or surpasses specialized vision models, topping MathVista (83.1) and MathVision (64.7). For text‑to‑image generation, it outperforms dedicated models on GenEval, DPG‑Bench, and LongText.

Figure 1: Benchmark performance comparison
Figure 2: Visual benchmark comparison
Figure 2: Generation quality comparison with specialized T2I models

On audio, the model reports an MMAU score of 76.40, an ASR word-error rate of 1.47% on AISHELL-1, and a SeedTTS Chinese score of 1.90, rivaling top-tier models such as Gemini 3.1 Flash-Lite and Qwen-3 Omni, all within a single unified backbone.

Figure 4: Audio benchmark comparison
Figure 3: Unified multimodal model comparison

05 Towards a Platonic Representation Hypothesis

t-SNE visualizations reveal that LongCat-Next's visual and textual tokens intermix in the same embedding space, whereas in non-native models they remain clearly separated. The team reads this as support for the “Platonic Representation Hypothesis”, the idea that different modalities are merely different projections of a shared underlying reality.
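
The kind of analysis behind such a figure can be reproduced in a few lines. The sketch below runs t-SNE over token embeddings from two modalities and uses a simple nearest-neighbor check as a proxy for intermixing; the embeddings here are random placeholders standing in for rows of the model's actual embedding table.

```python
import numpy as np
from sklearn.manifold import TSNE
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(500, 1024))   # placeholder: embeddings of text tokens
vis_emb = rng.normal(size=(500, 1024))    # placeholder: embeddings of visual codes

emb = np.concatenate([text_emb, vis_emb])
labels = np.array([0] * len(text_emb) + [1] * len(vis_emb))

# Project to 2D for visualization / analysis.
proj = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(emb)

# Proxy for intermixing: how often a point's nearest neighbor comes from the other modality.
_, nn = cKDTree(proj).query(proj, k=2)    # k=2 because the first hit is the point itself
cross = (labels[nn[:, 1]] != labels).mean()
print(f"cross-modal nearest-neighbor rate: {cross:.2f}")
```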

Figure 12: Modal feature distribution comparison

Conclusion

Unifying all modalities as discrete tokens brings three major benefits:
1. Simplified architecture and engineering: training and deployment follow the mature language-model pipeline.
2. Capability sharing: understanding and generation become two sides of the same token-prediction problem.
3. Data expansion: any image, text, audio, or video can be converted into a uniform token stream, unlocking larger self-supervised datasets.

Figure 13: Staged training process

Current limitations include compute and data scale, leaving open research directions such as longer cross‑modal context, multi‑turn multimodal dialogue, and finer‑grained interactive generation.

GitHub: https://github.com/meituan-longcat/LongCat-Next
HuggingFace: https://huggingface.co/meituan-longcat/LongCat-Next
Demo: https://longcat.chat/longcat-next
Tags: AI, benchmark, tokenization, Meituan
Written by PaperAgent
Daily updates, analyzing cutting-edge AI research papers