Do All Physical Signals Reduce to Discrete Tokens? LongCat‑Next Explained
LongCat‑Next, the new foundation model from Meituan's LongCat team with 3 B activated parameters, adopts a pure‑discrete DiNA architecture built on next‑token prediction, converting vision, audio, and text into a unified token stream. It surpasses same‑size multimodal models on OmniDocBench‑EN, CharXivRQ, and SWE‑Bench, avoids catastrophic forgetting, and introduces dNaViT, RVQ compression, and a dual‑path detokenizer for high‑fidelity generation.
Physical signals such as text, images, and audio have traditionally been modeled with heterogeneous modules, largely because visual data is continuous and hard to fit into autoregressive frameworks. Existing solutions bolt on complex spatial encodings or heterogeneous components, which work as short‑term fixes but blur the architectural unity of the model.
Meituan’s LongCat‑Next, released by the LongCat team, tackles this by adopting a pure‑discrete next‑token prediction (NTP) paradigm called Discrete Native Autoregression (DiNA). The model treats every modality as a sequence of discrete tokens, unifying their representation at the lowest level.
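To make the paradigm concrete, here is a minimal sketch of what "everything is a discrete token" implies for training: one shared vocabulary in which each modality occupies a disjoint ID range, and a single next‑token cross‑entropy loss over the interleaved sequence. The vocabulary sizes and layout below are illustrative assumptions, not details from the technical report.

```python
import torch
import torch.nn.functional as F

# Hypothetical token-space layout: text, image, and audio codes share
# one vocabulary in disjoint ID ranges (an assumption for illustration;
# the real LongCat-Next layout is not reproduced here).
TEXT_VOCAB = 32_000
IMAGE_CODES = 8_192   # e.g., entries from a visual RVQ codebook
AUDIO_CODES = 4_096
VOCAB_SIZE = TEXT_VOCAB + IMAGE_CODES + AUDIO_CODES

def ntp_loss(model, tokens: torch.LongTensor) -> torch.Tensor:
    """One cross-entropy next-token loss over an interleaved
    text/image/audio token sequence -- no per-modality heads."""
    logits = model(tokens[:, :-1])          # (B, T-1, VOCAB_SIZE)
    targets = tokens[:, 1:]                 # targets shifted by one position
    return F.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1)
    )
```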
Core Architecture
Built on the LongCat‑Flash‑Lite MoE base, with only 3 B activated parameters.
Uses a Discrete Native Vision Transformer (dNaViT) to tokenize visual inputs directly into discrete tokens, supporting arbitrary resolutions.
dNaViT incorporates Residual Vector Quantization (RVQ), achieving a 28× compression ratio by recursively fitting residuals across stacked codebooks (see the RVQ sketch after this list).
A Semantic Alignment Encoder (SAE) aligns multimodal token spaces through global alignment and dense multi‑task learning, preserving high‑level semantics.
During generation, a Depth Transformer serves as the multimodal prediction head, while a Dual‑Path Detokenizer decouples low‑resolution structural generation (a ViT‑based pixel decoder) from high‑frequency detail restoration (a Diffusion Refiner); a control‑flow sketch follows the list.
The dual‑path design keeps the front‑end encoder lightweight and enables parallel decoding of multi‑level tokens without extra computational burden.
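The recursive residual fitting behind RVQ is a standard technique (familiar from neural audio codecs); the sketch below shows the generic mechanism only. Dimensions, codebook size, and depth are placeholders rather than LongCat‑Next's actual configuration, and training details such as codebook updates and commitment losses are omitted.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Generic residual vector quantizer: each stage quantizes the
    residual left by the previous stage, so reconstruction improves
    with depth while every token stays a small integer code.
    (Depth/sizes below are illustrative, not LongCat-Next's settings.)"""

    def __init__(self, dim: int = 256, codebook_size: int = 1024, depth: int = 4):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor):
        residual, quantized, codes = x, torch.zeros_like(x), []
        for cb in self.codebooks:
            # Nearest codebook entry for the current residual.
            dists = torch.cdist(residual, cb.weight)   # (N, codebook_size)
            idx = dists.argmin(dim=-1)                 # discrete codes
            chosen = cb(idx)
            quantized = quantized + chosen
            residual = residual - chosen               # pass residual to next stage
            codes.append(idx)
        # Straight-through estimator: gradients flow to x, not the argmin.
        quantized = x + (quantized - x).detach()
        return quantized, torch.stack(codes, dim=-1)   # codes: (N, depth)
```

The quoted 28× compression ratio would then reflect how few such codes are emitted per image relative to raw pixels; the sketch covers only the quantization step itself.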
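Read as a pipeline, the dual‑path design separates a cheap, parallel structural pass from an iterative detail pass. The sketch below captures only that control flow; both component interfaces (`pixel_decoder`, `diffusion_refiner.sample`) are hypothetical stand‑ins, not the released API.

```python
import torch

def detokenize(codes: torch.LongTensor, pixel_decoder, diffusion_refiner,
               refine_steps: int = 20) -> torch.Tensor:
    """Dual-path detokenization, control flow only (interfaces assumed):
    1) A ViT-based pixel decoder maps the discrete multi-level tokens
       to a low-resolution structural draft in one parallel pass.
    2) A diffusion refiner restores high-frequency detail, conditioning
       on the draft instead of denoising from pure noise.
    """
    draft = pixel_decoder(codes)                       # coarse layout, one cheap pass
    image = diffusion_refiner.sample(condition=draft,  # iterative detail restoration
                                     num_steps=refine_steps)
    return image
```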
Benchmark Performance
On OmniDocBench‑EN (document parsing) and CharXivRQ (chart understanding), LongCat‑Next outperforms the same‑size multimodal model Qwen3‑Omni‑A3B across all metrics.
Its visual understanding matches that of the specialized QwenVL model of comparable size.
LongCat‑Next avoids catastrophic forgetting, retaining the deep logical reasoning of language models.
On SWE‑Bench, it scores 43.0, indicating strong code‑generation capability.
Qualitative Evaluations
Receipt extraction: the model parses a supermarket receipt containing corrections, outputs structured JSON, and correctly reconciles the discount logic.
OCR robustness: it accurately reproduces complex mathematical formulas without text distortion.
Perplexity chart analysis: when fed a perplexity curve from the YaRN paper, LongCat‑Next correctly describes the trend and its conclusions without hallucination.
Image generation: given a prompt for a children’s book cover with layout and font specifications, the generated image respects typography and placement.
Audio understanding: the model comprehends a logical reasoning question spoken in Sichuan dialect and provides a correct reasoning trace.
Speech synthesis: it produces natural‑sounding bilingual (Chinese‑English) meeting notices with seamless prosody.
All code, model weights, and the full technical report are publicly available on GitHub and HuggingFace.
LongCat‑Next demonstrates that a unified discrete token foundation can deliver high‑quality multimodal performance while keeping the model compact. It offers a promising direction for researchers seeking to reduce reliance on large heterogeneous stacks and instead pursue architectural unification.
