Meituan's Fully Discrete Multimodal Foundation Model (LongCat-Next) Shows That All Physical Signals Can Converge to Tokens
LongCat-Next, a multimodal model released by Meituan with roughly 3 billion activated parameters, adopts a purely discrete, token-based architecture (DiNA) trained with plain next-token prediction. It outperforms same-sized rivals on OmniDocBench-EN and CharXivRQ, matches QwenVL on visual tasks, avoids catastrophic forgetting, and reaches a SWE-Bench score of 43.0, as demonstrated through extensive benchmarks plus hands-on tests covering receipt extraction, OCR, dialect audio reasoning, and image generation.
The LongCat-Next model, open-sourced by Meituan's LongCat team, introduces a fully discrete multimodal foundation built on the LongCat-Flash-Lite MoE base with only 3 B activated parameters. It follows the simplest possible paradigm, next-token prediction (NTP), treating code, high-resolution images, and noisy audio uniformly as discrete tokens.
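To make the NTP framing concrete, the sketch below (a minimal toy, not LongCat-Next's actual code) trains an ordinary cross-entropy objective over a single flat id space shared by text, image, and audio tokens; the vocabulary sizes, dimensions, and the TinyNTP class are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary layout: one flat id space shared by every modality.
TEXT_VOCAB, IMAGE_VOCAB, AUDIO_VOCAB = 32_000, 16_384, 8_192
VOCAB = TEXT_VOCAB + IMAGE_VOCAB + AUDIO_VOCAB

class TinyNTP(nn.Module):
    """Minimal next-token-prediction model over the unified token space."""
    def __init__(self, vocab=VOCAB, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens):                       # tokens: (B, T) integer ids
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.lm_head(h)                       # (B, T, VOCAB) logits

# One interleaved sequence: text ids, then image-patch ids, then audio ids,
# all drawn from the same id space -- the loss never cares which modality is which.
tokens = torch.randint(0, VOCAB, (2, 64))
logits = TinyNTP()(tokens)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
)
print(float(loss))
```

Because the objective never inspects which modality a token came from, extending the model to a new signal type is, in principle, just a matter of extending the id space.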
DiNA Architecture: The model implements Discrete Native Autoregression (DiNA), which unifies the representations of all modalities in a single shared token space. t-SNE visualizations show tightly interwoven embeddings across text, audio, and vision, confirming that these heterogeneous signals converge to a common representation.
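For readers who want to run the same sanity check on any shared embedding table, the projection itself is a few lines of scikit-learn; the embeddings below are random stand-ins rather than the model's real ones.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for token embeddings pulled from a shared embedding table.
rng = np.random.default_rng(0)
emb = {
    "text":  rng.normal(0.0,  1.0, size=(500, 256)),
    "image": rng.normal(0.1,  1.0, size=(500, 256)),
    "audio": rng.normal(-0.1, 1.0, size=(500, 256)),
}

X = np.concatenate(list(emb.values()))
labels = [m for m, e in emb.items() for _ in range(len(e))]

# 2-D projection; tightly interleaved clusters (rather than three separate blobs)
# would indicate that the modalities share one representation space.
proj = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(proj.shape)  # (1500, 2) -- scatter-plot the rows colored by `labels` to inspect mixing
```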
Vision Tokenization – dNaViT: LongCat-Next's novel Discrete-Native Vision Transformer (dNaViT) converts continuous visual signals into homogeneous discrete tokens. It employs Residual Vector Quantization (RVQ), recursively quantizing the residual error across a stack of codebooks, to reach a 28× compression ratio while preserving high-frequency detail. The tokenizer handles arbitrary input resolutions, which underpins the model's strong performance on complex chart-reasoning tasks.
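The RVQ mechanism itself fits in a short sketch. The code below fits toy codebooks stage by stage with k-means and uses invented sizes (nothing is taken from the dNaViT implementation); it only shows how each stage quantizes the residual left over by the previous stages.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rvq(x, n_stages=4, codebook_size=128, seed=0):
    """Fit a residual vector quantizer: each stage runs k-means on whatever the
    previous stages failed to capture, so coarse structure and fine detail land
    in different codebooks."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    codebooks, codes = [], []
    for _ in range(n_stages):
        km = KMeans(n_clusters=codebook_size, n_init=2, random_state=seed).fit(residual)
        cb = km.cluster_centers_                    # (K, D) code vectors for this stage
        idx = km.labels_                            # nearest code per input vector
        chosen = cb[idx]                            # (N, D) quantized contribution
        quantized += chosen
        residual -= chosen
        codebooks.append(cb)
        codes.append(idx)
    return codebooks, np.stack(codes, axis=1), quantized

# Toy example: 4 stages of 128-entry codebooks over 64-dim "patch features".
rng = np.random.default_rng(0)
feats = rng.normal(size=(2048, 64)).astype(np.float32)
_, codes, recon = fit_rvq(feats)
print(codes.shape, float(((feats - recon) ** 2).mean()))  # error drops as stages accumulate
```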
Generation Head – Depth Transformer: Token streams from all modalities are summed before entering a Depth Transformer, which serves as the multimodal prediction head without adding overhead to the front-end encoder.
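One plausible reading of that design, sketched below with entirely hypothetical shapes and names, is a small head that takes one backbone hidden state per position and autoregressively predicts the stacked residual-code levels for that position, summing the embeddings of already-predicted levels into the input of later ones.

```python
import torch
import torch.nn as nn

class DepthHead(nn.Module):
    """Toy depth-transformer head: given one backbone state per position, predict
    the stacked residual-code levels for that position one level at a time."""
    def __init__(self, dim=256, codebook_size=256):
        super().__init__()
        self.level_embed = nn.Embedding(codebook_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.depth_tf = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, codebook_size)

    def forward(self, h, codes):
        # h:     (B, dim)   hidden state from the main backbone
        # codes: (B, L)     ground-truth residual codes (teacher forcing)
        prev = self.level_embed(codes)                  # (B, L, dim)
        # exclusive prefix sum: level k sees the summed embeddings of levels < k
        prefix = torch.cumsum(prev, dim=1) - prev
        x = h.unsqueeze(1) + prefix                     # summed stream per level
        L = codes.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.out(self.depth_tf(x, mask=causal))  # (B, L, codebook_size)

head = DepthHead()
logits = head(torch.randn(2, 256), torch.randint(0, 256, (2, 4)))  # (2, 4, 256)
print(logits.shape)
```

Keeping the depth dimension in a separate, much smaller transformer is what keeps the per-position prediction cost off the front-end encoder.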
Semantic Alignment Encoder (SAE): To mitigate semantic loss during discretization, a Semantic Alignment Encoder aligns token representations globally through multi-task dense learning, ensuring that generated tokens retain recoverable information.
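A minimal instantiation of such an alignment objective, assuming a frozen continuous "teacher" encoder and a per-token cosine loss (the report's multi-task dense learning presumably combines several such terms), could look like this:

```python
import torch
import torch.nn.functional as F

def dense_alignment_loss(discrete_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Push each quantized token's representation toward the matching feature of a
    continuous 'teacher' encoder, so discretization loses as little meaning as possible.
    Both inputs are (B, T, D): one feature vector per token position."""
    d = F.normalize(discrete_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return 1.0 - (d * t).sum(dim=-1).mean()   # mean cosine distance over all positions

# Toy usage: discrete-path features vs. a frozen semantic teacher (e.g. a ViT).
loss = dense_alignment_loss(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
print(float(loss))
```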
Dual-Path Detokenization: For decoding, LongCat-Next separates the process into two tracks. The first track uses a ViT-based structural pixel decoder to generate low-resolution anchor maps that preserve the global layout. The second track, a Diffusion Refiner, injects ultra-high-frequency details, enabling faithful reconstruction of intricate mathematical formulas with OCR-level text fidelity.
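Structurally, the two tracks compose like the placeholder pipeline below; both stages use made-up shapes, and the refiner loop merely stands in for a real diffusion model, but it shows where global layout and fine detail each come from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuralDecoder(nn.Module):
    """Stage 1 stand-in: map token embeddings to a low-resolution anchor image
    that fixes the global layout (text blocks, table grid, figure placement)."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(dim, 3 * patch * patch)    # each token -> one RGB patch

    def forward(self, tok):                              # tok: (B, T, dim), T a square number
        B, T, _ = tok.shape
        g, p = int(T ** 0.5), self.patch
        patches = self.proj(tok).view(B, g, g, 3, p, p)
        return patches.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, g * p, g * p)

def diffusion_refine(anchor, steps=4):
    """Stage 2 stand-in: upsample the anchor and iteratively add detail.
    Each loop iteration is a placeholder for one real denoising step."""
    x = F.interpolate(anchor, scale_factor=4, mode="bilinear", align_corners=False)
    for _ in range(steps):
        x = x + 0.01 * torch.randn_like(x)
    return x

tokens = torch.randn(1, 16, 256)                 # 4x4 grid of visual tokens
anchor = StructuralDecoder()(tokens)             # (1, 3, 64, 64) layout anchor
image  = diffusion_refine(anchor)                # (1, 3, 256, 256) refined output
print(anchor.shape, image.shape)
```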
Benchmark Results: On the OmniDocBench-EN and CharXivRQ leaderboards, LongCat-Next surpasses the same-size Qwen3-Omni-A3B across all metrics. Its visual understanding matches the specialized QwenVL model, and it attains a SWE-Bench score of 43.0, indicating strong code-generation capability.
Practical Evaluations: Experiments include extracting structured JSON from a supermarket receipt (handling noisy numeric patterns) with precise verification of the settlement logic, solving a reasoning task posed in Sichuan-dialect audio, and generating a bilingual meeting notice read out with natural prosody. Image-generation tests produce a children's book cover with flawless typography placement and a high-fidelity, OCR-friendly rendering of complex charts.
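For reference, the receipt test reduces to a schema-plus-invariant problem of roughly the following shape; the prompt wording, field names, and numbers are invented for illustration and are not taken from the article's actual run.

```python
import json

# Hypothetical prompt sent alongside the receipt image.
prompt = (
    "Extract every line item from this receipt and return JSON with the fields "
    "store, date, items[{name, qty, unit_price, subtotal}], and total. "
    "Then verify that the item subtotals sum to the printed total."
)

# The kind of output the settlement check expects (values are made up).
expected = {
    "store": "Example Supermarket",
    "date": "2026-05-01",
    "items": [
        {"name": "milk 1L",   "qty": 2, "unit_price": 3.50, "subtotal": 7.00},
        {"name": "eggs 12pk", "qty": 1, "unit_price": 4.20, "subtotal": 4.20},
    ],
    "total": 11.20,
}

# The "precise settlement logic" boils down to this invariant on the parsed JSON.
assert abs(sum(i["subtotal"] for i in expected["items"]) - expected["total"]) < 0.01
print(json.dumps(expected, indent=2))
```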
Conclusion: By converting continuous visual and auditory signals into a unified discrete token space, LongCat-Next demonstrates that a modest-sized model (3 B activated parameters) can achieve cross-modal understanding and generation without resorting to large heterogeneous modules. The code, model weights, and full technical report are publicly available, offering a valuable reference for researchers working on multimodal fusion and token-level modeling.
