Meituan’s Fully Discrete Multimodal Base Model (LongCat-Next) Shows All Physical Signals Can Converge to Tokens
LongCat-Next is a 3-billion-parameter multimodal model released by Meituan. It adopts a purely discrete, token-based architecture (DiNA) with next-token prediction. It outperforms same-size rivals on OmniDocBench-EN and CharXivRQ and matches QwenVL on visual tasks, while avoiding catastrophic forgetting and achieving a SWE-Bench score of 43.0. These results are demonstrated through extensive benchmarks and through experiments on receipt extraction, OCR, audio dialect reasoning, and image generation.
