LongCat-Next: Turning Images, Audio, and Text into Tokens – What’s Next?
LongCat-Next is a 68.5‑billion‑parameter discrete‑native autoregressive multimodal model that tokenizes images, audio, and text. It challenges the belief that visual tokenization inevitably loses detail, matches specialized models on fine‑grained tasks, and demonstrates that joint understanding‑generation training can even improve generation quality.
Model Overview
Meituan recently released LongCat-Next, a discrete‑native autoregressive (DiNA) multimodal large model built on the LongCat‑Flash‑Lite MoE architecture. It contains 68.5 B total parameters with only 3 B active parameters and processes text, images and audio within a single token‑based framework.
Challenging Conventional Wisdom
The authors push back on a long‑standing belief in the multimodal community: that discretizing visual information into tokens inevitably discards fine‑grained details, leaving such models weaker than continuous‑feature models on OCR or complex chart understanding. LongCat-Next is presented as the first purely discrete model to reach visual understanding performance comparable to dedicated continuous models such as Qwen3‑VL‑A3B.
Performance Highlights
On image generation, LongCat-Next’s long‑text understanding and rendering surpass other unified models and rival the dedicated text‑to‑image model Flux‑dev.
In audio, its speech recognition and understanding outperform Gemini 3.1 Flash‑Lite preview and MiMo‑Audio of similar scale.
Qualitative Case Studies
Visual Understanding: Given a picture of a floral arrangement inspired by "La La Land", the model correctly identified yellow multi‑head roses, purple lisianthus, sage‑type herbs, and accompanying foliage, and also described the overall color style.
Landmark Recognition: When prompted with three distinct Chinese city landmarks, LongCat‑Next accurately named Beijing’s "Wangjing Eye", Guangzhou’s "Bank of China Tower" (nicknamed the "Cockroach Tower"), and Nanjing’s Youth Olympic Center, providing background details such as the architect Zaha Hadid for the latter.
Reasoning Puzzle: For a visual logic puzzle where each figure consists of an outer frame and internal black dots, the model discovered the hidden rule "frame sides − dot count = 2" and selected option B.
Image Generation: Using a prompt describing a crystal‑clear mountain lake at sunrise, the model produced a composition with professional‑grade lighting and perspective. It also rendered a minimalist product photo of a white mug with the text "LongCat‑Next" without distortion, and generated a vivid Santorini scene with striking blue‑white contrast.
Audio Understanding & Synthesis: When asked a classic logic puzzle in Sichuan dialect, the model recognized the speech accurately, preserved the dialectal semantics, and performed the logical inference without loss. It identified environmental sounds (e.g., train station noises) and inferred the recording location, and it captured speaker emotion (elevated volume and rapid speech indicating anger). For voice cloning, a Cantonese‑accented Mandarin reference was used to synthesize new content while preserving timbre, and the same procedure worked for English.
Tokenizing Vision: dNaViT
The visual tokenizer, named dNaViT (Discrete Native Resolution Vision Transformer), consists of three components:
SAE (Semantic Alignment Encoder): a large‑scale vision‑language pretrained encoder that captures high‑level semantics while retaining fine‑grained visual attributes.
RVQ Compression (Residual Vector Quantization): a multi‑layer cascade that quantizes residual errors, mapping continuous features into a finite discrete codebook while balancing compression rate and fidelity (a minimal sketch follows below).
Native‑Resolution Processing: images are handled at their original resolution, producing variable‑length token sequences and avoiding the information loss of fixed‑size cropping.
Residual connections within the encoder act as a “preservation channel”, allowing low‑level pixel information to bypass high‑level semantic layers, which the authors observed empirically to aid image reconstruction.
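To make the RVQ step concrete, here is a minimal residual vector quantization sketch in PyTorch. The feature dimension, codebook size, and number of levels are illustrative assumptions rather than the paper's configuration, and training details such as straight‑through gradients and codebook updates are omitted.

```python
# Minimal residual vector quantization (RVQ) sketch; dimensions, codebook
# size, and depth are illustrative, not LongCat-Next's actual settings.
import torch

class ResidualVQ(torch.nn.Module):
    def __init__(self, dim=1024, codebook_size=8192, num_levels=4):
        super().__init__()
        # One codebook per level; each level quantizes the residual
        # left over by the previous level.
        self.codebooks = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.randn(codebook_size, dim)) for _ in range(num_levels)]
        )

    def forward(self, x):  # x: (num_tokens, dim) continuous features
        residual, quantized, codes = x, torch.zeros_like(x), []
        for codebook in self.codebooks:
            # Nearest-neighbour lookup for the current residual.
            dists = torch.cdist(residual, codebook)   # (num_tokens, codebook_size)
            idx = dists.argmin(dim=-1)                # one discrete code per token
            selected = codebook[idx]
            quantized = quantized + selected
            residual = residual - selected            # pass the error to the next level
            codes.append(idx)
        # `codes` are the discrete tokens fed to the autoregressive backbone;
        # `quantized` approximates the original continuous feature.
        return torch.stack(codes, dim=-1), quantized
```

Each token thus ends up as a small stack of code indices whose summed codebook entries approximate the original feature, which is how the cascade trades compression rate against fidelity.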
From Tokens Back to Images
During decoding, discrete code embeddings are fed into a Vision‑Transformer‑based pixel decoder that restores spatial layout and object structure. A subsequent image refiner, trained with flow‑matching, enriches texture and high‑frequency details, yielding high‑quality outputs. The process can be summarized as “structure restoration → visual enhancement”.
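A rough picture of that two‑stage decode is sketched below. The interfaces of the pixel decoder and the flow‑matching refiner are hypothetical, and the conditioning scheme and step count are assumptions, not the paper's design.

```python
# Sketch of "structure restoration -> visual enhancement" decoding.
# `code_embed`, `pixel_decoder`, and `refiner` are hypothetical modules.
import torch

@torch.no_grad()
def decode_image(codes, code_embed, pixel_decoder, refiner, steps=20):
    # 1) Structure restoration: discrete code ids -> embeddings -> coarse image.
    tokens = code_embed(codes)          # (num_tokens, dim) code embeddings
    coarse = pixel_decoder(tokens)      # (3, H, W), rough layout and object structure

    # 2) Visual enhancement: integrate a flow-matching velocity field from
    #    noise toward the refined image, conditioned on the coarse output.
    x = torch.randn_like(coarse)
    for i in range(steps):
        t = torch.full((1,), i / steps)
        v = refiner(x, t, cond=coarse)  # predicted velocity at time t
        x = x + v / steps               # simple Euler step along the flow
    return x
```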
Audio Tokenization and Synthesis
Audio signals are first processed by Whisper encoders to extract semantic and paralinguistic features, then down‑sampled and compressed via RVQ into discrete audio tokens. The decoder first reconstructs a coarse mel‑spectrogram and then refines it with a flow‑matching model, significantly improving acoustic fidelity.
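A simplified view of the encode path, assuming a Whisper‑style encoder that emits frame‑level features; the downsampling factor and the `rvq` module interface are placeholders rather than the released pipeline.

```python
# Sketch of the audio tokenization path: encoder features -> temporal
# downsampling -> RVQ codes. Hop sizes and strides are illustrative.
import torch

def tokenize_audio(waveform, whisper_encoder, rvq, downsample=4):
    # Frame-level semantic/paralinguistic features, e.g. (num_frames, dim).
    feats = whisper_encoder(waveform)
    # Temporal downsampling lowers the token rate before quantization.
    feats = feats.transpose(0, 1).unsqueeze(0)                       # (1, dim, num_frames)
    feats = torch.nn.functional.avg_pool1d(feats, downsample, stride=downsample)
    feats = feats.squeeze(0).transpose(0, 1)                         # (num_frames // downsample, dim)
    # RVQ maps each pooled frame to a small stack of discrete codes.
    codes, _ = rvq(feats)
    return codes                                                     # discrete audio tokens
```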
Unified Modeling Benefits
All modalities are converted to a shared token space before entering a Modality‑Agnostic MoE decoder. This eliminates the need for separate visual, audio, or cross‑modal alignment modules; the backbone becomes a single token → shared embedding → autoregressive model, with modality‑specific components only at the input and output ends.
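The "token → shared embedding → autoregressive model" idea can be sketched as follows; the vocabulary sizes are made up and a plain Transformer stands in for the MoE backbone.

```python
# Toy modality-agnostic decoder: modality-specific embeddings and heads at
# the edges, one shared causal backbone in the middle.
import torch

class UnifiedDecoder(torch.nn.Module):
    def __init__(self, dim=2048, text_vocab=128_000, image_vocab=8192, audio_vocab=4096):
        super().__init__()
        # Modality-specific input embeddings project into one shared space.
        self.embed = torch.nn.ModuleDict({
            "text": torch.nn.Embedding(text_vocab, dim),
            "image": torch.nn.Embedding(image_vocab, dim),
            "audio": torch.nn.Embedding(audio_vocab, dim),
        })
        # Single modality-agnostic backbone (stand-in for the MoE decoder).
        layer = torch.nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True)
        self.backbone = torch.nn.TransformerEncoder(layer, num_layers=2)
        # Modality-specific output heads predict the next token per modality.
        self.heads = torch.nn.ModuleDict({
            "text": torch.nn.Linear(dim, text_vocab),
            "image": torch.nn.Linear(dim, image_vocab),
            "audio": torch.nn.Linear(dim, audio_vocab),
        })

    def forward(self, segments):
        # segments: list of (modality, LongTensor of token ids), in sequence order.
        x = torch.cat([self.embed[m](ids) for m, ids in segments]).unsqueeze(0)
        mask = torch.nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)      # causal attention over the mixed sequence
        last_modality = segments[-1][0]      # predict the next token of the final segment
        return self.heads[last_modality](h[:, -1])
```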
Joint training of understanding and generation under the same autoregressive objective does not cause capacity conflict. With equal token budgets, the unified model’s loss is within 0.006 of a pure‑understanding model’s and 0.02 lower than a pure‑generation model’s, suggesting that understanding tasks can even boost generation quality.
Generation Strategies
Serial Generation: generate a guiding text segment first, then the corresponding audio segment, reducing cross‑modal interference.
Parallel Generation: generate text and audio tokens simultaneously, delaying the first audio token to maintain alignment; this reduces latency for real‑time dialogue.
Both strategies are trained with a random‑delay paradigm, randomly sampling alignment delays during training to enhance robustness. Experiments confirm that parallel generation matches serial generation in efficiency and semantic accuracy.
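A minimal sketch of the random‑delay idea for the parallel strategy is below; the padding ids and delay range are assumptions made for illustration.

```python
# Sketch of the random-delay paradigm: the audio stream is shifted by a
# randomly sampled number of steps so the model learns to stay aligned
# under varying delays. Padding ids and max_delay are placeholders.
import random

TEXT_PAD, AUDIO_PAD = -1, -1   # placeholder ids for "no token at this step"

def build_parallel_targets(text_tokens, audio_tokens, max_delay=4):
    delay = random.randint(1, max_delay)   # sampled fresh for each training example
    steps = max(len(text_tokens), delay + len(audio_tokens))
    text_stream = [text_tokens[i] if i < len(text_tokens) else TEXT_PAD for i in range(steps)]
    # Audio starts `delay` steps later, so early text tokens can guide it.
    audio_stream = [AUDIO_PAD] * delay + list(audio_tokens)
    audio_stream += [AUDIO_PAD] * (steps - len(audio_stream))
    # At each step the model emits one text token and one audio token in parallel.
    return list(zip(text_stream, audio_stream))
```

For example, `build_parallel_targets([10, 11, 12], [7, 8, 9, 5], max_delay=2)` yields a step‑by‑step sequence where the audio tokens trail the text by one or two positions.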
System‑Level Optimizations: V‑Half Scheduling
Because multimodal workloads have heterogeneous compute costs, LongCat‑Next adopts a V‑Half pipeline scheduling strategy that folds the embedding stage and the modality‑specific loss stage onto the same device, eliminating pipeline bubbles and reducing inter‑stage communication via zero‑copy memory access.
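One way to picture the folding is a toy stage‑to‑device mapping, shown below; the real scheduler also handles micro‑batch ordering and communication, which this sketch ignores, and the stage and device counts are arbitrary.

```python
# Illustrative "V"-shaped pipeline mapping: the first stage (embedding) and
# the last stage (modality-specific loss) land on the same device, so no
# extra pipeline hop is needed between them.
def v_half_mapping(num_stages, num_devices):
    # Walk down the devices, then back up, folding the pipeline into a V.
    order = list(range(num_devices)) + list(range(num_devices - 1, -1, -1))
    return {stage: order[stage % len(order)] for stage in range(num_stages)}

# Example: 8 stages over 4 devices -> stage 0 (embedding) and stage 7 (loss)
# both map to device 0.
print(v_half_mapping(8, 4))   # {0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 2, 6: 1, 7: 0}
```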
Data Balancing and Reinforcement Learning
To avoid homogenized “AI style” images, the team applies a clustering‑based re‑balancing that de‑duplicates dense clusters and up‑weights rare concepts (e.g., obscure flora, specialized instruments). In reinforcement learning, the discrete visual latent space serves as an action space, allowing the use of language‑model‑centric RL algorithms such as GRPO. A sequence‑level filtering mechanism discards token sequences that exhibit entropy‑driven divergence, stabilizing RL training.
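A possible shape for the clustering‑based re‑balancing is sketched below, using k‑means as a stand‑in for the paper's clustering and inverse cluster size as the weighting; both choices are assumptions.

```python
# Sketch of cluster-based re-balancing: down-weight samples from dense
# clusters, up-weight rare concepts. Clustering backend and weighting
# scheme are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def rebalance_weights(embeddings, num_clusters=1000):
    labels = KMeans(n_clusters=num_clusters, n_init="auto").fit_predict(embeddings)
    counts = np.bincount(labels, minlength=num_clusters)
    # Each cluster gets roughly equal total probability mass, so rare
    # concepts (small clusters) are sampled more often per example.
    weights = 1.0 / counts[labels]
    return weights / weights.sum()   # normalized sampling probabilities
```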
Future Directions
The authors identify two core challenges: maintaining semantic completeness under higher compression rates, and improving stability and controllability for long sequences in the unified token space. They envision moving beyond “text‑to‑image” or “image‑to‑text” toward truly arbitrary multimodal interactions, including multi‑turn visual dialogue and dynamic cross‑modal reasoning.
Conclusion
The paper proposes the DiNA framework, demonstrating that when images, audio and text are all represented as discrete tokens, the resulting model can achieve fine‑grained understanding, high‑quality generation, and robust cross‑modal semantics, suggesting a promising path toward general multimodal intelligence.