LongCat-Flash-Omni: 560B Open‑Source Multimodal Model with Real‑Time Interaction
LongCat-Flash-Omni, the latest open‑source model from Meituan, combines a 560 billion‑parameter architecture, efficient multimodal perception and speech reconstruction modules, and a progressive training strategy to deliver real‑time audio‑video interaction and state‑of‑the‑art performance across text, image, audio, and video tasks.
On September 1 Meituan released the LongCat‑Flash series, and the newest member, LongCat‑Flash‑Omni, is now open‑source.
LongCat‑Flash‑Omni builds on the efficient Shortcut‑Connected MoE (ScMoE) architecture (including zero‑compute experts) and integrates high‑efficiency multimodal perception and speech reconstruction modules. Despite a total parameter count of 560 billion (roughly 27 billion activated per token), it achieves low‑latency real‑time audio‑video interaction.
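The article doesn't reproduce the ScMoE internals, but the zero‑compute‑expert idea is easy to picture: the router can send a token to an identity "expert" that returns it unchanged, so that token skips the FFN entirely. Below is a minimal PyTorch sketch of that routing only (the shortcut connection that overlaps communication with computation is not modeled); all dimensions, expert counts, and names are illustrative, not the model's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroComputeMoE(nn.Module):
    """Illustrative MoE layer: some experts are identity ("zero-compute")
    functions, so tokens routed to them bypass the FFN entirely."""

    def __init__(self, d_model=512, n_ffn_experts=6, n_zero_experts=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.n_ffn = n_ffn_experts
        self.n_total = n_ffn_experts + n_zero_experts
        self.router = nn.Linear(d_model, self.n_total)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize top-k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx, w = top_idx[:, slot], top_w[:, slot:slot + 1]
            for e in range(self.n_total):
                mask = idx == e
                if not mask.any():
                    continue
                if e < self.n_ffn:               # real FFN expert
                    out[mask] += w[mask] * self.experts[e](x[mask])
                else:                            # zero-compute expert: identity
                    out[mask] += w[mask] * x[mask]
        return out
```

Tokens routed to an identity expert cost essentially no FLOPs, which is how a 560‑billion‑parameter model can keep its activated‑parameter count per token a small fraction of the total.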
Comprehensive evaluations show that LongCat‑Flash‑Omni reaches open‑source state‑of‑the‑art (SOTA) on full‑modal benchmarks and demonstrates strong competitiveness on individual text, image, video, and speech tasks. It is the first open‑source large language model to combine full‑modal coverage, end‑to‑end architecture, and efficient inference at this scale, delivering millisecond‑level response times.
Hugging Face: https://huggingface.co/meituan-longcat/LongCat-Flash-Omni
GitHub: https://github.com/meituan-longcat/LongCat-Flash-Omni
LongCat‑Flash‑Omni integrates offline multimodal understanding with real‑time audio‑video interaction in a unified end‑to‑end framework. Visual and audio encoders act as multimodal sensors, the LLM directly processes inputs and generates text and speech tokens, and a lightweight audio decoder reconstructs natural speech waveforms, all with streaming‑optimized design.
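The repository, not this article, is the source of truth for interfaces; as a reading aid, here is a toy sketch of the dataflow just described (encoders as sensors → LLM emitting text and speech tokens → lightweight audio decoder). Every function below is a stand‑in placeholder, not the real LongCat‑Flash‑Omni API.

```python
# Placeholder dataflow for the loop described above; the token handling is
# deliberately toy-level and none of these functions exist in the real model.

def encode_frame(frame: bytes) -> list[int]:
    """Visual encoder acting as a 'sensor': raw frame -> vision tokens."""
    return [b % 251 for b in frame[:4]]

def encode_audio(chunk: bytes) -> list[int]:
    """Audio encoder: raw waveform chunk -> audio tokens."""
    return [b % 251 for b in chunk[:4]]

def llm_generate(context: list[int]) -> tuple[list[int], list[int]]:
    """Stand-in LLM step: consumes the running token context and emits the
    next text tokens and speech tokens (here, trivially derived)."""
    tail = context[-1] if context else 0
    return [tail + 1], [tail + 2, tail + 3]

def decode_speech(speech_tokens: list[int]) -> bytes:
    """Lightweight audio decoder: speech tokens -> waveform chunk."""
    return bytes(t % 256 for t in speech_tokens)

context: list[int] = []
for frame, chunk in [(b"frame0", b"audio0"), (b"frame1", b"audio1")]:
    context += encode_frame(frame) + encode_audio(chunk)   # perceive
    text, speech = llm_generate(context)                   # think
    waveform = decode_speech(speech)                       # speak
    context += text + speech                               # stream state forward
    print(text, speech, waveform)
```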
Even with 560 billion parameters, the model achieves low‑latency interaction thanks to the ScMoE backbone, efficient multimodal codecs, and a block‑wise audio‑video feature interleaving mechanism. It supports a 128K token context window and over 8 minutes of continuous audio‑video interaction, excelling in long‑term memory, multi‑turn dialogue, and temporal reasoning.
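The block‑wise interleaving idea can be illustrated in a few lines of Python: the two token streams are merged time‑block by time‑block so the LLM always sees synchronized audio and video for each slice of wall‑clock time. The fixed block size and flat token lists below are simplifying assumptions, not the model's published layout.

```python
# A minimal sketch of block-wise audio-video feature interleaving, assuming
# each modality is chunked into fixed-size blocks; the block size here is an
# arbitrary illustrative value.

def interleave_blocks(video_tokens, audio_tokens, block=4):
    """Merge two token streams block by block so each time slice's video
    tokens are immediately followed by its audio tokens."""
    merged = []
    for i in range(0, max(len(video_tokens), len(audio_tokens)), block):
        merged += video_tokens[i:i + block]   # video block for this time slice
        merged += audio_tokens[i:i + block]   # matching audio block
    return merged

# Example with block=2:
# ['v0', 'v1', 'a0', 'a1', 'v2', 'v3', 'a2', 'a3', 'v4', 'v5', 'a4', 'a5']
print(interleave_blocks([f"v{i}" for i in range(6)],
                        [f"a{i}" for i in range(6)], block=2))
```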
The model adopts a progressive early multimodal fusion training strategy to address modality heterogeneity, gradually incorporating text, audio, image, and video data without degrading any single‑modal performance; the staged curriculum is sketched after the list below.
Stage 0: Large‑scale text pre‑training.
Stage 1: Introduce speech data aligned with text.
Stage 2: Add large‑scale image‑caption pairs for visual‑language alignment.
Stage 3: Incorporate complex video data for spatio‑temporal reasoning.
Stage 4: Expand context window to 128K tokens.
Stage 5: Align audio encoder to handle continuous audio features.
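Put together, the stages amount to a data‑mixture curriculum. The sketch below mirrors the stage names above, but the mixture ratios and intermediate context lengths are invented placeholders, not Meituan's published training recipe.

```python
# Illustrative staged-fusion curriculum; only the stage ordering comes from
# the article. All mixture ratios and the 8,192-token intermediate context
# are assumptions for illustration.

STAGES = [
    {"name": "stage0_text",      "mix": {"text": 1.0},                                              "context": 8_192},
    {"name": "stage1_speech",    "mix": {"text": 0.7, "speech": 0.3},                               "context": 8_192},
    {"name": "stage2_image",     "mix": {"text": 0.5, "speech": 0.2, "image": 0.3},                 "context": 8_192},
    {"name": "stage3_video",     "mix": {"text": 0.4, "speech": 0.2, "image": 0.2, "video": 0.2},   "context": 8_192},
    {"name": "stage4_long_ctx",  "mix": {"text": 0.4, "speech": 0.2, "image": 0.2, "video": 0.2},   "context": 131_072},
    {"name": "stage5_audio_enc", "mix": {"speech": 1.0},                                            "context": 131_072},
]

for stage in STAGES:
    assert abs(sum(stage["mix"].values()) - 1.0) < 1e-9   # each mixture sums to 1
    print(f"{stage['name']}: mix={stage['mix']}, context={stage['context']}")
```

The key design choice the article describes is that earlier modalities stay in the mixture as new ones are added, which is what prevents later stages from eroding single‑modal performance.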
Benchmark results show LongCat‑Flash‑Omni achieving SOTA on OmniBench and WorldSense, and leading performance across modalities:
Text: Maintains and improves upon the series’ strong textual capabilities.
Image: RealWorldQA score of 74.8, comparable to closed‑source Gemini‑2.5‑Pro and surpassing other open‑source models.
Audio: Superior ASR on LibriSpeech and AISHELL‑1, strong TTS, S2TT, and audio understanding scores, approaching closed‑source performance.
Video: Best current results on video‑to‑text tasks, excelling on short and long video understanding.
Cross‑modal: Outperforms Gemini‑2.5‑Flash and matches Gemini‑2.5‑Pro on real‑world audio‑video benchmarks.
An end‑to‑end interaction evaluation, combining quantitative user ratings (250 users) and qualitative expert analysis (10 experts, 200 dialogues), shows LongCat‑Flash‑Omni surpassing the best open‑source model by 0.56 points in naturalness and fluency.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle-services e‑commerce platform, serving hundreds of millions of consumers and millions of merchants across more than 2,000 industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.