LongCat-Flash-Omni: How an Open-Source 560B Model Achieves Real-Time Multimodal Mastery
LongCat-Flash-Omni, an open-source 560-billion-parameter multimodal model, combines an efficient Shortcut-Connected MoE architecture with advanced perception and speech modules to deliver low-latency real-time audio-video interaction and state-of-the-art performance across text, image, video, and audio tasks.
LongCat-Flash-Omni, the latest member of Meituan's LongCat-Flash series, builds on the efficient Shortcut-Connected MoE (ScMoE) architecture with zero-compute experts and adds high-efficiency multimodal perception and speech reconstruction modules.
Despite a total parameter count of 560 billion (roughly 27 billion activated per token), the model delivers low-latency real-time audio-video interaction, giving developers an efficient option for multimodal applications.
Comprehensive evaluations show LongCat‑Flash‑Omni reaches open‑source state‑of‑the‑art performance on full‑modality benchmarks and demonstrates strong competitiveness on text, image, video understanding, and speech perception/generation tasks.
It is the first open‑source large language model that simultaneously achieves full‑modality coverage, an end‑to‑end architecture, and efficient inference at massive scale, enabling millisecond‑level responses.
LongCat-Flash-Omni adopts a unified framework that combines offline multimodal understanding with real-time audio-video interaction. Visual and audio encoders act as perception modules; the LLM processes inputs and generates text and speech tokens, which a lightweight audio decoder reconstructs into natural speech waveforms. All modules are designed for streaming inference, with each encoder/decoder at around 600 million parameters, preserving the series' high-efficiency design.
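To make that perceive-think-speak loop concrete, here is a minimal Python sketch of how such a chunked streaming pipeline could be wired together. Every class and method name (`Encoder`, `Backbone`, `AudioDecoder`, `interact`) is a hypothetical stand-in for the roles the article describes, not the released API.

```python
from dataclasses import dataclass, field

@dataclass
class StreamState:
    tokens: list = field(default_factory=list)  # running multimodal context

class Encoder:
    """Placeholder perception module (~600M params in the real system): chunk -> tokens."""
    def __init__(self, modality):
        self.modality = modality
    def encode(self, chunk):
        return [f"{self.modality}:{chunk}"]

class Backbone:
    """Placeholder ScMoE LLM: consumes interleaved tokens, emits text + speech tokens."""
    def step(self, state):
        return ["text:ack"], ["speech:unit"]

class AudioDecoder:
    """Placeholder lightweight decoder: speech tokens -> waveform bytes."""
    def synthesize(self, speech_tokens):
        return b"\x00" * 160 * len(speech_tokens)  # dummy PCM frames

def interact(av_chunks):
    vision, audio = Encoder("video"), Encoder("audio")
    llm, vocoder = Backbone(), AudioDecoder()
    state = StreamState()
    for video_chunk, audio_chunk in av_chunks:
        # Perception runs per chunk, so tokens reach the LLM while the user is still speaking.
        state.tokens += vision.encode(video_chunk) + audio.encode(audio_chunk)
        text, speech = llm.step(state)           # incremental generation step
        yield text, vocoder.synthesize(speech)   # speech streams back out

for text, pcm in interact([("frame0", "win0"), ("frame1", "win1")]):
    print(text, len(pcm), "bytes of audio")
```

The point of the sketch is the interleaving: perception, generation, and synthesis each advance per chunk rather than waiting for the full utterance, which is what keeps end-to-end latency low.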
Leveraging the ScMoE backbone, the model supports a 128K-token context window and over 8 minutes of continuous audio-video interaction, delivering high-quality processing and streaming speech generation.
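A back-of-envelope check shows how an 8-minute session can fit in a 128K window. The per-second token rates below are illustrative assumptions for the arithmetic, not figures from the LongCat-Flash-Omni report.

```python
# Illustrative budget check: how 8+ minutes of audio-video interaction could
# fit in a 128K-token context. All rates below are assumed, not published.
CONTEXT = 128_000          # tokens
AUDIO_TOK_PER_S = 12.5     # assumed discrete-audio token rate (tokens/second)
VIDEO_FPS = 2              # assumed sampled video frame rate
TOK_PER_FRAME = 64         # assumed visual tokens per frame

per_second = AUDIO_TOK_PER_S + VIDEO_FPS * TOK_PER_FRAME   # 140.5 tokens/s
minutes = CONTEXT / per_second / 60
print(f"{per_second} tokens/s -> ~{minutes:.1f} minutes of interaction")
# ~15.2 minutes under these assumed rates, leaving headroom beyond the
# 8-minute figure for response tokens and denser sampling.
```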
To address modality heterogeneity, LongCat-Flash-Omni adopts a progressive early-fusion training strategy that gradually incorporates text, audio, and video, ensuring strong multimodal performance without degrading any single-modality capability. The six stages run as follows (a schedule sketch appears after the list):
Stage 0: Large‑scale text pre‑training to establish a solid foundation.
Stage 1: Introduction of speech data aligned with text to integrate acoustic representations.
Stage 2: Incorporation of large‑scale image‑caption pairs for visual‑language alignment.
Stage 3: Inclusion of complex video data for spatio‑temporal reasoning and enhanced visual understanding.
Stage 4: Expansion of the context window from 8K to 128K tokens for long‑context inference.
Stage 5: Alignment training of the audio encoder to handle continuous audio features and improve speech task fidelity.
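The staged curriculum above can be written down as a simple schedule. The structure below is a hypothetical sketch that mirrors the six stages the article describes; the field names and the notion of a per-stage config are illustrative, not Meituan's actual training configuration.

```python
# Hypothetical schedule mirroring the progressive early-fusion stages above.
STAGES = [
    {"stage": 0, "data": ["text"],                               "context": 8_192},
    {"stage": 1, "data": ["text", "speech"],                     "context": 8_192},
    {"stage": 2, "data": ["text", "speech", "image"],            "context": 8_192},
    {"stage": 3, "data": ["text", "speech", "image", "video"],   "context": 8_192},
    {"stage": 4, "data": ["text", "speech", "image", "video"],   "context": 131_072},  # 8K -> 128K
    {"stage": 5, "data": ["continuous_audio"],                   "context": 131_072},  # audio-encoder alignment
]

for cfg in STAGES:
    print(f"stage {cfg['stage']}: modalities={cfg['data']}, ctx={cfg['context']}")
```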
Extensive evaluation shows the model attains open-source SOTA on OmniBench and WorldSense and leads open-source models across text, image, audio, video, and cross-modal tasks.
Text: Maintains and improves performance across multiple domains compared to earlier LongCat‑Flash versions.
Image Understanding: Scores 74.8 on RealWorldQA, comparable to the closed‑source Gemini‑2.5‑Pro and surpassing other open‑source models.
Audio Capability: Excels in ASR, TTS, speech‑to‑text translation, and audio understanding benchmarks, often outperforming Gemini‑2.5‑Pro.
Video Understanding: Achieves top performance on video‑to‑text tasks, with short‑video results far above existing models and long‑video results on par with Gemini‑2.5‑Pro.
Cross‑Modal Understanding: Outperforms Gemini‑2.5‑Flash and matches Gemini‑2.5‑Pro on real‑world audio‑video benchmarks.
Since no standard benchmark for real-time multimodal interaction exists, the LongCat team built an in-house evaluation combining quantitative user ratings (250 users) with qualitative expert analysis (10 experts, 200 dialogues). LongCat-Flash-Omni outperformed the best open-source model, Qwen3-Omni, by 0.56 points on naturalness and fluency, while still trailing top-tier closed-source models on real-time responsiveness, human-likeness, and accuracy.
Model resources:
Hugging Face: https://huggingface.co/meituan-longcat/LongCat-Flash-Omni
GitHub: https://github.com/meituan-longcat/LongCat-Flash-Omni
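If the repository follows the usual Hugging Face conventions, a first experiment might look like the sketch below. Whether `AutoModelForCausalLM` is the right Auto class for an omni model, and how audio/video inputs are actually passed, depends on the repo's remote code, so treat this as a starting point and defer to the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "meituan-longcat/LongCat-Flash-Omni"

# trust_remote_code pulls in the repo's custom architecture, if it ships one.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    device_map="auto",    # shard the 560B checkpoint across available GPUs
    torch_dtype="auto",
)

inputs = tokenizer("Describe what you hear and see.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```

Note that a 560B-parameter checkpoint requires a multi-GPU or multi-node setup even in reduced precision; the snippet is illustrative, not a deployment recipe.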