Inside Meituan’s LongCat‑Flash‑Chat: 560B‑Parameter MoE Model with Ultra‑Fast Inference

Meituan has open‑sourced LongCat‑Flash‑Chat, a 560‑billion‑parameter Mixture‑of‑Experts model that activates only a fraction of its weights per token, delivering mainstream‑level performance, fast inference, and low cost for complex agent applications.

Meituan officially released and open‑sourced its first large model, LongCat‑Flash‑Chat, on GitHub and Hugging Face, and launched the dedicated website longcat.ai.

LongCat‑Flash adopts an innovative Mixture‑of‑Experts (MoE) architecture with 560 billion total parameters, but each token activates only 18.6 billion to 31.3 billion of them (about 27 billion on average). This selective activation improves computational efficiency while preserving strong performance. Benchmark results show that, despite using far fewer active parameters, LongCat‑Flash‑Chat matches the overall performance of current leading models and shows notable advantages on agent‑related tasks. The model is specifically optimized for inference speed, making it well suited to high‑complexity, long‑duration intelligent‑agent applications.
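To make the "selective activation" idea concrete, here is a minimal top‑k MoE routing sketch in plain NumPy. This is purely illustrative; LongCat‑Flash's actual router, expert count, and load‑balancing machinery are more elaborate (see the technical report), and all names and sizes below are invented for the example.

```python
import numpy as np

def topk_moe_forward(x, gate_w, experts, k=2):
    """Toy top-k MoE layer: score all experts, run only the k best,
    and combine their outputs with softmax-normalized weights."""
    scores = x @ gate_w                       # router logits, one per expert
    topk = np.argsort(scores)[-k:]            # indices of the k selected experts
    weights = np.exp(scores[topk])
    weights /= weights.sum()                  # softmax over the selected experts only
    # Only the selected experts compute -- the remaining parameters stay idle.
    y = sum(w * experts[i](x) for w, i in zip(weights, topk))
    return y, topk

# Hypothetical tiny configuration: 16 experts, hidden size 8, 2 active per token.
rng = np.random.default_rng(0)
d, n_experts = 8, 16
gate_w = rng.normal(size=(d, n_experts))
experts = [(lambda W: (lambda v: v @ W))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]

y, active = topk_moe_forward(rng.normal(size=d), gate_w, experts, k=2)
print(f"activated {len(active)} of {n_experts} experts")
```

The per‑token compute scales with `k`, not with the total expert count, which is why a 560B‑parameter model can run at the cost of a model a fraction of that size.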

Cross‑layer communication channels are incorporated within the MoE layers, allowing communication and computation to be highly parallelized, which significantly boosts both training and inference efficiency. Combined with deep low‑level optimizations, LongCat‑Flash completed training in just 30 days and achieves over 100 tokens per second per user on NVIDIA H800 GPUs.
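The payoff of overlapping communication with computation can be seen with a back‑of‑envelope timing model. The numbers below are hypothetical, not measurements from LongCat‑Flash: the point is only that a serial schedule pays the sum of the two costs, while an overlapped schedule is bounded by the slower one.

```python
def step_time(compute_ms, comm_ms, overlapped):
    """Per-layer step time: serial execution pays compute + communication;
    with comm/compute overlap the step is bounded by the slower of the two."""
    return max(compute_ms, comm_ms) if overlapped else compute_ms + comm_ms

compute, comm = 1.2, 0.9   # hypothetical per-layer costs in milliseconds
serial = step_time(compute, comm, overlapped=False)   # ~2.1 ms
overlap = step_time(compute, comm, overlapped=True)   # 1.2 ms
print(f"speedup from overlap: {serial / overlap:.2f}x")
```

When expert all‑to‑all communication and dense computation are comparable in cost, hiding one behind the other approaches a 2x step‑time reduction, which is the regime these cross‑layer channels target.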

Through algorithm‑engineering co‑optimization, LongCat‑Flash achieves lower theoretical compute cost and higher inference speed than models of similar, or even smaller, scale. System‑level deep optimization enables a generation speed of 100 tokens/s on H800 while keeping inference cost extremely low, at approximately 5 CNY per million output tokens.
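Taking the quoted figures at face value, the serving economics are easy to work out. This is plain arithmetic on the numbers above, not data from the technical report:

```python
def serving_economics(price_cny_per_m=5.0, tokens_per_s=100):
    """Back-of-envelope numbers from the quoted price and per-user speed."""
    per_token_cny = price_cny_per_m / 1_000_000     # cost of one output token
    hours_per_m = 1_000_000 / tokens_per_s / 3600   # wall time for 1M tokens, one stream
    return per_token_cny, hours_per_m

per_token, hours = serving_economics()
print(f"{per_token:.8f} CNY per token, ~{hours:.2f} h to emit 1M tokens on one stream")
```

At 100 tokens/s a single stream takes roughly 2.8 hours to emit a million tokens, at a total cost of about 5 CNY, i.e. 0.000005 CNY per token.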

Hugging Face: https://huggingface.co/meituan-longcat

GitHub: https://github.com/meituan-longcat/LongCat-Flash-Chat

Technical report: https://github.com/meituan-longcat/LongCat-Flash-Chat/blob/main/tech_report.pdf

Website: https://longcat.ai/

Tags: Artificial Intelligence, Inference Optimization, Mixture of Experts, Open Source, Large Language Model, Meituan
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career as we grow together.
