How LongCat-Flash Achieves Ultra-Fast, Low-Cost AI Agent Inference with SGLang
LongCat-Flash, an open-source Mixture-of-Experts model released by Meituan, combines model-system co-design with PD disaggregation, SBO scheduling, and large-scale expert parallelism in the SGLang framework, delivering dramatically lower latency, higher throughput, and cost-effective inference for AI agents. Detailed deployment instructions are included.
1. Introduction: Meituan Open‑Sources LongCat-Flash Agent Model
LongCat-Flash is an innovative Mixture-of-Experts (MoE) model released by Meituan on September 1 and open-sourced on Hugging Face. It has 560 billion parameters across 512 feed-forward experts plus 256 zero-computation experts, employs Shortcut-Connected MoE (ScMoE) to overlap computation with communication, and integrates Multi-head Latent Attention (MLA).
Benchmarks show that, despite being a non‑thinking base model, LongCat‑Flash matches or exceeds the performance of leading models with far fewer activated parameters per token, making it especially strong for agent tasks while offering significantly faster inference.
Figure 1: LongCat‑Flash technical specifications.
2. Why Model‑System Co‑Design Is Critical
LongCat-Flash targets both throughput and latency in agent scenarios, where the ReAct pattern demands fast prefill and decode. To this end, the design introduces a zero-computation-expert mechanism that dynamically reduces the number of activated parameters for less important tokens, keeping per-token activation between 18.6 B and 31.3 B (average ~27 B, 12 activated experts per token).
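As a minimal sketch of that mechanism (hypothetical shapes and helper names, not the production router): zero-computation experts are identity slots in the routing table, so tokens routed to them add no FFN FLOPs.

import torch
import torch.nn.functional as F

def moe_forward(x, gate, experts, n_ffn=512, n_zero=256, top_k=12):
    # x: [tokens, hidden]; gate scores every slot: n_ffn real FFN experts
    # followed by n_zero zero-computation (identity) experts.
    logits = gate(x)                                   # [tokens, n_ffn + n_zero]
    weights, idx = torch.topk(F.softmax(logits, dim=-1), top_k, dim=-1)
    out = torch.zeros_like(x)
    for e in range(n_ffn):                             # real experts: FFN compute
        hit = (idx == e)
        rows = hit.any(dim=-1)
        if rows.any():
            w = (weights * hit)[rows].sum(dim=-1, keepdim=True)
            out[rows] += w * experts[e](x[rows])
    zero_hit = idx >= n_ffn                            # zero experts: identity, zero FLOPs
    out += (weights * zero_hit).sum(dim=-1, keepdim=True) * x
    return out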
During decoding, the high sparsity of MoE models means very large batches are needed before the expert GEMMs become compute-bound. LongCat-Flash mitigates this with ScMoE and Single Batch Overlap (SBO), overlapping communication with computation to achieve both high throughput and low latency.
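A back-of-envelope roofline estimate illustrates the problem (illustrative H800-class numbers; exact figures vary):

# Illustrative roofline: when does an expert GEMM become compute-bound?
peak_flops = 989e12     # dense BF16 tensor-core FLOP/s, H800/H100 class
hbm_bw     = 3.35e12    # HBM bytes/s
ridge      = peak_flops / hbm_bw            # ~295 FLOP/byte needed
# A BF16 GEMM over M tokens has intensity ~ M FLOP/byte once weight reads
# dominate, so each expert needs roughly 295 tokens per step.
n_experts, top_k = 512, 12
batch_needed = ridge * n_experts / top_k
print(f"~{batch_needed:,.0f} tokens in flight before expert GEMMs saturate compute")
# ~12,600 tokens: impractical at low latency, hence overlap instead of batching.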
3. Our Solution: SGLang + PD Separation + SBO Scheduling + Large‑Scale EP Deployment
3.1 PD Separation
We adopt a PD-disaggregated architecture that decouples the prefill and decode stages and uses layer-wise transmission to significantly reduce first-token latency under high-QPS loads.
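Schematically, layer-wise transmission looks like the following (hypothetical helper names, not the actual SGLang transfer engine): each layer's KV cache is shipped as soon as it is produced, so transfer overlaps the remaining prefill compute.

from concurrent.futures import ThreadPoolExecutor

def prefill_and_stream(layers, hidden, send_kv):
    # layers: callables returning (new_hidden, kv_for_layer);
    # send_kv(i, kv): pushes layer i's KV cache to the decode worker.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i, layer in enumerate(layers):
            hidden, kv = layer(hidden)     # compute layer i over the whole prompt
            pool.submit(send_kv, i, kv)    # transfer overlaps layer i+1's compute
    return hidden                          # decode worker already holds most KV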
3.2 SBO
SBO implements a four-stage pipeline with module-level overlap (a schematic sketch follows Figure 2):
Stage 1: MLA executes on its own; its output feeds the stages that follow.
Stage 2: All-to-all dispatch, overlapped with the dense FFN and QKV projection.
Stage 3: MoE GEMM executes independently, benefiting from expert parallelism (EP).
Stage 4: The second attention block and dense FFN, overlapped with the all-to-all combine.
This design breaks the usual throughput-latency trade-off, delivering both higher throughput and lower latency.
Figure 2: SBO scheduling diagram.
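The following PyTorch-style sketch expresses that pipeline with two CUDA streams (module names such as mla, dispatch, and tail_compute are hypothetical stand-ins; the production implementation uses fused custom kernels):

import torch

comm = torch.cuda.Stream()   # second stream dedicated to all-to-all traffic

def sbo_layer(x, mla, dense_ffn, dispatch, moe_gemm, combine, tail_compute):
    main = torch.cuda.current_stream()
    attn_out = mla(x)                       # Stage 1: MLA runs by itself
    comm.wait_stream(main)                  # dispatch may now read attn_out
    with torch.cuda.stream(comm):
        routed = dispatch(attn_out)         # Stage 2: all-to-all dispatch ...
    dense_out = dense_ffn(attn_out)         # ... hidden behind dense FFN / QKV proj
    main.wait_stream(comm)
    expert_out = moe_gemm(routed)           # Stage 3: per-rank expert GEMMs (EP)
    comm.wait_stream(main)
    with torch.cuda.stream(comm):
        moe_out = combine(expert_out)       # Stage 4: all-to-all combine ...
    tail = tail_compute(dense_out)          # ... hidden behind 2nd attn + dense FFN
    main.wait_stream(comm)
    return tail + moe_out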
3.3 Large‑Scale Expert Parallel Deployment
Scaling expert parallelism (EP) frees HBM for the KV cache and shortens MoE compute time. At EP = 128, for example, per-GPU MoE weights shrink to just 5.3 % of device memory, leaving the KV cache as the dominant consumer. Combined with SBO, EP = 128 achieves ~10 ms TPOT and ~800 tokens/s per GPU.
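The 5.3 % figure is easy to reproduce with rough arithmetic (assuming FP8 weights, an 80 GB card, and that expert weights make up ~95 % of the model; all three are assumptions for illustration):

total_params = 560e9
expert_share = 0.95        # assumption: expert weights dominate the model
ep_degree    = 128
per_gpu_bytes = total_params * expert_share / ep_degree * 1   # 1 byte/param at FP8
print(f"{per_gpu_bytes / 1e9:.1f} GB of expert weights per GPU "
      f"({per_gpu_bytes / 80e9:.1%} of an 80 GB card)")
# ~4.2 GB, ~5.2% -- consistent with the 5.3% figure; the freed HBM holds KV cache.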
3.4 Additional Optimizations
A multi-step overlapping scheduler that keeps GPUs busy even though each forward pass is short enough for CPU-side scheduling to become the bottleneck.
Speculative decoding (MagicDec) with a lightweight MTP draft model, plus C2T filtering to reduce verification cost; a schematic loop follows this list.
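A generic draft-and-verify loop shows where each piece fits (schematic only; target_verify, mtp_propose, and c2t_keep are hypothetical callables standing in for the target model, the MTP draft head, and the C2T filter):

def speculative_decode(target_verify, mtp_propose, c2t_keep, prompt, max_new=256):
    # target_verify(ctx, drafts) -> list of accepted tokens; with an empty
    # draft list it returns the single token the target would emit itself.
    tokens = list(prompt)
    produced = 0
    while produced < max_new:
        drafts = [t for t in mtp_propose(tokens) if c2t_keep(tokens, t)]  # C2T prunes unlikely drafts
        accepted = target_verify(tokens, drafts)   # one target pass checks every draft
        if not accepted:
            accepted = target_verify(tokens, [])   # fallback: plain single-token decode
        tokens += accepted
        produced += len(accepted)
    return tokens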
4. Performance Results
With the optimizations above, LongCat-Flash outpaces models of the same and even smaller size. On a public H800 instance costing ¥14 per hour (≈ $2), it reaches 100 tokens/s (TPOT = 10 ms) at a cost of about ¥5 (≈ $0.70) per million tokens.
5. Deploying with SGLang
We recommend using SGLang to serve LongCat‑Flash. The model (560 B parameters) requires at least 8 × H20‑141G GPUs in FP8 mode or 16 × H800‑80G GPUs in BF16 mode.
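As a quick sanity check on why those GPU counts are the floor, consider the weight memory alone (a back-of-envelope estimate; KV cache and activations come on top):

params = 560e9  # total parameters
# FP8 stores 1 byte/param, BF16 stores 2 bytes/param; overheads ignored.
print(f"FP8 : {params * 1 / 8 / 1e9:.0f} GB of weights per GPU on 8 x H20-141G")
print(f"BF16: {params * 2 / 16 / 1e9:.0f} GB of weights per GPU on 16 x H800-80G")
# Both land near 70 GB per GPU before any KV cache, which is why the
# 141 GB cards leave far more headroom for long-context serving.

Install SGLang first: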
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.5.1.post3"

Single-node deployment (8 × H20-141G)
python3 -m sglang.launch_server \
--model meituan-longcat/LongCat-Flash-Chat-FP8 \
--trust-remote-code \
--attention-backend flashinfer \
--enable-ep-moe \
--tp 8

Multi-node deployment (16 × H800-80G)
python3 -m sglang.launch_server \
--model meituan-longcat/LongCat-Flash-Chat \
--trust-remote-code \
--attention-backend flashinfer \
--enable-ep-moe \
--tp 16 \
--nnodes 2 \
--node-rank $NODE_RANK \
--dist-init-addr $MASTER_IP:5000

To enable Multi-Token Prediction (MTP), add the following flags:
--speculative-draft-model-path meituan-longcat/LongCat-Flash-Chat \
--speculative-algorithm NEXTN \
--speculative-num-draft-tokens 2 \
--speculative-num-steps 1 \
--speculative-eagle-topk 1
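Once the server is up, SGLang exposes an OpenAI-compatible API (port 30000 by default); a quick smoke test from Python:

import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="none")
resp = client.chat.completions.create(
    model="meituan-longcat/LongCat-Flash-Chat",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)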
6. Conclusion

By combining SGLang, PD separation, large-scale expert parallelism, and SBO scheduling, LongCat-Flash achieves ultra-low-cost, ultra-fast generation for AI agents. Ongoing collaboration with the open-source community will continue to spread these optimizations.
Meituan Technology Team
Over 10,000 engineers power China's leading lifestyle-services e-commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.