Meituan Technology Team
Meituan Technology Team
Sep 11, 2025 · Artificial Intelligence

How LongCat-Flash Achieves Ultra-Fast, Low-Cost AI Agent Inference with SGLang

LongCat-Flash, an open‑source Mixture‑of‑Experts model released by Meituan, leverages model‑system co‑design, PD‑disaggregation, SBO scheduling and large‑scale expert parallelism within the SGLang framework to deliver dramatically lower latency, higher throughput and cost‑effective inference for AI agents, with detailed deployment instructions provided.

LongCat-FlashMixture of ExpertsModel Inference
0 likes · 15 min read
How LongCat-Flash Achieves Ultra-Fast, Low-Cost AI Agent Inference with SGLang
Baobao Algorithm Notes
Baobao Algorithm Notes
Sep 2, 2025 · Artificial Intelligence

How LongCat‑Flash Achieves Record Speed and Efficiency for a 560B MoE Model

LongCat‑Flash is a 560‑billion‑parameter Mixture‑of‑Experts LLM that combines a dynamic zero‑computation expert design, shortcut‑connected MoE communication, variance‑aligned scaling, and a three‑stage agent‑centric pre‑training pipeline, delivering over 100 TPS on H800 GPUs at a cost of $0.70 per million tokens.

Artificial IntelligenceLarge Language ModelLongCat-Flash
0 likes · 23 min read
How LongCat‑Flash Achieves Record Speed and Efficiency for a 560B MoE Model