How LongCat-Flash Achieves Ultra-Fast, Low-Cost AI Agent Inference with SGLang

LongCat-Flash, an open‑source Mixture‑of‑Experts model released by Meituan, combines model‑system co‑design, prefill‑decode (PD) disaggregation, SBO scheduling, and large‑scale expert parallelism within the SGLang framework to deliver lower latency, higher throughput, and cost‑effective inference for AI agents. The article also provides detailed deployment instructions.

LongCat-Flash · Mixture of Experts · SGLang

Baobao Algorithm Notes

Sep 2, 2025 · Artificial Intelligence

How LongCat‑Flash Achieves Record Speed and Efficiency for a 560B MoE Model

LongCat‑Flash is a 560‑billion‑parameter Mixture‑of‑Experts LLM that combines a dynamic zero‑computation expert design, shortcut‑connected MoE communication, variance‑aligned scaling, and a three‑stage agent‑centric pre‑training pipeline. It delivers over 100 tokens per second (TPS) on H800 GPUs at a cost of $0.70 per million tokens.

Artificial Intelligence · LongCat-Flash · Mixture of Experts
