How LongCat‑Flash Achieves Record Speed and Efficiency for a 560B MoE Model
LongCat‑Flash is a 560‑billion‑parameter Mixture‑of‑Experts LLM that combines dynamic zero‑computation experts, shortcut‑connected MoE for compute‑communication overlap, variance‑aligned scaling techniques, and a three‑stage pre‑training curriculum geared toward agentic capabilities, delivering over 100 tokens per second (TPS) per user on H800 GPUs at an estimated cost of $0.70 per million output tokens.
Overview
LongCat‑Flash is a 560 B parameter Mixture‑of‑Experts (MoE) language model released in the second half of 2025. It is designed for high‑throughput inference, reaching >100 TPS per user on H800 GPUs while activating only 18.6 B–31.3 B parameters per token (≈27 B on average).
Key Architectural Innovations
Zero‑Computation Experts
The expert pool contains N regular FFN experts and Z “zero‑computation” experts that simply return their input, incurring no FLOPs. During routing, each token selects its top‑K experts using gate weights g = softmax(Wx + b) computed over all N + Z experts, where g_i sets the weight of expert i. Because some of a token’s K slots may land on zero‑computation experts, only a subset of the selected experts performs actual computation, letting the model allocate more compute to difficult tokens and almost none to easy ones while keeping the average activated‑parameter count low.
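To make the mechanism concrete, here is a minimal sketch (assumed shapes and module names, not the paper’s implementation) in which zero‑computation experts are plain identity functions, so only the FFN experts a token actually selects consume FLOPs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroComputeMoE(nn.Module):
    """Top-K routing over N FFN experts plus Z identity ("zero-computation") experts.
    Illustrative sketch only; sizes and layer shapes are placeholders."""

    def __init__(self, d_model: int, n_ffn: int, n_zero: int, top_k: int):
        super().__init__()
        self.n_ffn, self.top_k = n_ffn, top_k
        self.router = nn.Linear(d_model, n_ffn + n_zero)          # logits for all N + Z experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)                 # softmax over all experts
        weights, idx = gates.topk(self.top_k, dim=-1)             # per-token top-K choice
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            expert_id, w = idx[:, k], weights[:, k:k + 1]
            is_ffn = expert_id < self.n_ffn
            # Zero-computation experts (id >= n_ffn) pass the token through, weighted by its gate.
            out = out + torch.where(is_ffn.unsqueeze(-1), torch.zeros_like(x), w * x)
            # Real FFN experts run only on the tokens that actually selected them.
            for e in expert_id[is_ffn].unique().tolist():
                m = is_ffn & (expert_id == e)
                out[m] = out[m] + w[m] * self.experts[e](x[m])
        return out
```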
Shortcut‑Connected MoE (ScMoE)
ScMoE inserts a cross‑layer shortcut that lets the dense FFN of the previous layer run in parallel with the dispatch/combine stages of the current MoE layer. This creates a larger compute‑communication overlap window, enabling a “single‑batch overlap” (SBO) pipeline that reduces token‑level latency by ~50 % compared with traditional two‑batch pipelines.
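A rough sketch of how the overlap can be expressed (illustrative only; `dispatch` and `combine` stand in for the all‑to‑all token exchange, and the stream handling is a simplification of what a real SBO scheduler does):

```python
import torch

comm_stream = torch.cuda.Stream()

def scmoe_block(x, dense_ffn, moe_experts, dispatch, combine):
    """Illustrative shortcut-connected MoE step: the dense FFN on the shortcut
    path runs on the compute stream while the MoE all-to-all runs on a separate
    communication stream. `dispatch` / `combine` are hypothetical stand-ins."""
    comm_stream.wait_stream(torch.cuda.current_stream())   # make sure x is ready
    with torch.cuda.stream(comm_stream):
        routed = dispatch(x)            # all-to-all token exchange on the comm stream ...
    shortcut = dense_ffn(x)             # ... overlapped by the dense FFN on the compute stream
    torch.cuda.current_stream().wait_stream(comm_stream)   # tokens have arrived
    gathered = combine(moe_experts(routed))                # expert GEMMs + return all-to-all
    return shortcut + gathered          # merge the two paths (combination rule is illustrative)
```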
Variance Alignment for Scalability
Scaling to hundred‑billion‑parameter regimes revealed variance mis‑alignment in Multi‑head Latent Attention (MLA) and MoE modules. The authors introduce:
Scale‑correction factors for MLA that rescale query/key/value vectors to a unified reference variance.
A variance‑compensation factor for fine‑grained expert splitting that counteracts variance reduction caused by gating dilution and dimensional reduction.
These adjustments keep the initialization well‑conditioned and preserve performance at larger scales.
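The paper’s exact correction factors are not reproduced here, but the spirit can be illustrated with a simple calibration helper that measures a module’s output variance at initialization and returns a static rescaling factor (an assumption‑laden stand‑in, not the authors’ derivation):

```python
import torch

@torch.no_grad()
def variance_align_scale(module, sample_input, target_var=1.0):
    """Calibration-style helper in the spirit of variance alignment (not the
    paper's closed-form factors): measure the module's output variance at
    initialization and return a static scale that restores a reference variance."""
    measured = module(sample_input).float().var().item()
    return (target_var / max(measured, 1e-12)) ** 0.5

# e.g. apply as `out = scale * module(x)` with `scale` frozen after initialization
```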
Pre‑Training Curriculum
The training follows three stages:
Stage 1 – Base Model : ~20 T tokens, sequence length 8192, establishing a stable foundation.
Stage 2 – Capability Enhancement : High‑quality data (several T tokens) to improve reasoning, domain expertise, and tool‑use abilities.
Stage 3 – Long‑Context Extension : Context length gradually increased from 8 k to 128 k tokens.
Stability mechanisms include:
Routing‑weight cosine‑similarity monitoring.
Hidden z‑loss to penalize extreme activation values (a minimal sketch follows this list).
Adam epsilon set to 1e‑16.
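For the hidden z‑loss, a minimal sketch using the standard log‑sum‑exp z‑loss shape is shown below; applying that exact form to hidden states, and the coefficient value, are assumptions rather than details taken from the paper:

```python
import torch

def hidden_z_loss(hidden_states: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Z-loss-style penalty discouraging extreme activation magnitudes. The
    log-sum-exp form mirrors the standard output/router z-loss; using it on
    hidden states and the value of `coeff` are assumptions, not paper details."""
    lse = torch.logsumexp(hidden_states.float(), dim=-1)   # (batch, seq)
    return coeff * (lse ** 2).mean()
```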
Hyperparameter transfer uses width‑scaling theory to map the optimal initialization variance and learning rate from a smaller proxy model to the full model. Model‑growth initialization expands a trained 14‑layer checkpoint into a 28‑layer model by layer stacking, rather than initializing the full depth from scratch.
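Layer stacking itself is simple to picture; a minimal sketch (hypothetical helper, PyTorch) that duplicates a trained stack to double its depth:

```python
import copy
import torch.nn as nn

def grow_by_stacking(layers: nn.ModuleList) -> nn.ModuleList:
    """Model-growth initialization by layer stacking (illustrative helper):
    duplicate a trained 14-layer stack so the 28-layer model starts from
    trained weights instead of a random initialization."""
    grown = nn.ModuleList(copy.deepcopy(layer) for layer in layers)
    grown.extend(copy.deepcopy(layer) for layer in layers)
    return grown
```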
Post‑Training for Advanced Capabilities
Reasoning & Coding : Persona‑based self‑instruction data generation, multi‑model voting, and reward‑model verification for math tasks.
Agentic Tool Use : A multi‑agent synthesis framework (UserProfileAgent, ToolSetAgent, etc.) creates diverse, high‑difficulty agent tasks across information‑processing, tool‑set, and user‑interaction dimensions.
General Capabilities : Reverse‑prompt generation for complex instruction following; a 40‑category safety policy enforced by a two‑stage data synthesizer.
Training Infrastructure
Numerical Precision & Fault Detection : ULP‑based evaluation compares BF16 results against an FP32 reference; on‑chip operator recomputation detects silent data corruption, especially in FlashAttention gradients.
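A rough sketch of what an ULP‑style comparison can look like (assuming BF16’s 7 explicit mantissa bits, so one ULP at magnitude |x| is about 2^(floor(log2|x|) − 7); the actual harness may differ):

```python
import torch

def bf16_ulp_error(bf16_out: torch.Tensor, fp32_ref: torch.Tensor) -> torch.Tensor:
    """Express the BF16 vs. FP32 discrepancy in units of BF16 spacing at the
    reference magnitude (illustrative; assumes ulp(x) ~ 2**(floor(log2|x|) - 7)
    for BF16's 7 explicit mantissa bits)."""
    ref = fp32_ref.float()
    exponent = torch.floor(torch.log2(ref.abs().clamp_min(1e-30)))
    ulp = torch.exp2(exponent - 7)
    return (bf16_out.float() - ref).abs() / ulp

# e.g. flag an operator (or accelerator) as suspect when the max ULP error crosses a threshold
```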
Deterministic Kernels : Deterministic FlashAttention gradients (tile‑wise accumulation), deterministic ScatterAdd via hierarchical reduction, and optimized grouped GEMM kernels (5‑45 % speedup).
Distributed Strategy : Expert‑parallel groups of 32 accelerators combined with context‑parallelism, pipeline‑parallelism, and data‑parallelism. Token‑dimension chunking splits MoE computation into two blocks, reducing non‑overlapped communication from 25.3 % to 8.4 %.
V‑ZB Pipeline Parallelism : Balances memory across stages, keeping peak memory <60 GB and achieving near‑zero bubble efficiency.
Reliability & Observability : 98.48 % training availability; asynchronous checkpointing with 2‑4 s pause; fine‑grained profiling for bottleneck detection.
Inference and Deployment
SBO Pipeline : Four‑stage execution (MLA, dense FFN, dispatch, MoE GEMM) overlaps compute and communication within a single batch, eliminating the need for a second batch to hide latency.
KV‑Cache Compression : MLA’s built‑in cache compression reduces memory pressure, essential for SBO.
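The compression idea behind MLA can be sketched as caching only a low‑rank latent per token and re‑expanding keys and values at attention time (dimensions below are placeholders, and the decoupled RoPE key that MLA also carries is omitted):

```python
import torch.nn as nn

class LatentKVCache(nn.Module):
    """MLA-style KV compression sketch (placeholder dimensions): cache only a
    low-rank latent per token and re-expand keys/values when attention runs,
    instead of storing full per-head K and V."""
    def __init__(self, d_model=4096, d_latent=512, d_head=128, n_heads=32):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent -> keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent -> values

    def compress(self, hidden):      # this is what gets cached: d_latent floats per token
        return self.down(hidden)

    def expand(self, latent_cache):  # run at attention time to recover K and V
        return self.up_k(latent_cache), self.up_v(latent_cache)
```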
Speculative Decoding : A lightweight single‑dense‑layer draft model predicts multiple tokens; a C2T filter discards unlikely tokens before full model verification, keeping draft acceptance ~90 % while cutting latency.
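A greedy draft‑then‑verify step might look like the following sketch (`draft_model.propose`, `c2t_filter`, and the target model’s per‑position logits are hypothetical stand‑ins, not the paper’s API):

```python
import torch

@torch.no_grad()
def draft_and_verify(prefix, draft_model, target_model, c2t_filter, n_draft=4):
    """One speculative-decoding step (illustrative sketch): propose a few draft
    tokens, filter out low-confidence ones (stand-in for C2T), then verify the
    survivors with a single forward pass of the full model (greedy acceptance)."""
    drafts = c2t_filter(draft_model.propose(prefix, n_draft))      # hypothetical draft API
    logits = target_model(torch.cat([prefix, drafts]))             # (len(prefix)+len(drafts), vocab)
    verified = logits[len(prefix) - 1:-1].argmax(dim=-1)           # target's prediction at each draft slot
    accepted = []
    for d, v in zip(drafts.tolist(), verified.tolist()):
        if d != v:                                                 # first mismatch stops acceptance
            break
        accepted.append(d)
    if len(accepted) == len(drafts):                               # every draft accepted: take a bonus token
        next_token = logits[-1].argmax(dim=-1).item()
    else:                                                          # otherwise use the target's correction
        next_token = verified[len(accepted)].item()
    return accepted, next_token
```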
System‑Level Optimizations : CUDA‑Graph fusion of forward, verification, and draft passes; multi‑step overlap scheduler; custom kernels (SwapAB for MoE GEMM, NVLink‑accelerated communication kernels); per‑layer mixed‑precision block quantization.
Deployment Architecture : PD‑Disaggregated design separates pre‑fill and decode nodes; layer‑wise KV transfer reduces first‑token latency under high concurrency.
On 128 H800 GPUs, LongCat‑Flash running in bf16 delivers 100.5 TPS per user at an estimated cost of $0.70 per million output tokens.
Performance Evaluation
Benchmark results (higher‑is‑better):
General knowledge – 2nd on ArenaHard‑V2 (86.50), MMLU 89.71, CEval 90.44.
Agent tool use – 1st on VitaBench (24.30), best on τ2‑Bench.
Programming – 2nd on TerminalBench (39.51), 60.4 on SWE‑Bench‑Verified.
Instruction following – 1st on IFEval (89.65), top on COLLIE (57.10) and Meeseeks‑zh (43.03).
Resources
Code and model checkpoints are publicly available:
Trial site: https://longcat.chat
HuggingFace repo: https://huggingface.co/meituan-longcat/LongCat-Flash-Chat
GitHub repo: https://github.com/meituan-longcat/LongCat-Flash-Chat