How LongCat-Flash-Thinking Sets New SOTA in Open‑Source AI Inference
LongCat-Flash-Thinking, the latest open‑source model from Meituan, introduces domain‑parallel RL training, a high‑throughput RL infrastructure called DORA, and a dual‑path inference framework that together achieve state‑of‑the‑art performance on logical, mathematical, coding, and agentic tasks while maintaining top‑tier speed.
Introduction
Meituan has officially released LongCat-Flash-Thinking, the successor to LongCat-Flash. Building on the previous model’s algorithmic and infrastructure co‑design, the new version further improves inference capability and efficiency, reaching the leading open‑source performance on a wide range of reasoning tasks, with some results approaching closed‑source GPT‑5‑Thinking.
Key Technical Innovations
1. Domain‑parallel RL training and fusion: To address instability in traditional mixed‑domain RL training, a domain‑parallel scheme decouples optimization for STEM, coding, and agentic tasks. This stabilizes training and enables fusion of domain‑expert models into a single, near‑Pareto‑optimal model that performs well across all specialties.
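The fusion step can be pictured as merging per-domain expert checkpoints into one set of parameters. Meituan has not published the exact recipe here, so the sketch below assumes the simplest variant, weighted parameter averaging; the fusion weights and toy "models" are illustrative:

```python
# Hedged sketch: fusing domain-expert checkpoints by weighted parameter
# averaging. The actual LongCat fusion recipe is not public; the weights
# and toy parameter dicts below are illustrative assumptions only.
from typing import Dict, List

def fuse_experts(experts: List[Dict[str, float]],
                 weights: List[float]) -> Dict[str, float]:
    """Average each named parameter across domain-expert models."""
    assert abs(sum(weights) - 1.0) < 1e-9, "fusion weights should sum to 1"
    fused = {}
    for name in experts[0]:
        fused[name] = sum(w * e[name] for w, e in zip(weights, experts))
    return fused

# Toy 'experts': one scalar per parameter name instead of real tensors.
stem   = {"layer.w": 1.0, "layer.b": 0.0}
coding = {"layer.w": 3.0, "layer.b": 0.3}
agent  = {"layer.w": 2.0, "layer.b": 0.6}

merged = fuse_experts([stem, coding, agent], [0.4, 0.3, 0.3])
print(merged["layer.w"])  # 0.4*1.0 + 0.3*3.0 + 0.3*2.0 = 1.9
```

With real models the same loop would run over weight tensors rather than scalars, and the fusion weights would be tuned so no single domain dominates.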
2. Industrial‑grade RL infrastructure (DORA): Named DORA (Dynamic Orchestration for Asynchronous Rollout), the system uses elastic colocation scheduling and a multi‑version asynchronous pipeline. It delivers up to three‑fold speedup over synchronous RL frameworks while preserving per‑sample policy consistency, and supports KV‑cache reuse for clusters with tens of thousands of cards. During large‑scale asynchronous RL, FLOPs consumption is only about 20% of the pre‑training phase.
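The key idea behind "multi-version asynchronous with per-sample policy consistency" is that actors keep generating rollouts while the learner publishes new policy versions, and each sample records exactly which version produced it. DORA's real scheduler is far more involved; this toy thread-based sketch only shows that bookkeeping:

```python
# Hedged sketch of multi-version asynchronous rollout. Not DORA itself:
# just the per-sample policy-consistency idea, where every rollout is
# tagged with the policy version that generated it so the trainer can
# account for staleness.
import queue
import threading

rollouts = queue.Queue()
policy_version = 0
lock = threading.Lock()

def actor(n_samples: int) -> None:
    """Generate rollouts continuously, snapshotting the current version."""
    for i in range(n_samples):
        with lock:
            v = policy_version  # version this sample is consistent with
        rollouts.put({"sample": i, "policy_version": v})

def learner(n_updates: int) -> None:
    """Publish new policy versions asynchronously, without pausing actors."""
    global policy_version
    for _ in range(n_updates):
        with lock:
            policy_version += 1

t1 = threading.Thread(target=actor, args=(100,))
t2 = threading.Thread(target=learner, args=(5,))
t1.start(); t2.start(); t1.join(); t2.join()

batch = [rollouts.get() for _ in range(rollouts.qsize())]
# Every sample carries the exact version it was generated under.
assert all(0 <= s["policy_version"] <= 5 for s in batch)
```

In a synchronous framework the actor would block on every learner update; here both run concurrently, which is where the reported speedup over synchronous RL comes from.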
3. Efficient advanced inference framework: A new dual‑path inference architecture lets the model autonomously decide, per query, when to combine agentic reasoning with tool usage (e.g., code executors, APIs). On AIME‑25, the model saves 64.5% of tokens while maintaining 90% accuracy, dramatically improving resource efficiency.
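The selection policy the model actually learns is not public, so the following rule-based stand-in only illustrates the shape of a dual-path router: cheap queries take the direct reasoning path, tool-friendly ones are delegated to an executor, which is where the token savings come from. The `route` and `answer` helpers are hypothetical names:

```python
# Hedged sketch of a dual-path inference router. The real model learns
# this decision; the keyword/regex rule below is an illustrative stand-in.
import re

def route(query: str) -> str:
    """Return which inference path a query would take (illustrative)."""
    if re.search(r"\d+\s*[-+*/^]\s*\d+", query) or "compute" in query.lower():
        return "tool"      # delegate arithmetic to a sandboxed executor
    return "direct"        # plain chain-of-thought reasoning

def answer(query: str) -> str:
    if route(query) == "tool":
        # Stand-in for a real sandbox call: handles simple arithmetic only.
        expr = re.search(r"\d+\s*[-+*/]\s*\d+", query)
        if expr:
            return str(eval(expr.group()))  # toy executor; never eval untrusted input in production
    return "<reasoned answer>"

print(route("Compute 123 * 457"))   # tool
print(answer("Compute 123 * 457"))  # 56211
```

One tool call replaces what would otherwise be a long arithmetic chain-of-thought, which is the intuition behind reducing token spend without giving up accuracy.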
Performance Benchmarks
General reasoning : Achieves 50.3 on the ARC‑AGI benchmark, surpassing closed‑source models such as OpenAI o3 and Gemini‑2.5 Pro.
Mathematics : Sets new records on HMMT and AIME‑related benchmarks, outperforming OpenAI o3.
Code generation : Scores 79.4 on LiveCodeBench, matching GPT‑5 and exceeding other open‑source models; obtains 40.7 on OJBench, comparable to Gemini‑2.5 Pro.
Agentic tool use : Reaches 67.5 on τ2‑Bench‑Airline, a new open‑source SOTA, and ranks highly on SWE‑Bench, BFCL V3, and VitaBench.
Formal reasoning : Obtains a pass@1 of 67.6 on MiniF2F‑test, leading all evaluated models, with strong pass@8 and pass@32 scores.
Practical Evaluation
The model was tested via the LongCat platform (https://longcat.ai/). Users reported:
Very fast response speed.
Balanced answer length—thoughtful when needed, concise otherwise.
Strong mathematical performance consistent with benchmark results.
Occasional output termination on extremely long contexts (cause under investigation).
Sample interactions demonstrated reliable reversal of strings, instruction following without prohibited words, handling of hallucination‑prone prompts, and correct logical deductions.
Tool‑Calling Capabilities
The built‑in Python sandbox lets the model write and execute code in response to prompts. Example prompts and model outputs are shown below.
Test prompt: Count the number of times subsequence t appears in s.
Example:
s = "rabbbit", t = "rabbit" → output 3
Now query:
s = "babggabagbabggbbaaabg", t = "bag"
The model correctly computed the answer (shown in the accompanying screenshots).
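For reference, the distinct-subsequence count being asked for is a classic dynamic-programming problem. A minimal standalone solution (our own sketch, not the model's actual sandbox output):

```python
def count_subsequences(s: str, t: str) -> int:
    """Number of times t occurs in s as a (not necessarily contiguous) subsequence."""
    # dp[j] = ways to form the first j characters of t
    # from the prefix of s processed so far
    dp = [0] * (len(t) + 1)
    dp[0] = 1  # the empty prefix of t can always be formed one way
    for ch in s:
        # iterate j backwards so each character of s is used at most once per match
        for j in range(len(t), 0, -1):
            if t[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[len(t)]

print(count_subsequences("rabbbit", "rabbit"))             # 3
print(count_subsequences("babggabagbabggbbaaabg", "bag"))  # 59
```

This runs in O(len(s) × len(t)) time with O(len(t)) space, matching the given example and giving 59 for the query string.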
Additional tests included word‑ladder transformations and complex algorithmic challenges, all solved correctly.
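The word-ladder task mentioned above is conventionally solved by breadth-first search over single-letter edits; a minimal sketch (the word list and endpoints are illustrative, not the actual test prompt):

```python
# BFS word ladder: shortest chain of valid words changing one letter at a
# time. Illustrative sketch; the blog's actual test prompt is not shown.
from collections import deque
from string import ascii_lowercase

def ladder_length(begin: str, end: str, word_list: list) -> int:
    """Length of the shortest transformation sequence, 0 if none exists."""
    words = set(word_list)
    if end not in words:
        return 0
    q = deque([(begin, 1)])
    while q:
        word, steps = q.popleft()
        if word == end:
            return steps
        for i in range(len(word)):
            for c in ascii_lowercase:
                nxt = word[:i] + c + word[i + 1:]
                if nxt in words:
                    words.remove(nxt)  # mark visited so BFS stays shortest-path
                    q.append((nxt, steps + 1))
    return 0

print(ladder_length("hit", "cog", ["hot", "dot", "dog", "lot", "log", "cog"]))  # 5
```

BFS guarantees the first time the target word is dequeued, the path length is minimal, which is why it is the standard approach for this transformation puzzle.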
Conclusion
LongCat-Flash-Thinking retains the speed of LongCat‑Flash‑Chat while substantially improving reasoning across mathematics, logic, programming, automated theorem proving, and tool usage. Its innovations in RL training, infrastructure, and inference make it a noteworthy open‑source contender in the first‑tier AI landscape.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.