How LongCat-Flash-Thinking Sets New SOTA in Open‑Source AI Inference
LongCat-Flash-Thinking, the latest open‑source model from Meituan, introduces domain‑parallel RL training, a high‑throughput RL infrastructure called DORA, and a dual‑path inference framework that together achieve state‑of‑the‑art performance on logical, mathematical, coding, and agentic tasks while maintaining top‑tier speed.
Introduction
Meituan has officially released LongCat-Flash-Thinking, the successor to LongCat-Flash. Building on the previous model’s algorithmic and infrastructure co‑design, the new version further improves inference capability and efficiency, reaching the leading open‑source performance on a wide range of reasoning tasks, with some results approaching closed‑source GPT‑5‑Thinking.
Key Technical Innovations
1. Domain‑parallel RL training and fusion: To address instability in traditional mixed‑domain RL training, a domain‑parallel scheme decouples optimization for STEM, coding, and agentic tasks. This stabilizes training and enables fusion of domain‑expert models into a single, near‑Pareto‑optimal model that performs well across all specialties.
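The fusion step can be pictured as merging per-domain expert checkpoints into one set of parameters. Meituan has not published the exact recipe here, so the sketch below assumes the simplest variant, weighted parameter averaging; the fusion weights and toy "models" are illustrative:

```python
# Hedged sketch: fusing domain-expert checkpoints by weighted parameter
# averaging. The actual LongCat fusion recipe is not public; the weights
# and toy parameter dicts below are illustrative assumptions only.
from typing import Dict, List

def fuse_experts(experts: List[Dict[str, float]],
                 weights: List[float]) -> Dict[str, float]:
    """Average each named parameter across domain-expert models."""
    assert abs(sum(weights) - 1.0) < 1e-9, "fusion weights should sum to 1"
    fused = {}
    for name in experts[0]:
        fused[name] = sum(w * e[name] for w, e in zip(weights, experts))
    return fused

# Toy 'experts': one scalar per parameter name instead of real tensors.
stem   = {"layer.w": 1.0, "layer.b": 0.0}
coding = {"layer.w": 3.0, "layer.b": 0.3}
agent  = {"layer.w": 2.0, "layer.b": 0.6}

merged = fuse_experts([stem, coding, agent], [0.4, 0.3, 0.3])
print(merged["layer.w"])  # 0.4*1.0 + 0.3*3.0 + 0.3*2.0 = 1.9
```

With real models the same loop would run over weight tensors rather than scalars, and the fusion weights would be tuned so no single domain dominates.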
2. Industrial‑grade RL infrastructure (DORA): Named DORA (Dynamic Orchestration for Asynchronous Rollout), the system uses elastic colocation scheduling and a multi‑version asynchronous pipeline. It delivers up to three‑fold speedup over synchronous RL frameworks while preserving per‑sample policy consistency, and supports KV‑cache reuse for clusters with tens of thousands of cards. During large‑scale asynchronous RL, FLOPs consumption is only about 20% of the pre‑training phase.
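The key idea behind "multi-version asynchronous with per-sample policy consistency" is that actors keep generating rollouts while the learner publishes new policy versions, and each sample records exactly which version produced it. DORA's real scheduler is far more involved; this toy thread-based sketch only shows that bookkeeping:

```python
# Hedged sketch of multi-version asynchronous rollout. Not DORA itself:
# just the per-sample policy-consistency idea, where every rollout is
# tagged with the policy version that generated it so the trainer can
# account for staleness.
import queue
import threading

rollouts = queue.Queue()
policy_version = 0
lock = threading.Lock()

def actor(n_samples: int) -> None:
    """Generate rollouts continuously, snapshotting the current version."""
    for i in range(n_samples):
        with lock:
            v = policy_version  # version this sample is consistent with
        rollouts.put({"sample": i, "policy_version": v})

def learner(n_updates: int) -> None:
    """Publish new policy versions asynchronously, without pausing actors."""
    global policy_version
    for _ in range(n_updates):
        with lock:
            policy_version += 1

t1 = threading.Thread(target=actor, args=(100,))
t2 = threading.Thread(target=learner, args=(5,))
t1.start(); t2.start(); t1.join(); t2.join()

batch = [rollouts.get() for _ in range(rollouts.qsize())]
# Every sample carries the exact version it was generated under.
assert all(0 <= s["policy_version"] <= 5 for s in batch)
```

In a synchronous framework the actor would block on every learner update; here both run concurrently, which is where the reported speedup over synchronous RL comes from.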
3. Efficient advanced inference framework: A new dual‑path inference architecture lets the model autonomously decide, per query, when to combine agentic reasoning with tool usage (e.g., code executors, APIs). On AIME‑25, the model saves 64.5% of tokens while maintaining 90% accuracy, dramatically improving resource efficiency.
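The selection policy the model actually learns is not public, so the following rule-based stand-in only illustrates the shape of a dual-path router: cheap queries take the direct reasoning path, tool-friendly ones are delegated to an executor, which is where the token savings come from. The `route` and `answer` helpers are hypothetical names:

```python
# Hedged sketch of a dual-path inference router. The real model learns
# this decision; the keyword/regex rule below is an illustrative stand-in.
import re

def route(query: str) -> str:
    """Return which inference path a query would take (illustrative)."""
    if re.search(r"\d+\s*[-+*/^]\s*\d+", query) or "compute" in query.lower():
        return "tool"      # delegate arithmetic to a sandboxed executor
    return "direct"        # plain chain-of-thought reasoning

def answer(query: str) -> str:
    if route(query) == "tool":
        # Stand-in for a real sandbox call: handles simple arithmetic only.
        expr = re.search(r"\d+\s*[-+*/]\s*\d+", query)
        if expr:
            return str(eval(expr.group()))  # toy executor; never eval untrusted input in production
    return "<reasoned answer>"

print(route("Compute 123 * 457"))   # tool
print(answer("Compute 123 * 457"))  # 56211
```

One tool call replaces what would otherwise be a long arithmetic chain-of-thought, which is the intuition behind reducing token spend without giving up accuracy.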
Performance Benchmarks
General reasoning : Achieves 50.3 on the ARC‑AGI benchmark, surpassing closed‑source models such as OpenAI o3 and Gemini‑2.5 Pro.
Mathematics : Sets new records on HMMT and AIME‑related benchmarks, outperforming OpenAI o3.
Code generation : Scores 79.4 on LiveCodeBench, matching GPT‑5 and exceeding other open‑source models; obtains 40.7 on OJBench, comparable to Gemini‑2.5 Pro.
Agentic tool use : Reaches 67.5 on τ2‑Bench‑Airline, a new open‑source SOTA, and ranks highly on SWE‑Bench, BFCL V3, and VitaBench.
Formal reasoning : Obtains a pass@1 of 67.6 on MiniF2F‑test, leading all evaluated models, with strong pass@8 and pass@32 scores.
Practical Evaluation
The model was tested via the LongCat platform (https://longcat.ai/). Users reported:
Very fast response speed.
Balanced answer length—thoughtful when needed, concise otherwise.
Strong mathematical performance consistent with benchmark results.
Occasional output termination on extremely long contexts (cause under investigation).
Sample interactions demonstrated reliable reversal of strings, instruction following without prohibited words, handling of hallucination‑prone prompts, and correct logical deductions.
Tool‑Calling Capabilities
The built‑in Python sandbox lets the model write and execute code in response to prompts. Example prompts and model outputs are shown below.
Test prompt: Count the number of times subsequence t appears in s.
Example:
s = "rabbbit", t = "rabbit" → output 3
Now query:
s = "babggabagbabggbbaaabg", t = "bag"
The model correctly computed the answer (shown in the accompanying screenshots).
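For reference, the distinct-subsequence count being asked for is a classic dynamic-programming problem. A minimal standalone solution (our own sketch, not the model's actual sandbox output):

```python
def count_subsequences(s: str, t: str) -> int:
    """Number of times t occurs in s as a (not necessarily contiguous) subsequence."""
    # dp[j] = ways to form the first j characters of t
    # from the prefix of s processed so far
    dp = [0] * (len(t) + 1)
    dp[0] = 1  # the empty prefix of t can always be formed one way
    for ch in s:
        # iterate j backwards so each character of s is used at most once per match
        for j in range(len(t), 0, -1):
            if t[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[len(t)]

print(count_subsequences("rabbbit", "rabbit"))             # 3
print(count_subsequences("babggabagbabggbbaaabg", "bag"))  # 59
```

This runs in O(len(s) × len(t)) time with O(len(t)) space, matching the given example and giving 59 for the query string.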
Additional tests included word‑ladder transformations and complex algorithmic challenges, all solved correctly.
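The word-ladder task mentioned above is conventionally solved by breadth-first search over single-letter edits; a minimal sketch (the word list and endpoints are illustrative, not the actual test prompt):

```python
# BFS word ladder: shortest chain of valid words changing one letter at a
# time. Illustrative sketch; the blog's actual test prompt is not shown.
from collections import deque
from string import ascii_lowercase

def ladder_length(begin: str, end: str, word_list: list) -> int:
    """Length of the shortest transformation sequence, 0 if none exists."""
    words = set(word_list)
    if end not in words:
        return 0
    q = deque([(begin, 1)])
    while q:
        word, steps = q.popleft()
        if word == end:
            return steps
        for i in range(len(word)):
            for c in ascii_lowercase:
                nxt = word[:i] + c + word[i + 1:]
                if nxt in words:
                    words.remove(nxt)  # mark visited so BFS stays shortest-path
                    q.append((nxt, steps + 1))
    return 0

print(ladder_length("hit", "cog", ["hot", "dot", "dog", "lot", "log", "cog"]))  # 5
```

BFS guarantees the first time the target word is dequeued, the path length is minimal, which is why it is the standard approach for this transformation puzzle.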
Conclusion
LongCat-Flash-Thinking retains the speed of LongCat‑Flash‑Chat while substantially improving reasoning across mathematics, logic, programming, automated theorem proving, and tool usage. Its innovations in RL training, infrastructure, and inference make it a noteworthy open‑source contender in the first‑tier AI landscape.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.