General 365: Meituan LongCat’s Open‑Source Benchmark Redefines LLM Reasoning Evaluation

The General 365 benchmark, built from 365 original seed questions and 1,095 variants across eight reasoning challenges, reveals that most mainstream large language models struggle with everyday logical tasks, achieving at most 62.8% accuracy and requiring far more tokens than on traditional subject‑specific tests.

Meituan Technology Team

May 14, 2026

Large language models (LLMs) now post top scores on competition benchmarks such as AIME and IMO, yet they still stumble on simple common‑sense questions, such as whether to walk or drive to a car wash 50 m away. That gap points to a weakness in current evaluation methods, which tend to reward recall of complex formulas rather than logical reasoning.

Introducing General 365

To address this gap, the Meituan LongCat team released General 365, a benchmark that limits background knowledge to K‑12 level and explicitly separates reasoning ability from domain expertise. It comprises 365 original seed questions and 1,095 expanded variants, covering eight challenge types to avoid repetitive patterns and rote memorization.

High diversity: 365 seed items and 1,095 variants span eight challenge categories.

High challenge: Even state‑of‑the‑art models barely clear the 60 % passing threshold.

Focused reasoning: Knowledge is strictly limited to K‑12, measuring pure logical inference.

Strict human QA: All items undergo manual design, reasoning‑trace verification, and answer validation.

Precise scoring: A hybrid rule‑and‑model scoring method, validated by human sampling, achieves 99.6 % scoring accuracy.
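
The hybrid rule‑and‑model scoring idea can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual scorer: the `normalize` rule, the `judge` callable (a stand‑in for a model‑based grader), and the agreement check against human labels are all assumptions made for the example.

```python
import re
from typing import Callable


def normalize(answer: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop trailing punctuation."""
    text = answer.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".!?")


def hybrid_score(prediction: str, reference: str,
                 judge: Callable[[str, str], bool]) -> bool:
    """Apply the rule check first; fall back to the judge only if it misses."""
    if normalize(prediction) == normalize(reference):
        return True
    # Hypothetical model-based grader for answers the rule cannot match.
    return judge(prediction, reference)


def scoring_accuracy(auto_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of sampled items where the automatic verdict matches the human one."""
    if not auto_labels or len(auto_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    agree = sum(a == h for a, h in zip(auto_labels, human_labels))
    return agree / len(auto_labels)
```

Running the cheap rule first keeps the model judge for the ambiguous cases, and the agreement rate over a human‑labelled sample is one plausible way to arrive at a figure like the reported scoring accuracy.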

Dataset validation

A t‑SNE visualisation shows that General 365’s question embeddings are evenly dispersed, whereas those of BBH and BBEH form dense clusters, which indicates less logical redundancy. A similarity‑score test supports this: Gemini 3 Pro rates the similarity among General 365 questions at an average of 2.16 on a 0–5 scale, far below its ratings for BBH and BBEH, which suggests that models cannot lean on template memorisation.
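
As a rough numeric companion to the dispersion argument, the sketch below computes the mean pairwise cosine similarity of a set of question embeddings. It is a simpler stand‑in, not t‑SNE and not the Gemini‑based similarity rating described above. The toy vectors are invented for illustration, and lower values are read here as "more dispersed".

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length vector has no direction")
    return dot / norm


def mean_pairwise_similarity(embeddings: list[list[float]]) -> float:
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy data (hypothetical): a tight cluster versus a spread-out set.
clustered = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]
dispersed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(mean_pairwise_similarity(clustered) > mean_pairwise_similarity(dispersed))
```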

Comprehensive model evaluation

Twenty‑six mainstream LLMs were evaluated on General 365. Gemini 3 Pro achieved the highest accuracy of 62.8 %, narrowly winning the leaderboard; the majority of models fell between 50 % and 60 %, failing to reach the 60 % passing threshold. Non‑reasoning models performed slightly worse overall, though models such as Qwen 3 Max Instruct still showed notable results.

Category‑wise performance analysis

Breaking down results by the eight categories reveals that "semantic interference" and "optimal strategy" are the biggest performance valleys, each about ten percentage points lower than the overall accuracy. Radar charts illustrate clear capability differences among model families, especially on the "implicit information" task.
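
A per‑category breakdown of this kind can be computed along the following lines. The sketch assumes results arrive as `(category, is_correct)` pairs and that overall accuracy is the item‑level average. Both are assumptions, since the article does not describe the aggregation.

```python
from collections import defaultdict
from typing import Iterable


def category_gaps(results: Iterable[tuple[str, bool]]) -> tuple[float, dict[str, float]]:
    """Return overall accuracy (%) and each category's gap to it in percentage points."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)

    n_items = sum(totals.values())
    if n_items == 0:
        raise ValueError("no results to aggregate")

    overall = 100.0 * sum(correct.values()) / n_items
    # Negative gap: the category sits below the overall accuracy.
    gaps = {c: 100.0 * correct[c] / totals[c] - overall for c in totals}
    return overall, gaps
```

A category such as "semantic interference" sitting about ten points below the overall figure would show up here as a gap near -10.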

Computation cost versus accuracy

Beyond correctness, token consumption was examined. Gemini 3 Pro solved the benchmark with roughly 14 k tokens, whereas models achieving comparable accuracy required 25 k–30 k tokens, highlighting a substantial increase in computational effort for the same performance.
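
The token comparison reduces to a simple ratio. The snippet below uses only the approximate figures quoted above. The function name is illustrative.

```python
def token_overhead(baseline_tokens: float, other_tokens: float) -> float:
    """Return how many times more tokens `other` spends than the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return other_tokens / baseline_tokens


# Approximate figures quoted in the article.
GEMINI_3_PRO_TOKENS = 14_000
COMPARABLE_RANGE = (25_000, 30_000)

low, high = (token_overhead(GEMINI_3_PRO_TOKENS, t) for t in COMPARABLE_RANGE)
print(f"{low:.2f}x to {high:.2f}x")  # prints "1.79x to 2.14x"
```

In other words, models with comparable accuracy spend roughly 1.8 to 2.1 times as many tokens as Gemini 3 Pro.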

Difficulty comparison with existing benchmarks

Accuracy on General 365 drops sharply relative to BBH and BBEH. For example, GPT‑5‑Thinking scores 92.0 % on BBH but only 58.6 % on General 365, a drop of 33.4 percentage points. Average output length also rises significantly on General 365, which suggests that the benchmark’s difficulty stems from deeper logical chains rather than superficial token padding.

Conclusion and community invitation

General 365 establishes a calibrated ruler for genuine general‑reasoning ability, aiming to move LLMs from "test‑taking machines" toward human‑like intelligence. The project is fully open‑source, with paper, GitHub repository, and HuggingFace dataset links provided for researchers and developers to explore the next evolution of model reasoning.
