AMO‑Bench: A New High‑Difficulty, Original Math Reasoning Benchmark for LLMs
AMO‑Bench, released by Meituan's LongCat team, is a 50‑question, IMO‑level math reasoning benchmark that pairs original, high‑difficulty problems with automated scoring. It exposes the current limits of top large language models, whose best accuracy reaches only 52.4%, and offers a more discriminative evaluation tool for future model improvements.
Large language models (LLMs) are praised for their emergent reasoning abilities, yet existing math reasoning benchmarks such as AIME24/25 have become saturated: top models achieve >90% accuracy, which reduces the tests' discriminative power and raises data‑leakage concerns.
Why a New Benchmark?
To address these issues, Meituan’s LongCat team introduced AMO‑Bench, a 50‑question suite of competition‑level problems created by expert mathematicians. The questions are designed to match or exceed IMO difficulty, and current SOTA models still fail to reach a passing score, with the best accuracy at 52.4%.
Construction Pipeline
AMO‑Bench follows a full‑cycle workflow: Data Creation → Quality Review → Originality Review → Difficulty Review.
Data Creation: Experts with Olympiad experience write each problem and provide both the final answer and a step‑by‑step solution.
Quality Review: A triple‑blind review checks that statements are unambiguous and that the required knowledge stays within core Olympiad topics (algebra, geometry, number theory, combinatorics).
Originality Review: n‑gram matching and web search compare each item against existing datasets (AIME, HMMT, etc.), and experts manually verify that no candidate is highly similar to a known problem, guarding against data leakage (a rough sketch of such a screen follows this list).
Difficulty Review: Two‑stage model screening (at least two models must fail three independent runs each) plus third‑party expert re‑evaluation ensures difficulty at or above the IMO level.
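As a rough illustration of the originality screen, the sketch below implements a generic word‑level n‑gram overlap check. The 5‑gram size, the 0.3 threshold, and all function names are illustrative assumptions, not details from the paper.

```python
from collections import Counter


def ngrams(text: str, n: int = 5) -> Counter:
    """Multiset of word-level n-grams in a problem statement."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate: str, reference: str, n: int = 5) -> float:
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand:
        return 0.0
    return sum((cand & ref).values()) / sum(cand.values())


SIMILARITY_THRESHOLD = 0.3  # illustrative value, not taken from the paper


def flag_for_manual_review(candidate: str, corpus: list[str]) -> bool:
    """Flag a candidate whose overlap with any known item (AIME, HMMT, ...)
    is high enough to warrant expert inspection."""
    return any(ngram_overlap(candidate, ref) >= SIMILARITY_THRESHOLD
               for ref in corpus)
```

An automatic screen like this can only narrow the field; items that pass it (and a web search) still go through the manual expert verification described above.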
Dataset Characteristics
The 50 items cover five core Olympiad domains: algebra (22%), functions & sequences (26%), geometry (10%), number theory (18%), and combinatorics (24%). Answer lengths are substantially longer than those in traditional benchmarks, forcing models to generate extended logical chains.
Scoring Methodology
Four answer types are defined (numeric, set, expression, and descriptive), each routed to a tailored automatic evaluator:
Numeric/set/expression answers (39 items): parser‑based verification using the Math‑Verify tool; a minimal usage sketch follows this list.
Descriptive answers (11 items): LLM‑based scoring with o4‑mini as the judge, sampled five times per answer with majority voting; an aggregation sketch appears below.
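For the parser‑based path, here is a minimal sketch using the open‑source Math‑Verify library's quick‑start API (parse/verify). The sample answers are illustrative; the exact prompts and answer‑extraction steps used by AMO‑Bench are not specified here.

```python
# pip install math-verify
from math_verify import parse, verify

# Parse the author's gold answer and the model's extracted final answer,
# then check mathematical equivalence rather than string equality.
gold = parse("${1,3} \\cup {2,4}$")
pred = parse("${1,2,3,4}$")

print(verify(gold, pred))  # True: the two set expressions are equivalent
```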
Human validation of 1,000 randomly sampled answers confirmed 99.2% accuracy for the automated pipeline.
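For the descriptive items, the majority‑vote aggregation might look like the sketch below. Here judge_once is a hypothetical placeholder for a single o4‑mini grading call; the prompt and API wiring are assumptions.

```python
from collections import Counter
from typing import Callable


def majority_vote_grade(
    question: str,
    reference: str,
    model_answer: str,
    judge_once: Callable[[str, str, str], bool],  # hypothetical single LLM-judge call
    n_samples: int = 5,
) -> bool:
    """Sample the LLM judge n_samples times and return the majority verdict."""
    votes = Counter(
        judge_once(question, reference, model_answer) for _ in range(n_samples)
    )
    return votes[True] > votes[False]
```

With five samples the vote can never tie, which is presumably why an odd sample count is used.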
Evaluation Results
Twenty‑six leading LLMs (both open‑source and closed‑source, reasoning‑enabled and not) were evaluated. Key findings:
Closed‑source models lead but still fall short: GPT‑5‑Thinking (High) tops the leaderboard at 52.4% accuracy, and most models score below 40%.
Open‑source models are catching up: Qwen3‑235B‑A22B‑Thinking‑2507 reaches 47.8%, DeepSeek‑V3.1‑Thinking 47.6%.
Higher‑scoring models output far more tokens (average >35K), indicating that longer reasoning chains improve performance.
Test‑time scaling remains an effective lever: models that produce more tokens achieve higher accuracy, and newer model generations achieve better scores with fewer tokens.
Pass@32 exceeds 70% for top models, revealing latent problem‑solving potential (the standard pass@k estimator is sketched below).
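Pass@32 here presumably follows the standard unbiased pass@k estimator of Chen et al. (2021): draw n samples per problem, count c correct ones, compute 1 - C(n-c, k)/C(n, k), and average over problems. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem.
    n: samples drawn, c: correct samples among them."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Benchmark-level pass@32: average the per-problem estimates.
correct_counts = [5, 0, 12, 32, 1]  # illustrative correct counts out of n=32
print(sum(pass_at_k(32, c, 32) for c in correct_counts) / len(correct_counts))
```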
Conclusions and Future Work
AMO‑Bench offers superior discriminative power compared with AIME24/25 and other math benchmarks, mitigates data‑leak risks through original problem design, and ensures automated evaluation reliability with 99.2% scoring accuracy. The LongCat team will continuously expand the benchmark, add new problem types, and explore high‑difficulty assessments for both general and domain‑specific reasoning.
Resources (open‑source):
Project homepage:
GitHub repository:
Hugging Face dataset: