How Meeseeks Redefines LLM Instruction-Following Evaluation
Meeseeks, a new benchmark released by Meituan’s M17 team, systematically evaluates large language models’ instruction‑following ability with a three‑tier framework, multi‑round self‑correction, and extensive real‑world data, revealing performance gaps among models such as the OpenAI o‑series, Claude, DeepSeek, and Qwen2.5.
To address the observed gap between large‑model knowledge reasoning and instruction‑following abilities, Meituan’s M17 team introduced a new evaluation benchmark called Meeseeks, now available on ModelScope, GitHub, and HuggingFace.
1. Meeseeks: Redefining LLM "obedience" evaluation
2. Meeseeks evaluation results
3. Unique advantages of Meeseeks
4. Core evaluation insights
5. Summary and outlook
1. Meeseeks: Redefining LLM "obedience" evaluation
Meeseeks is a benchmark built entirely on real‑business data that focuses on assessing the Instruction‑Following ability of large models. It uniquely evaluates whether a model strictly follows the user prompt, without judging the factual correctness of the answer, and employs a multi‑level framework to capture capabilities at different granularities.
1.1 Fine‑grained three‑tier assessment framework
Typical instruction‑following failures, such as exceeding a word limit or ignoring a constraint, are captured across three layers (a minimal sketch of such layered checks follows the list):
Level 1 – Core intent and structure: Checks intent recognition, overall output structure, and granular content validation.
Level 2 – Specific constraint implementation: Evaluates content constraints (topic, style, language, length) and format constraints (JSON, Markdown, item count).
Level 3 – Fine‑grained rule compliance: Assesses subtle rules such as rhyming, keyword avoidance, repetition bans, symbol usage, and language‑specific conventions.
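To make the tiering concrete, here is a minimal, hypothetical sketch of how a layered constraint checklist might be represented and scored. The `Constraint` class, the example checks, and `evaluate` are illustrative assumptions, not the actual Meeseeks implementation; see the repository for the real harness.

```python
# Hypothetical sketch of a three-tier constraint checklist;
# names and checks are illustrative, not the Meeseeks codebase.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    level: int                  # 1 = intent/structure, 2 = constraints, 3 = fine-grained rules
    name: str
    check: Callable[[str], bool]

def evaluate(response: str, checklist: list[Constraint]) -> dict[int, float]:
    """Return the fraction of satisfied constraints at each level."""
    per_level: dict[int, list[bool]] = {}
    for c in checklist:
        per_level.setdefault(c.level, []).append(c.check(response))
    return {lvl: sum(results) / len(results) for lvl, results in per_level.items()}

checklist = [
    Constraint(1, "non_empty_answer", lambda r: bool(r.strip())),
    Constraint(2, "under_50_words",   lambda r: len(r.split()) <= 50),
    Constraint(3, "avoids_word_very", lambda r: "very" not in r.lower()),
]
print(evaluate("A short, plain answer.", checklist))  # {1: 1.0, 2: 1.0, 3: 1.0}
```

Scoring each level separately is what allows the benchmark to report a capability profile rather than a single pass/fail verdict.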
2. Meeseeks evaluation results
The benchmark reveals significant differences in instruction‑following and self‑correction ability across models. RLLMs (reasoning large language models) lead in every round, while several well‑known general LLMs show more varied performance.
OpenAI o‑series: o3‑mini (high) and o3‑mini (medium) secure first and second place, demonstrating strong instruction following.
GPT‑4o: Ranks eighth, with low initial accuracy and only modest improvement under multi‑round correction.
Claude series: Claude‑3.7‑Sonnet‑thinking ranks third; the standard Claude‑3.7‑Sonnet leads among general LLMs.
DeepSeek series: Mid‑range performance; DeepSeek‑V3 catches up with DeepSeek‑R1 after multiple rounds.
Qwen2.5 series: The smaller 32B model outperforms the larger 72B version after three rounds.
3. Unique advantages of Meeseeks
3.1 Horizontal comparison: broader, finer, more objective, more difficult
Compared with benchmarks such as IFEval and ComplexBench, Meeseeks offers:
Wider coverage: Real‑world business scenarios ensure comprehensive evaluation.
Finer granularity: Breaks down constraints (e.g., exact word count, word‑count range, multiples of a given count) for precise capability profiling; see the sketch after this list.
Objective metrics: Eliminates vague instructions, using only objectively determinable criteria.
Higher difficulty: Challenging test cases amplify performance gaps between models.
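As one example of that granularity, word‑length requirements decompose into distinct, mechanically checkable predicates. The helper names below are hypothetical and only illustrate the exact/range/multiple distinction the benchmark draws:

```python
# Illustrative word-count checks in the spirit of Meeseeks'
# fine-grained constraint decomposition; function names are hypothetical.
def word_count(text: str) -> int:
    return len(text.split())

def exactly(n: int):
    return lambda t: word_count(t) == n

def within(lo: int, hi: int):
    return lambda t: lo <= word_count(t) <= hi

def multiple_of(k: int):
    return lambda t: word_count(t) > 0 and word_count(t) % k == 0

answer = "one two three four five six"
print(exactly(6)(answer), within(4, 8)(answer), multiple_of(3)(answer))
# True True True
```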
3.2 Vertical innovation: revolutionary "multi‑round correction" mode
Meeseeks introduces a flexible evaluation that does not force a specific output format, making it compatible with diverse models. In the multi‑round mode, if a model’s first answer violates any instruction, the framework generates explicit feedback and requires a corrected response, thus measuring self‑correction capability for the first time.
More flexible assessment, less affected by model output style.
Automatic feedback loop drives models to refine answers across rounds; a minimal sketch of this loop follows.
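Under stated assumptions, the round loop can be pictured as follows. `ask_model` is a placeholder for any chat‑completion client, and `checklist` reuses the hypothetical `Constraint` objects from the Section 1.1 sketch; this illustrates the feedback mechanism, not the framework’s actual code.

```python
# Minimal sketch of a multi-round correction loop (hypothetical, not
# the actual Meeseeks framework code). `ask_model(messages)` is any
# function mapping a chat history to the model's next reply string.
def run_multi_round(prompt, checklist, ask_model, max_rounds=3):
    messages = [{"role": "user", "content": prompt}]
    answer = ""
    for round_idx in range(1, max_rounds + 1):
        answer = ask_model(messages)
        failed = [c.name for c in checklist if not c.check(answer)]
        if not failed:                       # every constraint satisfied
            return answer, round_idx
        # Turn rule-level failures into explicit corrective feedback.
        feedback = ("Your answer violated these requirements: "
                    + ", ".join(failed) + ". Please output a corrected answer.")
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": feedback})
    return answer, max_rounds                # best effort after final round
```

A stub such as `ask_model = lambda messages: "A short, plain answer."` is enough to exercise the loop end to end.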
4. Core evaluation insights
Strong self‑correction potential: All models improve accuracy after feedback; for example, Claude‑3.7‑Sonnet jumps from 0.359 to 0.573 in round 2.
First‑round vs. final performance: Early performance does not fully predict final results; some models recover strongly in later rounds.
RLLMs outperform LLMs in instruction following: Models like o3‑mini show both high initial scores and significant gains.
Effect of multi‑round feedback: The advantage of more powerful reasoning models narrows over several correction rounds, indicating that external feedback can partially substitute for long internal reasoning chains.
5. Summary and outlook
Meeseeks demonstrates that precise, multi‑tier instruction‑following evaluation uncovers real shortcomings of top LLMs and validates their strong self‑repair abilities, guiding developers to focus on both basic capabilities and the ability to understand and execute corrective instructions.
Multilingual versions covering eleven languages are nearing completion, and future work will continue to advance high‑quality evaluation research for large models.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle‑services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.