How Meeseeks Redefines LLM Instruction-Following Evaluation

Meeseeks, a new benchmark released by Meituan’s M17 team, systematically evaluates large language models’ instruction‑following ability with a three‑tier framework, multi‑round self‑correction, and extensive real‑world data, revealing performance gaps among models such as the OpenAI o‑series, Claude, DeepSeek, and Qwen2.5.

Meituan Technology Team

To address the observed gap between large‑model knowledge reasoning and instruction‑following abilities, Meituan’s M17 team introduced a new evaluation benchmark called Meeseeks, now available on ModelScope, GitHub, and HuggingFace.

1. Meeseeks: Redefining LLM "obedience" evaluation

2. Meeseeks evaluation results

3. Unique advantages of Meeseeks

4. Core evaluation insights

5. Summary and outlook

1. Meeseeks: Redefining LLM "obedience" evaluation

Meeseeks is a benchmark built entirely on real‑business data that focuses on assessing the instruction‑following ability of large models. It evaluates only whether a model strictly follows the user’s prompt, without judging the factual correctness of the answer, and employs a multi‑level framework to capture capabilities at different granularities.

1.1 Fine‑grained three‑tier assessment framework

Typical instruction‑following failures, such as exceeding a word limit or ignoring a constraint, are captured across three layers:

Level 1 – Core intent and structure: Checks intent recognition, overall output structure, and granular content validation.

Level 2 – Specific constraint implementation: Evaluates content constraints (topic, style, language, length) and format constraints (JSON, Markdown, item count).

Level 3 – Fine‑grained rule compliance: Assesses subtle rules such as rhyming, keyword avoidance, repetition bans, symbol usage, and language‑specific conventions.
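To make the tiering concrete, a checker along these lines could tag each rule with its level and report pass/fail per rule. This is a hypothetical sketch, not the actual Meeseeks implementation; `Check` and `run_checks` are illustrative names.

```python
# Hypothetical sketch of a tiered instruction-following checker.
# Names (Check, run_checks) are assumptions, not Meeseeks internals.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    level: int                       # 1 = intent/structure, 2 = constraints, 3 = fine rules
    name: str
    passed: Callable[[str], bool]    # predicate applied to the model's answer

def run_checks(answer: str, checks: list[Check]) -> dict[str, bool]:
    """Return a per-check pass/fail map, keyed by level and rule name."""
    return {f"L{c.level}:{c.name}": c.passed(answer) for c in checks}

checks = [
    Check(2, "max_50_words", lambda a: len(a.split()) <= 50),   # length constraint
    Check(3, "avoid_keyword_AI", lambda a: "AI" not in a),      # keyword-avoidance rule
]
report = run_checks("A short answer about benchmarks.", checks)
```

A real harness would aggregate these per-level results into the tiered scores the benchmark reports.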

2. Meeseeks evaluation results

The benchmark reveals significant differences in instruction‑following and self‑correction abilities across models. RLLMs (reasoning language models) dominate all rounds, while several well‑known LLMs show varied performance.

OpenAI o‑series: o3‑mini (high) and o3‑mini (medium) secure first and second places, demonstrating strong instruction following.

GPT‑4o: Ranks eighth, with low initial accuracy and only modest improvement under multi‑round correction.

Claude series: Claude‑3.7‑Sonnet‑thinking ranks third; the standard Claude‑3.7‑Sonnet leads among general LLMs.

DeepSeek series: Mid‑range performance; DeepSeek‑V3 catches up with DeepSeek‑R1 after multiple rounds.

Qwen2.5 series: The smaller 32B model outperforms the larger 72B version after three rounds.

3. Unique advantages of Meeseeks

3.1 Horizontal comparison: broader, finer, more objective, higher difficulty

Compared with benchmarks like IF‑Eval and ComplexBench, Meeseeks offers:

Wider coverage: Real‑world business scenarios ensure comprehensive evaluation.

Finer granularity: Breaks constraints down (e.g., exact word count, range, multiples) for precise capability profiling.

Objective metrics: Eliminates vague instructions, using only objectively determinable criteria.

Higher difficulty: Challenging test cases amplify performance gaps.
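The word‑count variants mentioned above (exact count, range, multiple‑of) are easy to make objectively checkable. A minimal sketch, with helper names that are assumptions rather than Meeseeks internals:

```python
# Illustrative word-count constraint builders: exact, range, and multiple-of.
# Helper names are hypothetical, not part of the Meeseeks codebase.
def word_count(text: str) -> int:
    return len(text.split())

def exact(n: int):
    return lambda t: word_count(t) == n

def in_range(lo: int, hi: int):
    return lambda t: lo <= word_count(t) <= hi

def multiple_of(k: int):
    return lambda t: word_count(t) % k == 0

answer = "one two three four five six"      # 6 words
assert in_range(5, 10)(answer)              # within [5, 10]
assert multiple_of(3)(answer)               # 6 is a multiple of 3
assert not exact(5)(answer)                 # not exactly 5 words
```

Because each constraint is a pure predicate, results are fully deterministic, which is what makes the "objective metrics" claim possible.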

3.2 Vertical innovation: revolutionary "multi‑round correction" mode

Meeseeks introduces a flexible evaluation that does not force a specific output format, making it compatible with diverse models. In the multi‑round mode, if a model’s first answer violates any instruction, the framework generates explicit feedback and requires a corrected response, thus measuring self‑correction capability for the first time.

More flexible assessment, less affected by model output style.

Automatic feedback loop drives models to refine answers across rounds.
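The feedback loop described above can be sketched as a short driver: check the answer, and if any instruction is violated, feed explicit feedback back for another attempt. This is a minimal illustration, assuming a generic `model` callable and an `evaluate` rubric; neither reflects the actual Meeseeks API.

```python
# Minimal sketch of a multi-round correction loop; `model` and `evaluate`
# below are toy stand-ins for a real LLM call and rubric checks.
def correction_loop(model, evaluate, prompt, max_rounds=3):
    history = [prompt]
    answer = ""
    for round_no in range(1, max_rounds + 1):
        answer = model(history)
        failures = evaluate(answer)          # instructions the answer violated
        if not failures:
            return answer, round_no          # passed all checks this round
        # Append explicit feedback so the next round can self-correct.
        history += [answer,
                    "Violated instructions: " + "; ".join(failures)
                    + ". Please revise your answer."]
    return answer, max_rounds

# Toy demo: a "model" that fixes its answer once it has seen feedback.
state = {"calls": 0}
def toy_model(history):
    state["calls"] += 1
    return "a five word answer here now" if state["calls"] > 1 else "too long"

def toy_evaluate(answer):
    return [] if len(answer.split()) == 6 else ["answer must be exactly 6 words"]

final, rounds_used = correction_loop(toy_model, toy_evaluate, "Write exactly 6 words.")
```

In the toy run, round 1 fails the length check, feedback is appended, and round 2 passes, mirroring the accuracy jumps the benchmark observes between rounds.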

4. Core evaluation insights

Strong self‑correction potential: All models improve accuracy after feedback; e.g., Claude‑3.7‑Sonnet jumps from 0.359 to 0.573 in round 2.

First‑round vs. final performance: Early performance does not fully predict final results; some models recover strongly in later rounds.

RLLMs outperform LLMs in instruction following: Models like o3‑mini show both high initial scores and significant gains.

Effect of multi‑round feedback: The advantage of more powerful reasoning models narrows after several correction rounds, suggesting that iterative feedback can partially substitute for longer reasoning chains.

5. Summary and outlook

Meeseeks demonstrates that precise, multi‑tier instruction‑following evaluation uncovers real shortcomings of top LLMs and validates their strong self‑repair abilities, guiding developers to focus on both basic capabilities and the ability to understand and execute corrective instructions.

Multilingual versions covering eleven languages are nearing completion, and future work will continue to advance high‑quality evaluation research for large models.

Tags: AI, benchmark, instruction following, self-correction, LLM evaluation, Meeseeks
Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
