How Intrinsic Self‑Critique Boosts LLM Planning Accuracy to 89%
Google DeepMind's new "Intrinsic Self‑Critique" method lets large language models iteratively self‑evaluate and rewrite their plans, raising Blocksworld planning accuracy from 49.8% to 89.3% and setting new records across multiple planning benchmarks.
Overview
Google DeepMind introduced a technique called Intrinsic Self‑Critique that enables large language models (LLMs) to improve their planning abilities without any external validator. By repeatedly performing "self‑evaluation + rewrite" cycles, the method lifts Blocksworld planning accuracy from 49.8% to 89.3% and establishes new state‑of‑the‑art results on several planning benchmarks.
Why Self‑Critique Works
Earlier studies (e.g., Valmeekam ’23, Huang ’24) found LLM self‑evaluation to be unreliable and prone to false positives because:
The model never truly performs step‑wise verification of actions.
Without an external oracle, repeated revisions tend to amplify errors.
The new approach addresses these issues with three key components:
Explicit State Tracking: The model must output a "premise‑result" pair for every step.
Failure Memory Pool: Past erroneous plans and their critiques are stored in the prompt to prevent repeat mistakes.
Self‑Consistency Voting: Each generated plan is evaluated five times; the majority vote reduces mis‑judgments.
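To make the voting component concrete, here is a minimal sketch in Python. It is not the authors' code: critique_once is a hypothetical callable that wraps a single LLM critique call and returns a verdict ("valid" or "invalid") plus an error description; the plan is accepted only if the majority of five samples calls it valid.

```python
# Minimal sketch of self-consistency voting over critiques (illustrative only).
# `critique_once` is a hypothetical callable wrapping one LLM critique call;
# it returns a verdict string ("valid" / "invalid") and an error description.
from collections import Counter

def vote_on_plan(critique_once, plan, n_samples=5):
    """Sample the critique n_samples times and accept the majority verdict."""
    verdicts, errors = [], []
    for _ in range(n_samples):
        verdict, error = critique_once(plan)  # e.g. ("invalid", "step 3: block B is not clear")
        verdicts.append(verdict)
        if error:
            errors.append(error)
    majority, _ = Counter(verdicts).most_common(1)[0]
    return majority == "valid", errors
```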
Method Details
Algorithm 1 consists of two prompts:
plan_prompt: A 16‑shot handcrafted example set that describes Blocksworld tasks using PDDL.
critique_prompt: A zero‑shot prompt that supplies only the domain definition and the instruction "please verify each action premise step‑by‑step."
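The sketch below shows how the two prompts, the failure memory pool, and the voting helper from above could be wired into one "self‑evaluation + rewrite" loop. The prompt templates, the llm callable, and the "no errors" stopping heuristic are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative outer loop for the "self-evaluation + rewrite" cycle.
# PLAN_PROMPT / CRITIQUE_PROMPT are stand-ins for the paper's 16-shot and
# zero-shot prompts; `llm` is any text-completion callable (model API of choice).
PLAN_PROMPT = (
    "You are given a Blocksworld task in PDDL.\n"
    "For every action, output a premise/result pair before the next action.\n"
    "Previously rejected plans and their critiques:\n{history}\n"
    "Task: {task}\nPlan:"
)

CRITIQUE_PROMPT = (
    "Domain definition: <PDDL domain>\n"
    "Please verify each action premise step-by-step.\n"
    "Task: {task}\nCandidate plan:\n{plan}\n"
    "End your reply with 'no errors' if every precondition holds."
)

def self_critique_plan(llm, task, max_iters=10):
    failure_memory = []                      # past wrong plans + their critiques
    plan = ""
    for _ in range(max_iters):
        history = "\n\n".join(failure_memory)
        plan = llm(PLAN_PROMPT.format(task=task, history=history))

        def critique_once(p):
            reply = llm(CRITIQUE_PROMPT.format(task=task, plan=p))
            verdict = "valid" if "no errors" in reply.lower() else "invalid"
            return verdict, reply

        accepted, critiques = vote_on_plan(critique_once, plan)  # majority vote (see above)
        if accepted:
            return plan
        failure_memory.append("Rejected plan:\n" + plan + "\nCritiques:\n" + "\n".join(critiques))
    return plan                              # best effort after max_iters
```

Because the failure memory is injected as plain prompt context, nothing is fine‑tuned or retrained; the whole loop runs at inference time.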
Ablation Study: Which Component Matters Most?
Removing each component yields the following accuracy drops:
Without step‑wise verification: accuracy falls to 57.5% (most critical).
Without domain definition: accuracy drops to 74.4% (still partially usable).
Without self‑consistency voting: accuracy falls to 85.5% (≈4 pp loss, the smallest drop).
Cross‑Model Validation
The technique was tested on several LLMs:
GPT‑4o: baseline 42.8% → self‑critique 64.2% (+21.4 pp).
Claude 3.5 Sonnet: baseline 68.0% → self‑critique 89.5% (+21.5 pp).
Gemma‑2 27B: modest gains, indicating limited benefit for smaller models.
Practical Takeaways & Future Directions
Prompt as Plugin: The zero‑shot critique template can be reused in new domains without retraining.
Cost‑Effective: Convergence typically requires only 6–14k tokens (≤10 iterations).
Next Steps:
Integrate self‑evaluation into Monte‑Carlo Tree Search (MCTS) or Tree‑of‑Thoughts for full tree search (a rough sketch follows this list).
Scale experiments to real‑world planning tasks such as travel itineraries or project management.
Research methods that further reduce false positives and approach oracle‑like feedback.
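As a purely hypothetical illustration of the first direction, a self‑critique score could act as the value signal in a beam‑style search over partial plans. propose_next_actions and critique_score are assumed helpers (the former asks the LLM for candidate next actions, the latter runs the zero‑shot critique on a partial plan and returns the fraction of "valid" verdicts, between 0 and 1); neither appears in the paper.

```python
# Hypothetical sketch: using self-critique as the node value in a plan-tree search.
# `propose_next_actions`, `critique_score`, and `goal_reached` are assumed callables,
# not part of the paper; a partial plan is represented as a list of actions.
def tree_search_plan(propose_next_actions, critique_score, goal_reached,
                     max_depth=20, beam=3):
    frontier = [[]]                                  # start from the empty partial plan
    for _ in range(max_depth):
        candidates = []
        for partial in frontier:
            for action in propose_next_actions(partial):
                candidates.append(partial + [action])
        # keep the `beam` partial plans the critic rates highest
        frontier = sorted(candidates, key=critique_score, reverse=True)[:beam]
        for partial in frontier:
            if goal_reached(partial):
                return partial
    return frontier[0] if frontier else []
```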
Conclusion
When an LLM is forced to act as a strict teacher—assigning red crosses to its own mistakes—it can dramatically lower error rates, achieving new planning SOTA and offering a simple yet powerful paradigm for unsupervised self‑improvement.
Who says LLMs can’t plan? They just needed to be taught to "check their homework."
Enhancing LLM Planning Capabilities through Intrinsic Self‑Critique
https://arxiv.org/pdf/2512.24103