How a Rubric‑Driven Agent Achieves More Stable Outputs

The article explains why vague expectations cause unstable Agent results, introduces Rubric as a concrete, pre‑written scoring standard for Generator‑Critic workflows, details how to design clear Yes/No criteria, organize them into Must/Should/Nice‑to‑have layers, and iteratively refine the Rubric for reliable AI output.

FunTester

May 17, 2026

In the previous discussion of the Generator + Critic pattern, many readers asked how the Critic knows what to evaluate. The answer is a Rubric: a pre-written scoring standard that tells the Agent exactly which acceptance criteria to apply.

A Rubric is not a vague quality requirement; it is a concrete set of rules that the Agent can use to generate, check, and rework its output. The clearer the Rubric, the lower the collaboration cost between Generator and Critic, while an ambiguous Rubric makes automatic fixes feel like gambling.

Giving an Agent a fuzzy task yields a fuzzy result. After adding a Critic, many still use vague standards for review, such as “check whether test cases are complete and of high quality.” Phrases like “complete” or “high quality” lack clear boundaries, leaving the Critic to score based on feeling rather than a solid rule.

The key is to turn abstract terms into verifiable conditions. Each Rubric item should be answerable with a simple Yes/No, independent of subjective judgment.

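Scoring criteria:
1. Does the set include a positive case? (at least 1)
2. Does the set include a negative case? (at least 1)
3. Does the set include a boundary-value case? (at least 1)
4. Does every case state an explicit expected result?
5. Are the cases free of duplicated test logic?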

With such explicit criteria, the Critic can check each line item without guessing the author’s intent.
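The five criteria above can be sketched as plain boolean checks. This is a minimal illustration, not the article's implementation: the `Case` structure, the `kind` labels, and the `check_rubric` function are hypothetical names, and comparing normalised step text is only a crude stand-in for detecting duplicated test logic.

```python
from dataclasses import dataclass


@dataclass
class Case:
    name: str
    kind: str      # hypothetical label: "positive", "negative", or "boundary"
    steps: str
    expected: str


def check_rubric(cases):
    """Return {item number: True/False}; True means the item is satisfied."""
    kinds = {c.kind for c in cases}
    # Crude duplicate detection: identical steps after normalising case/whitespace.
    steps = [" ".join(c.steps.lower().split()) for c in cases]
    return {
        1: "positive" in kinds,
        2: "negative" in kinds,
        3: "boundary" in kinds,
        4: bool(cases) and all(c.expected.strip() for c in cases),
        5: len(set(steps)) == len(steps),
    }


cases = [
    Case("valid login", "positive", "Submit a correct password", "User lands on home page"),
    Case("wrong password", "negative", "Submit an incorrect password", "Error message is shown"),
]
failed = [item for item, ok in check_rubric(cases).items() if not ok]
print(failed)  # [3]: no boundary-value case
```

Because every item reduces to True or False, the list of failed item numbers can be handed to the Generator directly, with no interpretation step in between.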

Abstract words are decomposed into concrete checks. “Complete” becomes checks for positive, negative, boundary, and exception cases plus a duplicate check; “high quality” becomes clear expected results, executable steps, reusable data, and observable assertions. Once every item can be inspected, the Critic’s feedback turns from subjective evaluation into structured acceptance.

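Rubric items can also be layered by priority into Must, Should, and Nice-to-have:

┌──────────────────────────────────────────────────────┐
│  Must                                                │
│  Unmet: sent back immediately, does not advance      │
├──────────────────────────────────────────────────────┤
│  Should                                              │
│  Unmet: loses points, but need not trigger a retry   │
├──────────────────────────────────────────────────────┤
│  Nice to have                                        │
│  Met: a bonus; unmet: does not affect passing        │
└──────────────────────────────────────────────────────┘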

In a test‑case generation workflow, Must items might be coverage of key business paths, explicit assertions, and inclusion of boundary values; Should items could be clear data naming, no duplicate logic, and maintainable steps; Nice‑to‑have items might add risk notes, priority tags, or automation suggestions. This three‑tier design lets the Critic give meaningful feedback rather than a simple pass/fail.
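One way to turn the three tiers into a single decision is sketched below. The `ItemResult` structure, the `verdict` function, the 10-point penalty, and the 70-point pass threshold are all assumptions chosen for illustration; the article only specifies that a failed Must item rejects the output, a failed Should item costs points, and Nice-to-have items never block a pass.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    item: str
    tier: str      # "must", "should", or "nice"
    passed: bool


def verdict(results, should_penalty=10, pass_score=70):
    """Combine per-item results into one decision (thresholds are illustrative)."""
    failed_must = [r.item for r in results if r.tier == "must" and not r.passed]
    if failed_must:
        # Any failed Must item rejects the output outright.
        return {"decision": "reject", "score": 0, "fix": failed_must, "bonus": []}
    failed_should = [r.item for r in results if r.tier == "should" and not r.passed]
    score = 100 - should_penalty * len(failed_should)
    bonus = [r.item for r in results if r.tier == "nice" and r.passed]
    # Failed Should items cost points; a retry happens only below the threshold.
    decision = "pass" if score >= pass_score else "retry"
    return {"decision": decision, "score": score, "fix": failed_should, "bonus": bonus}


results = [
    ItemResult("covers key business paths", "must", True),
    ItemResult("explicit assertions", "must", True),
    ItemResult("clear data naming", "should", False),
    ItemResult("priority tags", "nice", False),
]
print(verdict(results))
```

In this example the output passes with a reduced score, and the `fix` list still tells the Generator which Should item to improve if another round is run.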

Feedback must point to specific failed items. A generic response like “test case failed, please regenerate” gives the Generator nothing to act on, so it often reproduces the same error. Instead, the Critic should say, for example, “Item 3 not satisfied: missing boundary-value case. Item 5 not satisfied: test logic of case 2 duplicates case 4.” This tells the Generator exactly what to fix.

Effective feedback should contain three pieces of information: the failed item, evidence of failure, and the repair scope, so the Generator can modify only the relevant part without breaking already‑passing sections.
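The three pieces of information can be carried in a small structure and rendered into the message sent back to the Generator. This is a sketch under assumptions: the `Failure` fields and the `render_feedback` wording are hypothetical, not a format the article prescribes.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    item: int            # number of the failed Rubric item
    evidence: str        # what the Critic observed
    repair_scope: str    # the only part the Generator should change


def render_feedback(failures):
    """Build a feedback message that names each failure and limits the repair."""
    if not failures:
        return "All rubric items satisfied."
    lines = [
        f"Item {f.item} not satisfied: {f.evidence}. Fix only: {f.repair_scope}."
        for f in failures
    ]
    # Protect the sections that already pass.
    lines.append("Leave all other parts unchanged.")
    return "\n".join(lines)


feedback = render_feedback([
    Failure(3, "missing boundary-value case", "add one boundary-value case"),
    Failure(5, "test logic of case 2 duplicates case 4", "rewrite or remove case 4"),
])
print(feedback)
```

Stating the repair scope explicitly is what lets the Generator patch one case instead of regenerating the whole set.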

Rubrics also need iteration. The first version is rarely complete; common problems are missing judgment dimensions or overly loose items that let bad output slip through. After the Agent runs for a while, collect cases where the Critic passed but human reviewers disagreed, trace each one to a missing or overly loose rule, and add or tighten rules accordingly. The goal is not a longer Rubric but a more precise one: each rule should block a real failure, and a rule that over-blocks should be relaxed.
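A review log makes this loop concrete. The sketch below assumes a hypothetical log format (one dict per reviewed output with `critic_passed`, `human_passed`, and a free-text `note` from the human reviewer) and simply tallies the disagreements in both directions; how notes are written and grouped is left open.

```python
from collections import Counter


def rubric_gaps(review_log):
    """Split Critic/human disagreements into rules to add and rules to relax."""
    # Critic passed, human rejected: a rule is missing or too loose.
    missed = Counter(
        r["note"] for r in review_log if r["critic_passed"] and not r["human_passed"]
    )
    # Critic rejected, human accepted: a rule is over-blocking.
    over_blocked = Counter(
        r["note"] for r in review_log if not r["critic_passed"] and r["human_passed"]
    )
    return missed.most_common(), over_blocked.most_common()


log = [
    {"critic_passed": True, "human_passed": False, "note": "no assertion on error text"},
    {"critic_passed": True, "human_passed": False, "note": "no assertion on error text"},
    {"critic_passed": True, "human_passed": True, "note": ""},
    {"critic_passed": False, "human_passed": True, "note": "naming rule too strict"},
]
to_add, to_relax = rubric_gaps(log)
print(to_add)
print(to_relax)
```

Sorting by frequency points at the rule change most likely to block a recurring failure, which keeps the Rubric precise rather than merely longer.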

In summary, a Rubric is a checklist that can be verified line by line. Its quality directly determines whether the Generator + Critic pattern can truly work. To achieve stable Agent output, provide both the goal and a clear acceptance standard, let the Critic point out concrete failures, and continuously refine the Rubric based on real‑world feedback.
