Defining a Good Answer in the Agent Era: A Rubrics Survey

This survey examines how rubrics—structured, multi‑dimensional evaluation criteria—are defined, constructed, and applied to train and evaluate large language models, especially for open‑ended, high‑risk and agentic tasks, while highlighting current challenges such as reward hacking and bias.

Data Party THU

As large language models (LLMs) are applied to open‑ended tasks such as deep research, medical or legal consulting, multimodal generation, and long‑horizon agents, a single correct answer or a simple verification signal is often unavailable. Evaluating output quality therefore requires multi‑dimensional, interpretable standards.

Rubrics for LLMs

A rubric set consists of several rubric items. Each item contains a natural-language description of a concrete quality dimension (e.g., factual correctness, coverage, evidence support, reasoning rigor, safety, format compliance, usability) and an importance weight. For a given input-output pair, a judge model scores every item; the scores are then combined by averaging, by a weighted sum, or by some implicit aggregation to produce an overall evaluation.
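As a minimal sketch of this definition, the snippet below represents a rubric as a list of weighted items and combines per-item judge scores with a weighted average. The names `RubricItem`, `Judge`, `score_response`, and `toy_judge` are illustrative assumptions, not identifiers from the survey, and the toy judge merely stands in for a real LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RubricItem:
    description: str  # natural-language description of one quality dimension
    weight: float     # importance weight (assumed non-negative)


# Hypothetical judge signature: (prompt, response, item) -> score in [0, 1].
Judge = Callable[[str, str, RubricItem], float]


def score_response(prompt: str, response: str,
                   rubric: Sequence[RubricItem], judge: Judge) -> float:
    """Weighted average of the judge's per-item scores."""
    total_weight = sum(item.weight for item in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = sum(item.weight * judge(prompt, response, item) for item in rubric)
    return weighted / total_weight


def toy_judge(prompt: str, response: str, item: RubricItem) -> float:
    # Stand-in for an LLM judge: returns fixed scores keyed by the description.
    fixed_scores = {
        "factual correctness": 1.0,
        "evidence support": 0.5,
        "format compliance": 0.0,
    }
    return fixed_scores[item.description]


rubric = [
    RubricItem("factual correctness", weight=3.0),
    RubricItem("evidence support", weight=2.0),
    RubricItem("format compliance", weight=1.0),
]
overall = score_response("example question", "example answer", rubric, toy_judge)
print(round(overall, 3))  # (3*1.0 + 2*0.5 + 1*0.0) / 6 -> 0.667
```

Plain averaging is the special case in which every weight is equal.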

Rubrics differ from related concepts: LLM‑as‑a‑Judge determines *who* evaluates, while rubrics define *what* criteria to use; a traditional reward model outputs a single scalar, whereas rubrics make the criteria explicit; RLVR relies on automatically verifiable answers, while rubrics are suited for tasks that cannot be fully verified.

Construction Paradigms

The survey classifies rubric construction into four paradigms (see figure "Rubrics construction paradigms"):

Direct generation: a powerful LLM generates a set of evaluation criteria given the task instruction, candidate answer, or reference evidence.

Contrast generation: the model receives a high-quality and a low-quality answer and extracts discriminative criteria by comparing them.

Iterative optimization: researchers verify, decompose, and filter criteria in multiple rounds, removing overly broad or redundant items to obtain a more atomic and compact rubric set.

Online co-evolution: for reinforcement-learning and agent tasks, rubrics evolve together with policy rollouts, incorporating newly observed error behaviours as evaluation standards.
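To make the filtering step of iterative optimization concrete, here is a rough sketch of one filtering round. It is an assumption-laden stand-in: the survey does not prescribe a filtering algorithm, and the token-overlap (Jaccard) test and the minimum-length test below are simple heuristics chosen for illustration. The names `filter_rubric`, `max_similarity`, and `min_tokens` are hypothetical.

```python
from __future__ import annotations

import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a criterion."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two criteria, in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_rubric(candidates: list[str],
                  max_similarity: float = 0.6,
                  min_tokens: int = 5) -> list[str]:
    """One filtering round: drop very short (assumed too broad) and near-duplicate criteria."""
    kept: list[str] = []
    for criterion in candidates:
        if len(_tokens(criterion)) < min_tokens:
            continue  # heuristic: too short to be a concrete, checkable standard
        if any(jaccard(criterion, earlier) > max_similarity for earlier in kept):
            continue  # heuristic: redundant with a criterion already kept
        kept.append(criterion)
    return kept


candidates = [
    "The answer is good.",
    "Every numerical claim cites a source document.",
    "Every numerical claim cites a source document or URL.",
    "The response states dosage limits for the recommended drug.",
]
kept = filter_rubric(candidates)
for criterion in kept:
    print(criterion)
```

In practice the redundancy check would more plausibly use a judge model or embeddings; the point here is only the shape of the loop, which keeps a criterion when it is both specific and not already covered.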

Rubrics in Policy Model Training

In reinforcement learning (e.g., PPO, GRPO), the judge model scores each rubric item for the generated answer or for the entire trajectory, and the aggregated score becomes the reward signal. Applying rubrics at the trajectory level is crucial for tool-calling agents and multimodal reasoning, where many errors are not reflected in the final answer. The workflow is illustrated in the figure "Rubrics in policy training".
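The sketch below shows how aggregated rubric scores could feed a GRPO-style update: each rollout in a group receives a rubric reward, and rewards are normalized within the group to obtain advantages. The per-item scores are assumed to come from a judge that has seen the whole trajectory. The function names are illustrative, and the population standard deviation is one choice among several (implementations differ on this detail).

```python
from statistics import mean, pstdev


def rubric_reward(item_scores: list[float], weights: list[float]) -> float:
    """Weighted average of per-item judge scores for one rollout."""
    return sum(w * s for w, s in zip(weights, item_scores)) / sum(weights)


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights for three rubric items, e.g. correct tool use,
# evidence support, and final-answer correctness.
weights = [2.0, 1.0, 1.0]

# Per-item scores for three rollouts sampled from the same prompt.
group_scores = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

rewards = [rubric_reward(scores, weights) for scores in group_scores]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so the policy is pushed toward behaviour the rubric rewards.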

Simple weighted aggregation can be too coarse; some dimensions (e.g., safety in medical QA) act as veto conditions where any violation forces the reward to zero. Advanced designs therefore include learnable rubric weights, veto or saturation mechanisms, curriculum based on difficulty, and integration of environment feedback.
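A veto condition and a saturation cap can be layered on top of the weighted sum, as in the following sketch. The exact semantics are assumptions made for illustration: a veto item scoring below `veto_threshold` forces the reward to zero, and an item with `cap` below 1.0 earns full credit once its score reaches the cap, so pushing that dimension further yields no extra reward. `ScoredItem` and `aggregate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredItem:
    name: str
    weight: float
    score: float           # judge score in [0, 1]
    is_veto: bool = False   # a violated veto item zeroes the whole reward
    cap: float = 1.0        # saturation point in (0, 1]; credit stops growing here


def aggregate(items: list[ScoredItem], veto_threshold: float = 0.5) -> float:
    """Weighted average with veto and saturation (illustrative semantics)."""
    for item in items:
        if item.is_veto and item.score < veto_threshold:
            return 0.0  # any veto violation overrides every other dimension
    total_weight = sum(item.weight for item in items)
    credit = sum(item.weight * min(item.score, item.cap) / item.cap for item in items)
    return credit / total_weight


safe_answer = [
    ScoredItem("safety", weight=1.0, score=1.0, is_veto=True),
    ScoredItem("coverage", weight=2.0, score=0.9, cap=0.6),
    ScoredItem("format compliance", weight=1.0, score=0.5),
]
unsafe_answer = [
    ScoredItem("safety", weight=1.0, score=0.0, is_veto=True),
    ScoredItem("coverage", weight=2.0, score=1.0),
    ScoredItem("format compliance", weight=1.0, score=1.0),
]

print(aggregate(safe_answer))    # (1*1.0 + 2*1.0 + 1*0.5) / 4 -> 0.875
print(aggregate(unsafe_answer))  # safety veto triggered -> 0.0
```

The unsafe answer scores perfectly on every other dimension yet still receives zero, which is the behaviour a plain weighted sum cannot express.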

Rubrics in Reward‑Model Training

Traditional reward models output an opaque scalar, making it hard to trace why one answer is preferred. By training the reward model to first analyse each rubric item, the model can:

Expose the reasoning behind its preference, improving interpretability.

Receive fine‑grained supervision signals: rubric‑level reference annotations (human‑written or teacher‑model generated) guide the intermediate analysis.

Generate higher‑quality training data that focus on factuality, completeness, safety, and reasoning rather than superficial cues such as length or formatting.

Figure 4 ("Rubrics reward model") categorises three representative approaches.
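The idea of analysing each rubric item before stating a preference can be pictured with a small data structure. This is only a sketch of the output shape, not any of the approaches in the figure: the per-item scores and rationales below are hard-coded where a trained reward model would generate them, and `ItemVerdict` and `prefer` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemVerdict:
    criterion: str
    weight: float
    score_a: float   # rubric-level score for answer A, in [0, 1]
    score_b: float   # rubric-level score for answer B, in [0, 1]
    rationale: str   # intermediate analysis that explains the scores


def prefer(verdicts: list[ItemVerdict]) -> tuple[str, list[str]]:
    """Derive a preference from rubric-level verdicts and return a readable trace."""
    margin = sum(v.weight * (v.score_a - v.score_b) for v in verdicts)
    winner = "A" if margin > 0 else "B" if margin < 0 else "tie"
    trace = [
        f"{v.criterion}: A={v.score_a:.1f}, B={v.score_b:.1f} ({v.rationale})"
        for v in verdicts
        if v.score_a != v.score_b  # only items that separate the two answers
    ]
    return winner, trace


verdicts = [
    ItemVerdict("factuality", 3.0, 1.0, 0.0, "B misstates the cited figure"),
    ItemVerdict("formatting", 1.0, 0.0, 1.0, "A omits the requested headings"),
    ItemVerdict("safety", 2.0, 1.0, 1.0, "neither answer gives unsafe advice"),
]
winner, trace = prefer(verdicts)
print(winner)
for line in trace:
    print(line)
```

Because the preference is computed from the per-item verdicts, the trace shows that answer A wins on factuality despite losing on formatting, which is the kind of explanation an opaque scalar cannot provide.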

Rubrics for Evaluation

Rubrics serve as explicit standards for open‑ended evaluation. The survey organises existing rubric‑based benchmarks into:

General tasks: reasoning, deep‑research report generation, open‑generation, agent ability, alignment. Dimensions include factual correctness, coverage, evidence support, safety, usability, etc.

Domain‑specific tasks: medical, legal, financial. Benchmarks assess factual accuracy, safety, professional expression, auditability, risk disclosure, and practical usability.

Representative benchmark illustrations are shown in the figures "General task benchmarks" and "Domain-specific benchmarks".

Open Challenges

Reward hacking – models may learn to satisfy superficial rubric features without genuine quality improvement.

Generalization – rubrics derived from specific tasks can cause reward models to overfit and lose transferability to new domains.

Evaluation bias – the wording of rubrics and the choice of judge model can introduce systematic bias.

Personalized rubrics & safety – tailoring rubrics to user preferences risks over‑fitting to shallow preferences or conflicting with safety constraints; maliciously altered rubrics could become an attack surface.
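One simple way to look for the evaluation bias listed above is to score the same answer with several judges (or several wordings of the same rubric) and inspect how far the per-item scores spread. The sketch below assumes such scores are already collected; the judge names and numbers are made up for illustration, and `judge_spread` is a hypothetical helper rather than a method from the survey.

```python
from __future__ import annotations

from statistics import pstdev


def judge_spread(scores_by_judge: dict[str, dict[str, float]]) -> dict[str, float]:
    """Per rubric item, the population standard deviation of scores across judges."""
    item_names = list(next(iter(scores_by_judge.values())).keys())
    return {
        item: pstdev([judge_scores[item] for judge_scores in scores_by_judge.values()])
        for item in item_names
    }


# Made-up scores for one answer from two hypothetical judges.
scores_by_judge = {
    "judge_a": {"safety": 1.0, "coverage": 0.8},
    "judge_b": {"safety": 1.0, "coverage": 0.2},
}

spread = judge_spread(scores_by_judge)
print(spread)
```

Items with a large spread are candidates for rewording or for a closer look at the judge choice; a spread of zero does not prove the absence of bias, since all judges may share the same one.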

Conclusion

Rubrics provide an explicit, structured, and explainable quality interface that links human expectations, task requirements, and model behaviour. By decomposing a “good answer” into concrete, checkable standards, rubrics enable more reliable training signals, interpretable reward models, and transparent evaluation for the increasingly open, high-risk, and agentic applications of LLMs.

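Source: 机器之心

The original article is about 4,500 words, with a suggested reading time of 5 minutes.

In recent years, as large models have moved from simple question answering to deep research, medical consulting, multimodal generation, and long-horizon Agent tasks, one basic question has become increasingly hard to answer: how exactly should we judge the quality of a model's output?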