Skill‑RM Shows More Resources Can Harm LLM Scoring – A Deep Dive into Alibaba’s New Evaluation Framework

The Skill‑RM paper reveals that simply appending evaluation resources can degrade large‑model scoring, while structuring those resources into a Reward‑Evaluation Skill boosts performance across benchmarks, best‑of‑N selection, and RL‑based instruction following.

PaperAgent

What’s New in Skill‑RM: Turning Evaluation into a Skill

The core component of Skill‑RM is the Reward‑Evaluation Skill, which can be seen as an evaluation manual for the judge but goes beyond a prompt by containing five elements:

procedural specification: the step‑by‑step process the judge should follow;

resource bank: where rubrics, references, checklists, verifiers, tools, and calibration rules are stored;

invocation protocol: when to list resources, check them, and call tools;

evidence schema: what evidence each judgment must bind to;

output contract: the required format of the final output.

This transforms the judge from “reading all material and scoring directly” into a structured workflow: identify the evaluation target, activate relevant criteria, retrieve or execute resources, fill a judgment with evidence, and finally map it to a deterministic readout such as a pointwise score, pairwise preference, or best‑of‑N choice.
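To make the five elements and the workflow concrete, here is a minimal Python sketch. It is not the paper's implementation: the class name, field names, toy checkers, and the pass-rate readout are all illustrative assumptions, chosen only to show how resources can be invoked selectively and how every judgment can be tied to evidence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RewardEvaluationSkill:
    """Hypothetical container for the five elements described above."""
    procedure: List[str]                             # procedural specification
    resource_bank: Dict[str, Callable[[str], bool]]  # resource bank (toy checkers)
    invocation: Dict[str, List[str]]                 # invocation protocol: criterion -> resources
    evidence_fields: List[str]                       # evidence schema
    output_format: str                               # output contract

    def judge(self, response: str, criteria: List[str]) -> dict:
        evidence = []
        for criterion in criteria:
            # Call only the resources that the protocol maps to this criterion.
            for name in self.invocation.get(criterion, []):
                record = {
                    "criterion": criterion,
                    "resource": name,
                    "passed": self.resource_bank[name](response),
                }
                # Enforce the evidence schema on every record.
                if set(record) != set(self.evidence_fields):
                    raise ValueError("evidence record violates the schema")
                evidence.append(record)
        # Deterministic readout (an assumption): fraction of checks passed.
        passed = sum(1 for e in evidence if e["passed"])
        score = passed / len(evidence) if evidence else 0.0
        return {"format": self.output_format, "score": score, "evidence": evidence}


skill = RewardEvaluationSkill(
    procedure=["identify target", "activate criteria", "invoke resources",
               "bind evidence", "read out"],
    resource_bank={
        "has_answer_tag": lambda r: "Answer:" in r,
        "under_50_words": lambda r: len(r.split()) < 50,
    },
    invocation={"format": ["has_answer_tag"], "brevity": ["under_50_words"]},
    evidence_fields=["criterion", "resource", "passed"],
    output_format="pointwise",
)

result = skill.judge("Answer: 42", ["format", "brevity"])
print(result["score"], len(result["evidence"]))
```

The point of the sketch is the control flow: the judge never sees the whole resource bank at once, only the resources the protocol activates for the criteria in play, and the final score is computed from recorded evidence rather than from a free-form impression.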

Skill‑RM Overview

The Most Critical Table: More Data Does Not Equal Better Evaluation

Table 4 in the paper presents a resource‑use ablation. Counter‑intuitively, directly appending resources drops the average score from 83.9 to 81.0, indicating that the bottleneck is not the quantity of material but the ability to organize the evaluation process.

If resources are merely laid out for the model, they become noise. When they are incorporated into the Reward‑Evaluation Skill—becoming selectable, callable, evidence‑bound, and readable—they become part of the judgment capability.

In the Qwen‑3.5‑27B matched setting, the baseline judge averages 83.9 across RewardBench2, RM‑Bench, and JudgeBench. Skill‑RM raises this to 86.2, and adding sample‑specific resources pushes the average to 89.1. For a 9B model the picture differs: Skill‑RM improves the score from 60.8 to 66.2, but adding sample‑specific resources on top lowers it to 65.7, which suggests that smaller models may not handle extra evidence well.
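The differences behind these figures are easy to tabulate. The short Python snippet below uses only the numbers quoted in this article; the dictionary keys are labels invented here, and the 9B row is read as baseline, then Skill‑RM, then Skill‑RM plus sample‑specific resources, which is an interpretation of the text rather than something copied from the paper's table.

```python
# Average scores quoted in the article (labels are illustrative assumptions).
scores_27b = {
    "baseline": 83.9,
    "append_resources": 81.0,
    "skill_rm": 86.2,
    "skill_rm_plus_sample_resources": 89.1,
}
scores_9b = {
    "baseline": 60.8,
    "skill_rm": 66.2,
    "skill_rm_plus_sample_resources": 65.7,
}


def deltas(scores, reference="baseline"):
    """Return each setting's change relative to the reference setting."""
    base = scores[reference]
    return {k: round(v - base, 1) for k, v in scores.items() if k != reference}


delta_27b = deltas(scores_27b)
delta_9b = deltas(scores_9b)
# Change for the 9B model when sample-specific resources are added to Skill-RM.
extra_9b = round(
    scores_9b["skill_rm_plus_sample_resources"] - scores_9b["skill_rm"], 1
)

print(delta_27b)
print(delta_9b)
print(extra_9b)
```

Under that reading, naive appending costs the 27B judge 2.9 points while the structured skill gains 2.3, and the extra sample‑specific resources help the 27B model by a further 2.9 points but cost the 9B model 0.5.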

Table 4: Resource‑Use Ablation

Beyond Benchmarks: Applications to Selection and RL

Skill‑RM’s unified framework can output pointwise scores, pairwise preferences, or best‑of‑N selections. In a Best‑of‑10 scenario evaluating GSM8K, IFEval, HumanEval+, and BigCodeBench, Skill‑RM achieves 97.8 on GSM8K, close to the Oracle@10 score of 97.9, and shows larger gains on IFEval and HumanEval+. BigCodeBench remains challenging, indicating room for improvement on complex code tasks.
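For readers unfamiliar with the setup, best‑of‑N selection and the Oracle@N ceiling can be sketched in a few lines of Python. This is a generic illustration, not the paper's evaluation code: the function names, the toy candidates, and the toy judge scores are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

# One problem = (candidate answers, flags saying which candidates are correct).
Problem = Tuple[Sequence[str], Sequence[bool]]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> int:
    """Index of the candidate the judge scores highest (first one on ties)."""
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


def selection_accuracy(problems: List[Problem], score: Callable[[str], float]) -> float:
    """Fraction of problems where the judge's pick is a correct candidate."""
    hits = sum(1 for cands, flags in problems if flags[best_of_n(cands, score)])
    return hits / len(problems)


def oracle_at_n(problems: List[Problem]) -> float:
    """Upper bound: fraction of problems with at least one correct candidate."""
    return sum(1 for _, flags in problems if any(flags)) / len(problems)


# Toy data: three problems and a made-up judge score per candidate.
problems = [
    (["4", "5", "3"], [True, False, False]),
    (["7", "8"], [False, False]),
    (["1", "2"], [False, True]),
]
toy_scores = {"4": 0.9, "5": 0.2, "3": 0.1, "7": 0.5, "8": 0.4, "1": 0.7, "2": 0.6}

accuracy = selection_accuracy(problems, toy_scores.__getitem__)
oracle = oracle_at_n(problems)
print(accuracy, oracle)
```

The gap between selection accuracy and the oracle value is what a better judge can still recover, which is why a GSM8K result of 97.8 against an Oracle@10 of 97.9 leaves almost no headroom.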

Best‑of‑10 Results

On IF‑RewardBench, Skill‑RM attains an average Kendall correlation of 0.524, ahead of Gemini‑3‑Flash (0.513) and Qwen‑3.5‑27B (0.411). Gemini‑3‑Flash still performs better on the System‑Prompt subset.
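As a reminder of what a Kendall correlation measures, the Python sketch below computes Kendall's tau‑a between a judge's scores and a reference ranking by counting concordant and discordant pairs. The article does not say which tau variant or tie handling the benchmark uses, so tau‑a is an assumption here, and the sample scores are invented.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1   # both sequences order the pair the same way
        elif product < 0:
            discordant += 1   # the sequences disagree on the pair's order
        # Tied pairs (product == 0) count toward neither total.
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


judge_scores = [0.9, 0.4, 0.7, 0.1]   # hypothetical judge outputs
reference = [4, 3, 2, 1]              # hypothetical reference quality levels

tau = kendall_tau_a(judge_scores, reference)
print(tau)
```

In this toy case five of the six pairs are ordered consistently and one is not, giving a tau of (5 - 1) / 6, or about 0.67. A value of 1 means the judge reproduces the reference ordering exactly, and a value of -1 means it reverses it.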

In downstream instruction‑following RL, Skill‑RM scores an average of 45.9, higher than Tulu 3 (45.1) and VerIF (44.7). The improvement is modest but demonstrates that Skill‑RM can serve as a reward signal within training pipelines, not only as an offline evaluator.

Takeaways for Post‑Training Practitioners

For those working on RLHF, RLAIF, reward models, AI evaluation, or agent systems, the paper does not prescribe that everyone must adopt Skill‑RM. Instead, it highlights that evaluation quality may stem from process orchestration rather than merely scaling up the judge.

Historically, many improvements focused on stronger models, longer prompts, more references, or additional tools. Skill‑RM’s value lies in converting these loosely structured materials into an executable evaluation workflow.

Paper title: Skill‑RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
Paper link: https://arxiv.org/html/2606.03980v1
GitHub: https://github.com/Qwen-Applications/Skill-RM
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
