Why Binary Success Rate Is Obsolete: Introducing PRM-as-a-Judge for Dense Evaluation of Embodied Tasks
The article critiques binary success rate for long‑horizon robotic tasks, proposes the PRM-as-a-Judge framework with a potential‑based progress signal and the three‑layer OPD metric suite, validates it on the RoboPulse benchmark, and shows how it yields fine‑grained, diagnostic insights into policy performance.
1. Why binary success rate is insufficient for long‑horizon tasks
Traditional evaluation of embodied robotics relies on a binary success label that only answers whether a task was completed. For complex, multi‑stage, contact‑rich tasks this metric lacks resolution: it cannot indicate how far a policy progressed, whether the execution was stable, or at which stage a failure occurred.
The authors identify two main shortcomings:
Resolution deficiency: Different trajectories that both end in failure are collapsed into a single label, hiding important differences in progress depth.
Limited diagnostic power: Binary outcomes do not reveal how a robot succeeded or why it failed, preventing deeper analysis of bottlenecks.
Consequently, evaluating long‑horizon policies requires metrics that capture intermediate progress, stability, and failure modes.
2. From result‑based to process‑level evaluation
Instead of relying on privileged simulator signals (exact pose, contact forces), the authors base evaluation on visual state evolution. They assign each state a scalar progress potential \(\Phi\) in the range \([0,1]\). A trajectory thus becomes a continuous progress curve, enabling measurement of depth, back‑tracking, and stagnation.
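To make this concrete, here is a minimal sketch of what a progress curve gives you that a binary label cannot. The per-frame potentials are hypothetical placeholder values; in the paper they would come from a learned PRM judge operating on visual states.

```python
# Sketch: summarizing a rollout's scalar potentials Phi in [0, 1].
# The potential values passed in are illustrative, not from the paper.
from typing import List


def progress_curve(potentials: List[float]) -> dict:
    """Summarize a trajectory's progress curve: depth, back-tracking, stagnation."""
    assert all(0.0 <= p <= 1.0 for p in potentials)
    deltas = [b - a for a, b in zip(potentials, potentials[1:])]
    return {
        "max_progress": max(potentials),                    # how deep the rollout got
        "backtracking": sum(-d for d in deltas if d < 0),   # total regression
        "stagnant_steps": sum(1 for d in deltas if abs(d) < 1e-3),
    }


# A rollout that advances, regresses slightly, stalls, then resumes:
curve = progress_curve([0.0, 0.2, 0.5, 0.4, 0.4, 0.6])
```

A binary metric would record this rollout only as a failure; the curve additionally shows it reached 60% progress, regressed once, and stalled for one step.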
3. Requirements for a dense evaluator
The framework formalizes dense evaluation with two core properties:
Macro‑consistency: Progress values must be additive over time; splitting a trajectory into shorter segments should not change the accumulated progress.
Micro‑resolution: The evaluator must detect fine‑grained, task‑relevant state changes rather than coarse visual differences.
Using a potential‑based formulation, macro‑consistency is guaranteed by construction, while micro‑resolution is verified with a dedicated benchmark.
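The "guaranteed by construction" claim follows from a telescoping sum: if per-step progress is defined as the potential difference, segment sums collapse to endpoint differences no matter where you split. A small sketch with illustrative potential values:

```python
# Sketch of macro-consistency: potential-difference progress telescopes,
# so any split of the trajectory accumulates to the same total.
# The potential values below are illustrative, not from the paper.


def segment_progress(potentials, start, end):
    """Accumulated progress over the frame interval [start, end]."""
    return sum(potentials[i + 1] - potentials[i] for i in range(start, end))


phi = [0.0, 0.1, 0.35, 0.3, 0.7, 1.0]

whole = segment_progress(phi, 0, 5)
split = segment_progress(phi, 0, 2) + segment_progress(phi, 2, 5)
# Both equal phi[-1] - phi[0]: the split point is irrelevant.
```

This is why the additivity property needs no empirical verification, while micro-resolution (does the evaluator track the *right* fine-grained changes?) does.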
4. OPD: Outcome–Process–Diagnosis three‑layer metric suite
Building on the progress potential, the authors define OPD:
Outcome: Milestone Coverage (MC) and Max Progress (MP) quantify how far a trajectory reaches.
Process: Path‑weighted Progress Length (PPL) measures efficiency and redundancy; higher PPL indicates smoother, more monotonic progress.
Diagnosis: Cumulative Regret Area (CRA) captures total back‑tracking cost, while Stagnation Ratio (STR) measures the proportion of time with negligible task‑related progress.
These five core indicators together provide a structured, diagnostic view of execution.
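To illustrate how the three layers derive from one progress curve, here is a hedged sketch of the five indicators. The formulas are plausible stand-ins matching the textual descriptions (the paper's exact definitions and scales, e.g. percentage-based PPL, may differ), and the milestone thresholds are assumptions.

```python
# Hedged sketch of the five OPD indicators from a progress curve.
# Formulas, milestone thresholds, and the stagnation epsilon are
# illustrative stand-ins, not the paper's exact definitions.


def opd_metrics(phi, milestones=(0.25, 0.5, 0.75, 1.0), eps=1e-3):
    deltas = [b - a for a, b in zip(phi, phi[1:])]
    mp = max(phi)                                             # Max Progress
    mc = sum(mp >= m for m in milestones) / len(milestones)   # Milestone Coverage
    path_len = sum(abs(d) for d in deltas)                    # total movement, forward + backward
    ppl = (phi[-1] - phi[0]) / path_len if path_len else 0.0  # net/total: 1.0 = monotonic
    cra = sum(-d for d in deltas if d < 0)                    # Cumulative Regret Area
    stag = sum(abs(d) < eps for d in deltas) / len(deltas)    # Stagnation Ratio
    return {"MC": mc, "MP": mp, "PPL": ppl, "CRA": cra, "STR": stag}


# A rollout that reaches the goal but back-tracks once and stalls once:
metrics = opd_metrics([0.0, 0.3, 0.6, 0.5, 0.5, 1.0])
```

Note how the Outcome layer (MC, MP) says "it finished", while the Process and Diagnosis layers expose the non-monotonicity and stall that a binary label would hide.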
5. RoboPulse: Benchmark for micro‑resolution
RoboPulse converts progress evaluation into a pairwise judgment task: given two consecutive frames from the same trajectory, the evaluator must decide whether the later frame represents forward or backward progress. The benchmark filters out static or oscillatory segments and samples pairs at Small, Medium, and Large hop distances.
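The pair-construction procedure can be sketched as follows. The hop sizes, filtering threshold, and samples-per-hop below are illustrative assumptions, not the benchmark's actual parameters, and `phi` stands in for whatever ground-truth progress signal the benchmark uses for labeling.

```python
# Sketch of RoboPulse-style pair construction: sample frame pairs at
# Small/Medium/Large temporal hops, drop near-static segments, and
# label each pair by the sign of the ground-truth progress change.
# Hop sizes, min_change, and n_per_hop are illustrative assumptions.
import random


def sample_pairs(phi, hops=(("small", 1), ("medium", 5), ("large", 15)),
                 min_change=0.02, n_per_hop=2, seed=0):
    rng = random.Random(seed)
    pairs = []
    for name, h in hops:
        # Filter: keep only pairs with a clear progress change,
        # discarding static or oscillatory segments.
        candidates = [(i, i + h) for i in range(len(phi) - h)
                      if abs(phi[i + h] - phi[i]) >= min_change]
        for i, j in rng.sample(candidates, min(n_per_hop, len(candidates))):
            pairs.append({"hop": name, "frames": (i, j),
                          "label": "forward" if phi[j] > phi[i] else "backward"})
    return pairs


# Example: a steadily advancing 30-frame trajectory.
pairs = sample_pairs([i / 30 for i in range(30)])
```

The Small-hop regime is the hard case: adjacent frames differ only by fine-grained, task-relevant changes, which is exactly what the micro-resolution property demands of an evaluator.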
RoboPulse contains 1,800 pairwise samples drawn from 1,622 trajectories across 816 tasks and seven data sources (real robots, simulation, UMI capture, first‑person human video, etc.).
Experiments compare PRM‑based judges against CLIP‑based similarity methods and multimodal foundation models (e.g., Gemini, GPT‑5.2). Results show PRM judges achieve higher accuracy overall (e.g., Robo‑Dopamine 0.83 vs. Gemini 0.66) and especially in the challenging Small‑hop regime (Robo‑Dopamine 0.80 vs. Gemini 0.54), confirming superior micro‑resolution.
6. Re‑examining real policy trajectories with OPD
After validating micro‑resolution, the authors apply PRM‑as‑a‑Judge to five representative policies (DP, ACT, RDT, pi0, OpenVLA‑OFT) on multiple long‑horizon tasks, running 50 rollouts per policy‑task pair.
6.1 Where do failures occur?
Outcome metrics reveal that many rollouts clear early milestones but collapse near the end (e.g., MC@25 = 84–100 % but MC@100 ≈ 0–8 %). Failures are thus concentrated in the final stages rather than distributed uniformly across the trajectory.
6.2 Success does not equal quality
Among successful rollouts, DP attains a high PPL (94.9) and low CRA (0.26), indicating efficient, low‑regret execution, whereas other policies achieve lower PPL or higher CRA despite reaching the goal.
6.3 Different failure mechanisms
Diagnosis metrics differentiate failure modes: OpenVLA‑OFT exhibits high CRA (26.3) and moderate MP, indicating late‑stage back‑tracking, while ACT shows high STR (65.4), reflecting early stagnation.
6.4 Joint OPD profiles expose method gaps
Applying OPD to the RoboChallenge Table30 leaderboard shows that top models (e.g., DM0) excel not only in completion rate but also in MC, MP, and PPL, whereas models like GigaBrain‑0.1 achieve comparable MP but lower MC@100, revealing a “last‑mile” gap.
7. Interactive trajectory audit
The project website provides an interactive demo where users can sync video playback with the evolving OPD signals (MC, MP, PPL, CRA, STR), enabling frame‑level inspection of progress, back‑tracking, and stagnation.
8. Conclusion: From “did it finish?” to “how did it finish?”
PRM‑as‑a‑Judge extends evaluation beyond binary success by introducing a potential‑based progress signal, the OPD three‑layer metric suite, and the RoboPulse benchmark for fine‑grained validation. This dense, diagnostic framework uncovers nuanced differences in policy behavior, offering a more informative basis for advancing embodied robotic research.
