Can LLM Judges Be Trusted? TrustJudge Leverages Full Probability Distributions

LLM judges often produce contradictory scores and non‑transitive preferences; the TrustJudge framework replaces discrete scoring with distribution‑sensitive scoring and likelihood‑aware aggregation, dramatically reducing both score‑comparison and pairwise‑transitivity inconsistencies across multiple model families, improving accuracy and even serving as a reward signal for RL training.

Machine Heart

1. How Severe Is the Problem?

Using large language models (LLMs) as judges is now standard (e.g., MT‑Bench uses single scores, AlpacaEval uses pairwise comparisons, RLHF/GRPO uses preference labels), but single‑answer scoring and pairwise comparison frequently contradict each other. On Llama‑3.1‑70B, the score‑comparison inconsistency rate is 23.32% (roughly one contradiction in every four evaluations) and the pairwise‑transitivity inconsistency rate is 15.22%.
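For concreteness, here is a minimal sketch of how these two inconsistency rates can be computed from a judge's outputs; the data structures (score dictionaries, a pairwise verdict table with "tie" entries) are illustrative assumptions, not the paper's exact bookkeeping.

```python
from itertools import combinations, permutations

def rel(pairwise, x, y):
    """+1 if x is preferred over y, -1 if y is preferred, 0 for a tie."""
    winner = pairwise[(x, y)] if (x, y) in pairwise else pairwise[(y, x)]
    return 0 if winner == "tie" else (1 if winner == x else -1)

def score_comparison_inconsistency(scores, pairwise):
    """Fraction of pairs where single-answer scores contradict the pairwise verdict."""
    bad = total = 0
    for a, b in combinations(scores, 2):
        total += 1
        expected = 0 if scores[a] == scores[b] else (1 if scores[a] > scores[b] else -1)
        if rel(pairwise, a, b) != expected:
            bad += 1
    return bad / max(total, 1)

def transitivity_inconsistency(pairwise, ids):
    """Fraction of triples that violate transitivity: a strict cycle
    (A > B, B > C, but not A > C) or a broken tie chain
    (A = B, B = C, but A != C)."""
    bad = total = 0
    for triple in combinations(ids, 3):
        total += 1
        for x, y, z in permutations(triple):
            r_xy, r_yz, r_xz = rel(pairwise, x, y), rel(pairwise, y, z), rel(pairwise, x, z)
            if (r_xy > 0 and r_yz > 0 and r_xz <= 0) or (r_xy == 0 and r_yz == 0 and r_xz != 0):
                bad += 1
                break
    return bad / max(total, 1)
```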

2. What Causes These Inconsistencies?

The authors quantify two inconsistency types and trace both to the same root causes: information loss in discrete scoring and ambiguous ties in pairwise comparison. A 5‑point score collapses continuous model confidence into a few integer bins, discarding subtle differences (e.g., 3.8 vs 4.2 both become 4). More formally, two distinct probability distributions can share the same discrete score while having different entropies, showing that discretization inevitably loses information. Ambiguous ties arise when the model is unsure, producing inconsistent "equal" judgments that break transitivity (A = B, B = C, but A ≠ C).
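A toy example makes the information-loss point concrete. Below, two hypothetical judge distributions over a 1–5 scale both have argmax 4, so a discrete judge reports "4" for each, yet their expectations and entropies differ; the numbers are invented for illustration.

```python
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def expectation(p, scores=(1, 2, 3, 4, 5)):
    return sum(s * q for s, q in zip(scores, p))

# Two hypothetical judge distributions over scores 1..5.
# Both have argmax 4, so greedy decoding reports "4" for each.
p_confident = [0.00, 0.02, 0.08, 0.85, 0.05]
p_hesitant  = [0.00, 0.05, 0.30, 0.40, 0.25]

for name, p in [("confident", p_confident), ("hesitant", p_hesitant)]:
    print(f"{name}: discrete score = 4, "
          f"expected score = {expectation(p):.2f}, entropy = {entropy(p):.2f}")
# The discrete scores are identical, but the expectations (3.93 vs 3.85)
# and entropies differ: exactly the information that discretization discards.
```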

3. How Does TrustJudge Work?

The core idea is simple: don’t rely only on the discrete answer; also use the underlying probability distribution. TrustJudge consists of two components:

Distribution‑Sensitive Scoring: expand the scale from 5 points to 100, apply softmax to the logits of all candidate score tokens to obtain a full probability distribution, and take the probability‑weighted expectation as the final score. This preserves fine‑grained differences (e.g., two replies that both received a 4 can now become 3.82 and 4.17).
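A minimal sketch of this scoring step in PyTorch follows; the token ids, vocabulary size, and logit values are made up, and the assumption that each candidate score is a single token is a simplification (multi-digit scores may tokenize into several pieces in practice).

```python
import torch

def distribution_sensitive_score(logits, score_token_ids, score_values):
    """Expected score from the judge's next-token distribution.

    logits:          1-D tensor over the vocabulary at the score position
    score_token_ids: token ids of the candidate score tokens
    score_values:    numeric value of each candidate (e.g. 1..100)

    Softmax is taken over the candidate tokens only, so the probabilities
    sum to exactly 1 and non-score tokens cannot interfere (the
    normalization step the text contrasts with G-Eval).
    """
    candidate_logits = logits[score_token_ids]
    probs = torch.softmax(candidate_logits, dim=-1)
    values = torch.tensor(score_values, dtype=probs.dtype)
    return (probs * values).sum().item()

# Toy usage with made-up logits over a 5-point scale:
vocab_logits = torch.full((32000,), -10.0)
score_ids = [16, 17, 18, 19, 20]  # hypothetical ids for tokens "1".."5"
vocab_logits[score_ids] = torch.tensor([0.1, 0.5, 2.0, 3.1, 1.2])
print(distribution_sensitive_score(vocab_logits, score_ids, [1, 2, 3, 4, 5]))
# ~3.70: a fractional score instead of a bare integer
```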

Likelihood‑Aware Aggregation: to break ambiguous ties in pairwise comparison, two strategies are offered (a code sketch follows the list):

PPL‑Based: compute the perplexity of the two possible orderings (A > B vs B > A) and select the ordering with lower perplexity, i.e., the one the model finds more fluent.

Bidirectional Probability Aggregation: sum the preference probabilities from both directions and pick the direction with higher confidence, which also cancels position bias.
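As promised above, a compact sketch of both strategies; `sequence_ppl` (a callable returning a prompt's perplexity under the judge) and the two-entry probability dictionaries are hypothetical interfaces for illustration, not the paper's API.

```python
def ppl_tiebreak(sequence_ppl, prompt_ab, prompt_ba):
    """PPL-based: score both orderings and keep the one the judge reads
    more fluently, i.e. the one with lower perplexity."""
    return "A>B" if sequence_ppl(prompt_ab) < sequence_ppl(prompt_ba) else "B>A"

def bidirectional_aggregate(p_forward, p_backward):
    """Bidirectional probability aggregation: sum each answer's win
    probability over both presentation orders, so position bias cancels."""
    a_total = p_forward["A"] + p_backward["A"]
    b_total = p_forward["B"] + p_backward["B"]
    return "A>B" if a_total > b_total else "B>A"

# Toy usage: the judge slightly favors whichever answer it sees first,
# but summing over both orders still yields a consistent verdict.
print(bidirectional_aggregate({"A": 0.55, "B": 0.45}, {"A": 0.52, "B": 0.48}))
# -> "A>B"
```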

Unlike G‑Eval, TrustJudge normalizes with softmax so that probabilities sum to exactly one, preventing non‑score tokens from interfering.

4. Theoretical Guarantees

TrustJudge’s design is backed by formal proofs. Theorem 1 (Information Preservation) shows that two different probability distributions that collapse to the same discrete score can be distinguished by distribution‑sensitive scoring. Proposition 1 (Uncertainty Reduction) proves that the entropy of the confidence distribution derived from the perplexity‑based method is strictly lower than the maximum entropy of the original discrete judgment.
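To see Proposition 1 in a toy setting: a discrete win/tie/lose judgment can carry up to ln 3 nats of entropy, while a two-way confidence distribution built from the orderings' perplexities stays below that ceiling. The perplexity numbers and the inverse-PPL weighting below are invented for illustration, not the paper's formula.

```python
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

# Hypothetical perplexities of the two orderings of the same comparison.
ppl_ab, ppl_ba = 6.1, 7.4
# Convert to a confidence distribution (lower PPL -> higher weight);
# this inverse-PPL weighting is an illustrative choice.
w_ab, w_ba = 1 / ppl_ab, 1 / ppl_ba
z = w_ab + w_ba
confidence = [w_ab / z, w_ba / z]

print(f"max entropy of discrete win/tie/lose judgment: {math.log(3):.3f} nats")
print(f"entropy of PPL-derived confidence: {entropy(confidence):.3f} nats")
# ~0.69 nats, strictly below the ln 3 = 1.099 ceiling, matching the
# uncertainty-reduction claim.
```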

5. Experimental Results

Experiments use MT‑Bench (80 questions) and ArenaHard (500 questions) with judge models from the Llama‑3 series (3B/8B/70B) and GPT‑4o.

Main experiments show that TrustJudge reduces both inconsistency types on all models and raises exact‑match accuracy. For example, Llama‑3.2‑3B’s pairwise‑transitivity inconsistency drops from 54.69% to 17.76% (a 37‑point reduction).

Ablation studies (including GPT‑3.5‑Turbo as a reference) show that both the 100‑point scale and softmax normalization contribute to lowering score‑comparison inconsistency, while the two likelihood‑aware strategies, bidirectional probability aggregation and the PPL‑based method, each significantly curb transitivity errors, with the former slightly ahead overall.

6. Does It Generalize to Other Models?

The authors extend the evaluation to Qwen‑2.5 (7B/14B/32B), Gemma‑2 (2B/9B/27B), Llama‑3 (3B/8B/70B), and four GPT families (12 variants total). Findings:

Distribution‑sensitive scoring consistently reduces inconsistencies regardless of architecture.

With likelihood‑aware aggregation, an 8B model can outperform a 70B model lacking TrustJudge.

Smaller Gemma‑2 (9B) sometimes beats its larger counterpart (27B), indicating that bigger is not always better.

7. Counter‑Intuitive Observation: Reasoning‑Focused Models May Be Less Reliable

Models trained heavily on reasoning (e.g., DeepSeek‑R1 distills) exhibit higher score‑comparison inconsistency (58.75%), almost double that of similarly sized Llama models; even so, TrustJudge cuts their score‑comparison inconsistency by roughly 10 points and reduces transitivity inconsistency from 63.98% to 18.50%.

8. Using TrustJudge as a Reward Signal

Beyond evaluation, TrustJudge can generate reward signals for reinforcement learning. Integrated with GRPO, it trains Qwen2.5‑7B‑Instruct on 8,600 instruction‑following examples. Compared with a baseline reward, TrustJudge‑derived rewards consistently yield higher final performance across multiple metrics, as shown by reward curves and task‑level evaluations.
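A hedged sketch of wiring a TrustJudge-style expected score into a GRPO-style reward follows; the `judge_expected_score` callable, the [0, 1] normalization, and the group sizes are assumptions for illustration, not the paper's training code.

```python
def trustjudge_reward(prompt, response, judge_expected_score, max_score=100.0):
    """Map the judge's distribution-sensitive expected score to a [0, 1] reward.

    judge_expected_score: a callable that prompts the judge model and returns
    the softmax-weighted expectation over score tokens (see the scoring
    sketch above); its existence and signature are assumptions here.
    """
    score = judge_expected_score(prompt, response)  # e.g. 73.4 on a 100-point scale
    return score / max_score

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each sampled
    response's reward by its group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Toy usage: four sampled responses to one prompt, scored by the judge.
rewards = [0.61, 0.74, 0.58, 0.80]
print(grpo_advantages(rewards))  # dense, fine-grained signal instead of integer bins
```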

9. Is Higher Scoring Granularity Enough?

Increasing granularity alone (5 → 10 → 100 points) lowers inconsistency, but TrustJudge still outperforms the baseline at every granularity level. The key advantage comes from combining finer granularity with probability normalization.

10. Summary

Discrete scores discard information → adopt distribution‑sensitive scoring to retain probability details.

Ambiguous ties break transitivity → employ likelihood‑aware aggregation to clarify judgments.

TrustJudge works out‑of‑the‑box without additional training, improves evaluation consistency across Llama, GPT, Qwen, and Gemma families, and can serve as a reliable reward source for RL. In short, to let LLMs act as judges, we must first ensure the judges themselves are internally consistent.

Tags: reinforcement learning, reward modeling, LLM evaluation, probability distribution, inconsistency reduction, TrustJudge