DataFunTalk
Jan 21, 2026 · Artificial Intelligence

Why Traditional Coding Benchmarks Miss the Mark: Inside OctoCodingBench’s Process‑Level Evaluation

The article examines the rapid progress of AI coding agents, critiques existing benchmarks that only measure final correctness, and introduces OctoCodingBench—a new suite that simulates real‑world constraints, records full interaction traces, and evaluates both task success and strict process compliance across multiple languages.

AI evaluation · LLM-as-Judge · coding agents
10 min read
Baidu Tech Salon
Oct 10, 2025 · Artificial Intelligence

Navigating the 2025 AI Model Boom: Practical Evaluation Strategies

This article examines the rapid surge of large AI models in 2024‑2025, critiques the reliability of public leaderboards, and presents a business‑focused evaluation framework—including dataset construction, metric selection, automation, and LLM‑as‑judge techniques—to help developers choose the right model for real‑world applications.

AI benchmarks · AI performance · LLM-as-Judge
17 min read
AntTech
Sep 19, 2025 · Artificial Intelligence

How Reinforcement Learning Cuts Hallucinations in Large Language Models: Ant Insurance’s Proven Approach

Ant Insurance’s tech team leveraged reinforcement learning, focused data selection, and a multi‑dimensional reward system to dramatically reduce hallucinations in LLMs, achieving top‑ranked performance on the HHEM leaderboard and robust improvements across instruction‑following and reasoning‑enhanced models.

Hallucination Control · LLM · LLM-as-Judge
6 min read
Baidu Geek Talk
Sep 10, 2025 · Artificial Intelligence

How to Cut Through the LLM SOTA Hype: Practical Evaluation Strategies for 2025

Amid the 2025 surge of large language models, this article demystifies misleading SOTA claims, critiques benchmark reliability, and presents a comprehensive, business‑focused evaluation framework—including dataset construction, metric selection, automated scoring, and practical guidelines—to help developers and product teams choose the right model for real‑world applications.

AI benchmarking · LLM-as-Judge · business AI
18 min read