DataFunTalk
Jan 21, 2026 · Artificial Intelligence

Why Traditional Coding Benchmarks Miss the Mark: Inside OctoCodingBench’s Process‑Level Evaluation

The article examines the rapid progress of AI coding agents, critiques existing benchmarks that only measure final correctness, and introduces OctoCodingBench—a new suite that simulates real‑world constraints, records full interaction traces, and evaluates both task success and strict process compliance across multiple languages.

AI evaluation · LLM-as-Judge · coding agents
10 min read
Baidu Tech Salon
Oct 10, 2025 · Artificial Intelligence

Navigating the 2025 AI Model Boom: Practical Evaluation Strategies

This article examines the rapid surge of large AI models in 2024‑2025, critiques the reliability of public leaderboards, and presents a business‑focused evaluation framework—including dataset construction, metric selection, automation, and LLM‑as‑judge techniques—to help developers choose the right model for real‑world applications.

AI benchmarks · AI performance · LLM-as-Judge
17 min read
AntTech
Sep 19, 2025 · Artificial Intelligence

How Reinforcement Learning Cuts Hallucinations in Large Language Models: Ant Insurance’s Proven Approach

Ant Insurance’s tech team leveraged reinforcement learning, focused data selection, and a multi‑dimensional reward system to dramatically reduce hallucinations in LLMs, achieving top‑ranked performance on the HHEM leaderboard and robust improvements across instruction‑following and reasoning‑enhanced models.

Hallucination Control · LLM · LLM-as-Judge
6 min read
Baidu Geek Talk
Sep 10, 2025 · Artificial Intelligence

How to Cut Through the LLM SOTA Hype: Practical Evaluation Strategies for 2025

Amid the 2025 surge of large language models, this article demystifies misleading SOTA claims, critiques benchmark reliability, and presents a comprehensive, business‑focused evaluation framework—including dataset construction, metric selection, automated scoring, and practical guidelines—to help developers and product teams choose the right model for real‑world applications.

AI benchmarking · LLM-as-Judge · business AI
18 min read