Tagged articles

7 articles

Apr 2, 2026 · Artificial Intelligence

Everything‑Claude‑Code: The 132k‑Star Open‑Source Claude Plugin That Outshines Superpowers

Everything‑Claude‑Code is an open‑source Claude Code plugin suite with 36 sub‑agents, 150+ skills, and cross‑platform support. It offers pass@k validation, AgentShield security scans, and sandboxed agents. The author compares it with Superpowers and argues that it has deeper engineering, broader coverage, and a higher star count.

AI coding assistants · AgentShield · Claude Code

0 likes · 8 min read

Programmer DD

Jan 12, 2026 · Artificial Intelligence

5 Counterintuitive Lessons for Evaluating AI Agents Effectively

This article shares five surprising, high‑impact lessons from Anthropic on building robust AI agent evaluation suites: collecting failure cases early, recognizing clever “failures,” focusing on outcomes over process, choosing the right success metrics, and keeping human review in the loop.

AI Evaluation · Anthropic · Metrics

0 likes · 10 min read

AI Tech Publishing

Jan 10, 2026 · Artificial Intelligence

Anthropic Engineers Reveal a Pragmatic Framework for Evaluating AI Agents

Anthropic engineers explain why rigorous AI agent evaluation is essential, describe an evaluation harness built from tasks, trials, graders, and transcripts, compare capability and regression tests, discuss code-, model-, and human-based graders, and present an eight-step roadmap for building reliable agent assessment pipelines.

AI Agent · Capability Evaluation · Code-based Grader

0 likes · 12 min read

PaperAgent

Jan 10, 2026 · Artificial Intelligence

How to Build Robust Evaluations for AI Agents: A Complete Roadmap

Anthropic’s new blog post presents a framework for evaluating AI agents, detailing evaluation structures, metrics such as pass@k and pass^k, types of scorers, multi‑round testing, and a step‑by‑step roadmap for designing, maintaining, and integrating automated assessments into agent development pipelines.

AI Evaluation · AI agents · Evaluation Framework

0 likes · 15 min read

Baobao Algorithm Notes

Oct 31, 2025 · Artificial Intelligence

How Risk‑Sensitive Reinforcement Learning Improves LLM Pass@K Performance

This article analyzes why standard reinforcement learning can degrade Pass@K metrics after fine‑tuning large language models, introduces a risk‑sensitive RL objective that reshapes the advantage estimator, and demonstrates through bandit and mathematical‑reasoning experiments that the RS‑GRPO method consistently boosts diversity and overall Pass@K scores across multiple LLMs.

Exploration-Exploitation · LLM fine-tuning · RS-GRPO

0 likes · 12 min read

DataFunTalk

Apr 25, 2025 · Artificial Intelligence

Does Reinforcement Learning Really Expand Reasoning Capacity in Large Language Models? Insights from a Recent Empirical Study

Recent empirical research by Tsinghua’s LeapLab and Shanghai Jiao Tong University finds that reinforcement learning with verifiable rewards (RLVR) improves sampling efficiency but does not extend the reasoning abilities of large language models beyond those of their base models, as demonstrated across mathematics, code, and visual reasoning benchmarks.

AI research · RLVR · large language models

0 likes · 12 min read

Rare Earth Juejin Tech Community

Jul 30, 2023 · Artificial Intelligence

Understanding Codex: Training Framework, Evaluation Methodology, and Model Performance in ChatGPT’s Code Generation Ability

This article explains how Codex, built on the GPT‑3 architecture, is trained and fine‑tuned to give ChatGPT the ability to generate code, detailing the data collection, supervised fine‑tuning, and evaluation using HumanEval and the pass@k metric, and presenting performance comparisons with GPT‑3 and Codex‑S.

AI model training · ChatGPT · Code Generation

0 likes · 11 min read
