Tagged articles

evaluation framework

18 articles · Page 1 of 1

Jul 4, 2026 · Artificial Intelligence

Iterating Agent Skills with SkillRevise: Using Execution Traces for Continuous Improvement

SkillRevise tackles the overestimation of LLM‑authored agent skills by breaking down complex search tasks, attaching evidence to verifiable sources, and introducing trace‑conditioned revisions that let engineers pinpoint and fix failures across retrieval, reasoning, and presentation layers.

LLM Agents · RAG · SkillRevise

0 likes · 14 min read


PaperAgent

Jun 11, 2026 · Artificial Intelligence

Skill‑RM Shows More Resources Can Harm LLM Scoring – A Deep Dive into Alibaba’s New Evaluation Framework

The Skill‑RM paper reveals that simply appending evaluation resources can degrade large‑model scoring, while structuring those resources into a Reward‑Evaluation Skill boosts performance across benchmarks, best‑of‑N selection, and RL‑based instruction following.

Alibaba Qwen · Large Language Models · RLHF

0 likes · 7 min read


IT Architects Alliance

Jun 9, 2026 · Artificial Intelligence

From Implementer to Orchestrator: 7 Essential Skills Every 2026 Architect Must Master

The article shares a practitioner’s journey from chasing every new AI framework to focusing on seven durable capabilities—context management, tool design, data‑driven evaluation, robust harness, isolation, traceability, cost control, and disciplined multi‑agent collaboration—that will keep architects productive for years to come.

AI agents · Context Management · Harness

0 likes · 11 min read


Data Party THU

Jun 1, 2026 · Artificial Intelligence

How Steering Unlocks Controllable Large Models: Mechanisms, Evaluation, and Open‑Source Tools

This article reviews two ACL 2026 papers that explain why steering works for large language models, introduce a three‑stage behavior model and activation‑manifold hypothesis, propose the SPLIT method, present the SteerEval evaluation framework, and describe the EasyEdit2 open‑source toolkit.

Activation Manifold · EasyEdit2 · Large Language Models

0 likes · 13 min read


dbaplus Community

May 16, 2026 · Artificial Intelligence

Can Your AI Skill Pass? An 8‑Dimension Quantitative Evaluation Framework

This article introduces an eight‑dimension quantitative framework for assessing AI Skills, detailing each metric—from metadata quality to scope focus—explaining weighted scoring, demonstrating evaluations on real Skills and comparative cases, and presenting a multi‑model cross‑validation process with four execution strategies to turn subjective judgments into measurable grades.

AI Skill · Execution Strategy · Metadata Quality

0 likes · 16 min read


AI Tech Publishing

Apr 27, 2026 · Artificial Intelligence

Why Build Your Own AI Evaluation Harness? 7 OpenAI‑Inspired Recommendations

The article explains why generic AI testing platforms fall short, outlines how to design a testable AI system from day one, and presents seven practical recommendations—from using Codex or Claude Code to manage regression and iteration test sets, to leveraging entropy diagnostics and custom domain‑expert UX.

AI evaluation · OpenAI · Regression testing

0 likes · 8 min read


PMTalk Product Manager Community

Apr 14, 2026 · Product Management

Why Evaluation and Decomposition, Not Prototyping, Are the Core Skills for AI Product Managers

Traditional product tactics like building features first and relying on gradual rollout no longer work for AI agents; instead, AI product managers must adopt a rigorous, scenario‑driven evaluation framework that measures result quality, task completion, tool correctness, and security to ensure trustworthy, business‑critical performance.

AI Reliability · AI product management · Agent AI

0 likes · 10 min read


AI Step-by-Step

Mar 28, 2026 · Artificial Intelligence

How to Evaluate Agent Performance Across Different Scenarios

The article proposes a four‑dimensional framework—task result, output structure, behavior boundary, and long‑term stability—to systematically validate AI agents in varied business contexts such as e‑commerce, manufacturing, insurance, and HR, emphasizing concrete evidence over subjective impressions.

AI Agent · R&D Management · Scenario Validation

0 likes · 10 min read


Machine Learning Algorithms & Natural Language Processing

Mar 19, 2026 · Artificial Intelligence

From Language Modeling to World Modeling: Limits of Large Language Models

In a talk on using large language models as textual world models, Li Yixia of Southern University of Science and Technology defines a three‑layer evaluation framework and shows experimentally that fine‑tuned models improve next‑state prediction and agent performance, yet remain limited by behavior coverage and environment complexity.

Large Language Models · agent performance · evaluation framework

0 likes · 4 min read


Youzan Coder

Jan 13, 2026 · Artificial Intelligence

From Hackathon to Scalable AI Customer Service: Lessons and Best Practices

This article chronicles the end‑to‑end development of an AI‑driven customer service system, detailing the shift from a rapid‑prototype Dify platform to a hybrid engineering architecture, model selection strategies, workflow design, knowledge engineering, evaluation methods, and future directions for continuous improvement.

AI Customer Service · Prompt engineering · Workflow Engineering

0 likes · 21 min read


AI Tech Publishing

Jan 10, 2026 · Artificial Intelligence

Anthropic Engineers Reveal a Pragmatic Framework for Evaluating AI Agents

Anthropic engineers outline why rigorous AI Agent evaluation is essential, describe a comprehensive evaluation harness with tasks, trials, graders, and transcripts, compare capability and regression tests, discuss code-, model-, and human-based graders, and present an eight-step roadmap for building reliable Agent assessment pipelines.

AI Agent · Capability Evaluation · Code-based Grader

0 likes · 12 min read


PaperAgent

Jan 10, 2026 · Artificial Intelligence

How to Build Robust Evaluations for AI Agents: A Complete Roadmap

Anthropic’s new blog reveals a comprehensive framework for evaluating AI agents, detailing evaluation structures, metrics like pass@k and pass^k, types of scorers, multi‑round testing, and a step‑by‑step roadmap for designing, maintaining, and integrating automated assessments into agent development pipelines.

AI agents · AI evaluation · agent testing

0 likes · 15 min read


AI2ML AI to Machine Learning

Sep 24, 2025 · Artificial Intelligence

Key Points for Evaluating AI Agents

The article explains how Coze's Compass introduces a flexible evaluation system for AI agents, outlines a four‑dimensional submodule assessment (planning, tool use, self‑reflection, memory), and details specific testing criteria and challenges for web, scientific, dialogue, and programming agents.

AI agents · Benchmarking · Coze

0 likes · 6 min read

Bighead's Algorithm Notes

Aug 31, 2025 · Artificial Intelligence

Paper Review: AlphaEval – A Comprehensive, Efficient Framework for Evaluating Alpha Mining

AlphaEval is a unified, parallelizable evaluation framework that assesses Alpha mining models across predictive ability, time stability, market‑perturbation robustness, financial logic, and diversity without backtesting, matching full backtest results while offering higher efficiency and open‑source reproducibility.

Alpha Mining · LLM · Robustness

0 likes · 10 min read


DataFunSummit

Jun 13, 2024 · Product Management

Data‑Driven KOL Marketing Strategies for Game Growth in Western Markets

This article explains how Tencent IEGG leverages data‑driven KOL marketing across four key scenarios—budget planning, KOL evaluation, performance measurement, and competitor monitoring—to address cultural differences, optimize spend, and boost game adoption in the European and American PC and console markets.

Budget Planning · Competitive Monitoring · Game Marketing

0 likes · 16 min read


Airbnb Technology Team

Nov 3, 2022 · Artificial Intelligence

T-LEAF: A Taxonomy Learning and Evaluation Framework for Airbnb Community Support Classification System

The T‑LEAF framework introduces quantitative metrics for coverage, usefulness, and consistency to iteratively develop Airbnb’s unified Contact‑Reason taxonomy, enabling faster feedback loops, reducing “Other” classifications, and improving both human annotation agreement and machine‑learning prediction accuracy in production.

Community Support · classification · evaluation framework

0 likes · 14 min read


DataFunTalk

Feb 13, 2022 · Big Data

How Kuaishou Built a Standardized Data Governance Evaluation System

This article outlines Kuaishou’s approach to establishing a standardized data governance evaluation framework, detailing the challenges of large‑scale data management, the design of assessment metrics across model, quality, and cost dimensions, and the practical strategies and operational mechanisms used to improve data asset health and business value.

Big Data · Kuaishou · evaluation framework

0 likes · 21 min read


Efficient Ops

May 18, 2020 · Artificial Intelligence

How China’s AI Alliance Is Shaping RPA Evaluation Standards

The article outlines the AIIA‑hosted RPA technology salon, details the newly built RPA evaluation framework, explains RPA fundamentals and AI‑driven RPA 4.0 trends, and presents the alliance’s roadmap for standards and testing to boost successful automation deployments.

AI integration · RPA · Standards

0 likes · 6 min read
