Tagged articles

agent performance

4 articles · Page 1 of 1

Jun 2, 2026 · Artificial Intelligence

Why the Best AI Scores Only 45.9% on JobBench’s ‘Dirty Work’ Benchmark

Washington University’s JobBench benchmark, built on a 1,500‑person Workbank survey and 130 real‑world tasks, measures how well AI agents handle the chores professionals most want to delegate. Even the strongest configuration, Claude Opus 4.7 + Claude Code, achieves just 45.9% overall, far below human‑level performance.

AI Benchmark · JobBench · LLM evaluation

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

May 15, 2026 · Artificial Intelligence

ClawMark: A Living‑World Benchmark for Multi‑Turn, Multi‑Day, Multimodal Coworker Agents

The ClawMark benchmark introduces 100 multi‑turn, multi‑day tasks across 13 professional scenarios and five stateful sandbox services. Across the seven cutting‑edge agent systems evaluated, the top weighted score is 75.8 but the strict success rate is only 20%, highlighting how difficult end‑to‑end collaborative work remains for agents.

LLM · agent performance · benchmark

0 likes · 4 min read

Baidu Geek Talk

Apr 22, 2026 · Artificial Intelligence

How to Quantify AI Skill Quality with an 8‑Dimension Evaluation Framework

This article introduces an eight‑dimensional, weighted scoring system for evaluating AI Skills, explains each metric, demonstrates the framework on real‑world Skills, compares similar Skills, and shows how multi‑model cross‑validation and four execution strategies improve assessment reliability.

AI skill evaluation · Metadata Quality · agent performance

0 likes · 15 min read

Machine Learning Algorithms & Natural Language Processing

Mar 19, 2026 · Artificial Intelligence

From Language Modeling to World Modeling: Limits of Large Language Models

Li Yixia of Southern University of Science and Technology presents a talk on using large language models as textual world models. The talk defines a three‑layer evaluation framework and shows experimentally that fine‑tuned models improve next‑state prediction and agent performance, yet face limits tied to behavior coverage and environment complexity.

Reinforcement Learning · agent performance · evaluation framework

0 likes · 4 min read
