Tagged articles

AI benchmark

27 articles · Page 1 of 1

Machine Learning Algorithms & Natural Language Processing

Jun 24, 2026 · Artificial Intelligence

Can Agents Truly Self‑Evolve? GDPevo Benchmark That No Agent Can Cheat

The article introduces GDPevo, the first open‑source benchmark that quantifies self‑evolution in agents by generating 120 real‑world enterprise tasks, using rule‑hybrid question creation and deterministic scoring, and shows that self‑evolving agents improve accuracy by 17‑22% while reducing token consumption.

AI benchmark · Agent evaluation · Continual Learning

0 likes · 12 min read

Architect's Tech Stack

Jun 10, 2026 · Artificial Intelligence

Claude Fable 5 Launch: Double‑Price, Explosive Performance Gains

Claude Fable 5 has launched with token pricing twice that of Opus 4.8, but delivers dramatically higher benchmark scores—80.3% on SWE‑bench Pro, 95.0% on SWE‑bench Verified—and real‑world speedups such as completing a 50 M‑line Ruby migration in a single day.

AI benchmark · Claude · Fable 5

0 likes · 4 min read

Machine Heart

Jun 9, 2026 · Artificial Intelligence

How HRM-Text Achieves 1B‑Parameter, $1K Training Cost and State‑of‑the‑Art Benchmarks

HRM-Text, a 1‑billion‑parameter model trained for under two days on 16 H100 GPUs at a cost of about $1,500, uses a hierarchical recursive architecture, a focused answer‑only loss, and a PrefixLM mask to reach competitive scores on MATH, GSM8K, and ARC‑Challenge, demonstrating an efficient alternative to scaling‑only approaches.

AI benchmark · Efficient Pretraining · HRM-Text

0 likes · 19 min read

SuanNi

Jun 2, 2026 · Artificial Intelligence

Why the Best AI Scores Only 45.9% on JobBench’s ‘Dirty Work’ Benchmark

Washington University’s JobBench benchmark, built on a 1,500‑person Workbank survey and 130 real‑world tasks, measures how well AI agents can handle the chores professionals most want to delegate, revealing that even the strongest model, Claude Opus 4.7 + Claude Code, achieves just 45.9% overall, far below human‑level performance.

AI benchmark · JobBench · LLM evaluation

0 likes · 13 min read

Machine Heart

May 26, 2026 · Artificial Intelligence

Can China’s SkyClaw‑v1.0 Challenge Claude Opus 4.6 with High Performance at Low Cost?

SkyClaw‑v1.0, a domestically released Agent model, delivers benchmark scores that surpass many open‑source rivals and approach top‑tier closed models like Claude Opus 4.6, while offering a dramatically lower price and a frictionless deployment experience for developers.

AI benchmark · Agent · Claude Opus 4.6

0 likes · 12 min read

Old Zhang's AI Learning

May 20, 2026 · Artificial Intelligence

Qwen 3.7‑Max vs Claude 4.7: 7 In‑Depth Tests Reveal a Smooth, Powerful Model

The author evaluates Alibaba’s newly released Qwen 3.7‑Max across seven rigorous tasks—including reading comprehension, HTML fireworks generation, 3D particle visualizations, PDF‑to‑PPT conversion, Excel data analysis, GitHub trending scraping, and complex video generation—showing it often surpasses GPT‑5.5‑level models and rivals Claude 4.7, especially in long‑duration agent tasks.

AI benchmark · Agent · Claude 4.7

0 likes · 9 min read

DataFunTalk

May 19, 2026 · Artificial Intelligence

Qwen 3.7 Max Preview Lands: Rapid Dual‑Model Iteration Keeps China’s Lead in Text and Vision

The Qwen 3.7‑Max and Qwen 3.7‑Plus preview models debut with top‑15 global rankings in Arena, the only Chinese models in text and vision leaderboards, while a timeline analysis shows the Qwen series accelerating from 4‑6‑month releases to a 2‑3‑month cadence and introducing dense and MoE variants up to 235 B parameters.

AI benchmark · Chinese AI · Large Language Model

0 likes · 6 min read

Aikesheng Open Source Community

May 11, 2026 · Artificial Intelligence

SCALE April 2026 Large‑Model SQL Capability Ranking Unveiled

The SCALE April 2026 report adds four new models—DeepSeek‑V4‑Pro, DeepSeek‑V4‑Flash, GPT‑5.5 and Claude Opus 4.7—to its SQL capability leaderboard, evaluates them across SQL understanding, optimization and dialect conversion, and highlights each model’s strengths, weaknesses, and recommended deployment scenarios.

AI benchmark · Dialect Conversion · Large Language Models

0 likes · 17 min read

Machine Heart

May 2, 2026 · Artificial Intelligence

Why GPT‑5.5 and Claude Opus 4.7 Score Below 1% on ARC‑AGI‑3 While Humans Achieve 100%

The ARC‑AGI‑3 benchmark shows that GPT‑5.5 (0.43%) and Claude Opus 4.7 (0.18%) fail to solve any of the 135 novel environments, whereas a six‑year‑old human solves them all, and the analysis attributes the gap to three concrete failure modes and differing compression abilities of the two models.

AI benchmark · ARC-AGI-3 · Claude Opus 4.7

0 likes · 10 min read

Machine Heart

Apr 21, 2026 · Artificial Intelligence

Kimi K2.6 Unveils 300‑Agent Swarm, Ending the Single‑Agent Era

The newly released Kimi K2.6 model expands the Agent Swarm to coordinate up to 300 agents, delivers significant gains in coding speed, long‑context understanding, and benchmark performance that surpasses GPT‑5.4, Claude Opus and Gemini, while showcasing end‑to‑end front‑end generation demos.

AI benchmark · Agent Swarm · Kimi K2.6

0 likes · 9 min read

Machine Heart

Apr 20, 2026 · Artificial Intelligence

Does OpenClaw Remember You? Cambridge Launches ATM‑Bench for Long‑Term Memory

Cambridge's new ATM‑Bench evaluates AI assistants' ability to retrieve personal memories spanning years across multimodal data, revealing that leading agents like OpenClaw, Codex, and Claude Code achieve under 40% accuracy and struggle despite extensive toolchains, highlighting a fundamental long‑term memory challenge.

AI benchmark · ATM-Bench · Claude Code

0 likes · 8 min read

AI Engineering

Apr 1, 2026 · Artificial Intelligence

Holo3 AI Model Beats GPT‑5.4 at One‑Tenth the Cost for Computer Use

H Company’s new Holo3 series delivers a visual language model that outperforms GPT‑5.4 on the OSWorld‑Verified benchmark with a 78.85% score while costing only about one‑tenth as much, offering both a flagship API‑only version and an open‑source lightweight variant optimized for GUI agents.

AI benchmark · GUI Agent · Holo3

0 likes · 4 min read

PaperAgent

Mar 21, 2026 · Artificial Intelligence

Can AI Truly Be Creative? Inside the CreativeBench Benchmark

This article examines the CreativeBench benchmark, which redefines machine creativity by measuring both the quality and novelty of generated solutions, explains its combinatorial and exploratory task designs, details the self‑evolving task construction process, and discusses key findings and the EvoRePE enhancement method.

AI benchmark · EvoRePE · Large Language Models

0 likes · 18 min read

SuanNi

Mar 20, 2026 · Artificial Intelligence

How SkillCraft Shows AI Agents Can Cut Compute Costs by Up to 80%

SkillCraft, a new benchmark from Oxford and partner institutions, evaluates whether AI agents can autonomously combine basic tools into reusable skills, revealing that stronger models dramatically improve task success rates while slashing compute consumption by up to 80%, and exposing the limits of hierarchical skill nesting and cross‑model skill sharing.

AI benchmark · SkillCraft · compute efficiency

0 likes · 15 min read

Alibaba Cloud Developer

Mar 16, 2026 · Artificial Intelligence

HeartBench: Building the First Chinese AI Humanization Benchmark

This article details the creation of HeartBench, a Chinese benchmark for evaluating large language models' emotional and social intelligence, describing its background, design principles, data pipeline, evaluation methods, multi‑stage versioning, blind‑test validation, and lessons for building transferable AI assessment frameworks.

AI benchmark · Emotion AI · Evaluation

0 likes · 25 min read

Old Zhang's AI Learning

Jan 27, 2026 · Artificial Intelligence

Qwen3‑Max‑Thinking Boosts Performance with Test‑Time Scaling—Why It Still Isn’t Open‑Source

Alibaba’s new Qwen3‑Max‑Thinking model adds inference‑time scaling and adaptive tool use, delivering large gains on math, coding, and agent benchmarks while remaining closed‑source, and it offers drop‑in OpenAI‑compatible API access at the cost of higher latency and token usage.

AI benchmark · Adaptive Tool Use · Large Language Model

0 likes · 7 min read

PaperAgent

Dec 23, 2025 · Artificial Intelligence

CATArena: A Competitive Benchmark That Turns Agent Scoring into Evolutionary Learning

CATArena introduces a tournament‑style evaluation framework where AI agents iteratively code, compete, and improve across classic board games, using three‑dimensional quantitative scores to measure strategy programming, global learning, and generalization, and reveals how different LLM‑based agents learn and adapt over multiple rounds.

AI benchmark · Agent evaluation · CATArena

0 likes · 8 min read

Wuming AI

Sep 6, 2025 · Artificial Intelligence

Can Qwen3-Max-Preview Outperform Claude? A Deep Dive into China’s New 1‑T LLM

The article reviews Alibaba's 1‑trillion‑parameter Qwen3‑Max‑Preview model, comparing its benchmark scores, hallucination rate, math and coding accuracy, and SVG generation quality against Claude, Kimi K2, and DeepSeek, while providing usage links and real‑world user impressions.

AI benchmark · Large Language Model · Qwen3

0 likes · 4 min read

Programmer DD

Apr 29, 2025 · Artificial Intelligence

Why Qwen3 Is Redefining Open‑Source LLMs: Mixed‑Inference Power and Unmatched Performance

Qwen3, Alibaba’s latest open‑source large language model, introduces a pioneering mixed‑inference architecture that blends top‑tier reasoning and non‑reasoning capabilities, delivering record‑breaking benchmark scores, multilingual support for 119 languages, cost‑effective deployment, and a 128K context window, now accessible via Ollama and OpenRouter.

AI benchmark · Large Language Model · Qwen3

0 likes · 5 min read

Baidu Geek Talk

Apr 16, 2025 · Industry Insights

What Do the Latest AIIA FactTesting Benchmarks Reveal About China’s Large Language Models?

At the AIIA’s 14th plenary meeting in Nanjing, the FactTesting benchmark released its Q1 2025 results, evaluating over 200 large models and highlighting Baidu’s Wenxin 4.5 and Wenxin X1 as leaders in basic and reasoning capabilities, while outlining the expanded multimodal and agent testing roadmap for the year.

AI benchmark · China AI · FactTesting

0 likes · 5 min read

Top Architect

Mar 9, 2025 · Artificial Intelligence

Alibaba Unveils Qwen QwQ-32B: A Compact Open‑Source LLM Rivaling DeepSeek

Alibaba has released the open‑source Qwen QwQ‑32B model, a 32‑billion‑parameter LLM that matches DeepSeek‑R1's performance while being deployable on consumer‑grade GPUs, and the announcement is accompanied by extensive promotional offers for AI‑related products and services.

AI benchmark · Alibaba · Large Language Model

0 likes · 7 min read

Java Tech Enthusiast

Mar 8, 2025 · Artificial Intelligence

QwQ-32B Large Language Model Overview and Performance

Alibaba’s new QwQ‑32B large‑language model, with 32 billion parameters, delivers performance comparable to or surpassing the 671‑billion‑parameter DeepSeek‑R1 across math, coding, and general benchmarks, and is available via HuggingFace, ModelScope, and a DashScope API demo with example Python code.

AI benchmark · Large Language Model · Python API

0 likes · 5 min read

Alibaba Cloud Big Data AI Platform

Mar 7, 2025 · Artificial Intelligence

How QwQ-32B Outperforms OpenAI o1-mini and Deploys in One Click on Alibaba Cloud

Alibaba Cloud's newly released QwQ-32B model delivers benchmark‑level performance rivaling top open‑source LLMs, integrates agent capabilities, and can be deployed with a single click through the PAI‑Model Gallery, offering a cost‑effective solution for developers seeking advanced AI inference.

AI benchmark · Alibaba Cloud · LLM

0 likes · 5 min read

AI Algorithm Path

Feb 22, 2025 · Artificial Intelligence

Elon Musk Unveils Grok 3, Claiming the World’s Most Powerful AI Model

The article details the launch of Grok 3 by Elon Musk’s xAI, highlighting its massive GPU infrastructure, benchmark dominance over GPT‑4o, multiple model variants, pricing for Premium+ users, upcoming API and voice features, and the team’s plan to open‑source Grok 2 once the new model stabilises.

AI benchmark · AI pricing · Elon Musk

0 likes · 6 min read

Architects' Tech Alliance

Feb 10, 2025 · Industry Insights

What Makes DeepSeek’s New V3 Model Rival GPT‑4o? A Deep Dive into Large‑Scale AI

This article explains what defines a large AI model, compares parameter scales of GPT‑3, GPT‑4 and M6, and analyzes DeepSeek’s recent releases—V3, R1, and Janus‑Pro—highlighting their benchmark performance, reinforcement‑learning techniques, and cost efficiency versus leading proprietary models.

AI benchmark · DeepSeek · Model Scaling

0 likes · 5 min read

JavaEdge

Dec 1, 2024 · Artificial Intelligence

Exploring the Limits and Benchmarks of Qwen’s QwQ‑32B‑Preview AI Model

QwQ‑32B‑Preview, an experimental AI model from the Qwen team, showcases strong reasoning in math and programming while facing challenges like language switching, inference loops, safety concerns, and variable capabilities across domains, with benchmark scores ranging from 50% to over 90% on tests such as GPQA, AIME, MATH‑500, and LiveCodeBench.

AI benchmark · LLM · Machine Learning

0 likes · 7 min read

Java Tech Enthusiast

Jul 12, 2024 · Artificial Intelligence

Why Alibaba’s Qwen‑2 Is Outperforming Global LLMs and What It Means for AI

After OpenAI halted API access in China, Alibaba’s Tongyi Qwen‑2 quickly rose to the top of global open‑source LLM leaderboards, surpassing Meta’s Llama‑3 and other contenders, with detailed benchmark scores, performance gains over previous versions, and implications for China’s AI ecosystem.

AI benchmark · Alibaba · China AI

0 likes · 5 min read
