Frontier AI Models in RL Environments Reveal an Agent Capability Hierarchy
The article evaluates nine cutting‑edge AI models on 150 simulated workplace tasks, showing that even the strongest models complete fewer than 40% of tasks, and uses these results to propose a hierarchical framework of agentic capabilities ranging from tool use to common‑sense reasoning.
Overview
We placed nine state‑of‑the‑art AI models in a reinforcement‑learning (RL) environment that mimics a real‑world workplace and assigned them 150 diverse tasks. Most models struggle with these tasks, and even the best (GPT‑5 and Claude Sonnet 4.5) succeed on fewer than 40% of them, exposing a persistent lack of common‑sense reasoning.
Building an RL Environment
According to the article, a useful RL environment requires three components:
A coherent world model: defines the overall structure and rules of the simulated world.
A set of entities: the objects in the world and the relationships between them.
A tool system: the interface through which agents interact with those entities.
These environments must be grounded in real employee experience rather than abstract simulations, and they evolve organically over time.
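The three components above can be sketched as a minimal data structure. This is a hypothetical illustration, not the article's actual implementation; all class and tool names are invented, and the key point is that the tool system is the agent's only interface to the entities.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Entity:
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Environment:
    world_model: dict                                   # background structure and rules
    entities: dict[str, Entity] = field(default_factory=dict)
    tools: dict[str, Callable] = field(default_factory=dict)

    def call_tool(self, tool_name: str, **kwargs):
        # The tool system is the only interface agents get to the entities.
        if tool_name not in self.tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self.tools[tool_name](self.entities, **kwargs)

# Illustrative setup: a customer-support world with one entity and one tool.
env = Environment(world_model={"domain": "customer support"})
env.entities["cust-001"] = Entity("cust-001", {"tier": "gold"})
env.tools["get_customer"] = lambda ents, customer_id: ents[customer_id].attributes

print(env.call_tool("get_customer", customer_id="cust-001"))  # {'tier': 'gold'}
```

Grounding the environment in real employee experience would then mean populating the entities and tools from actual workplace data rather than hand-written fixtures.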
Agent Capability Hierarchy
The authors introduce a “Hierarchy of Agentic Capabilities” that positions models on a pyramid from low‑level tool use up to high‑level common‑sense reasoning. The levels are:
Tool use, goal formulation and basic planning.
Adaptability and groundedness.
Common‑sense reasoning.
Images illustrate the pyramid and each model’s current position (e.g., GPT‑5 and Claude Sonnet 4.5 sit near the top but still exhibit frequent errors).
Step 1 – Basic Tool Use, Planning and Goal Setting
Success at this level requires four abilities:
Decompose a multi‑step task into sub‑goals.
Identify the appropriate tool for each sub‑goal and the correct order.
Map available information to the correct tool parameters.
Execute the plan step‑by‑step without deviating or omitting details.
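The four abilities can be sketched as a plan of ordered sub-goals, each mapped to a tool and a parameter-binding rule. The tools and task here are invented for illustration, not taken from the article's benchmark; what matters is that parameters are derived from accumulated context and no step is skipped.

```python
# Hypothetical tools for a support-desk task (names are illustrative).
def lookup_ticket(ticket_id):
    return {"ticket_id": ticket_id, "customer_id": "C-7"}

def fetch_customer(customer_id):
    return {"customer_id": customer_id, "tier": "platinum"}

# Decomposition: ordered sub-goals, each naming its tool and how to map
# information gathered so far onto that tool's parameters.
PLAN = [
    ("lookup_ticket", lookup_ticket, lambda ctx: {"ticket_id": ctx["ticket_id"]}),
    ("fetch_customer", fetch_customer, lambda ctx: {"customer_id": ctx["customer_id"]}),
]

def execute(plan, context):
    for name, tool, bind_params in plan:       # execute step by step, no skipping
        context.update(tool(**bind_params(context)))
    return context

result = execute(PLAN, {"ticket_id": "T-1"})
print(result["tier"])  # platinum
```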
Models such as GPT‑4o, Mistral Medium, and Nova Pro struggle with these abilities, producing outcomes little better than chance.
Example: Finding High‑Priority Gold/Platinum Customers
All three models failed to map the “gold” tier label to an actual customer ID, violating the tool’s MCP schema, which expects a customer‑ID string for customer_id. Nova Pro’s output shows the literal string “gold” being passed as an ID.
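A schema check of the kind this failure violates can be sketched as follows. The ID pattern is invented for illustration and is not the actual MCP schema from the article; the point is that a tier label like “gold” is the wrong type of value for a customer_id parameter.

```python
import re

# Hypothetical customer-ID format, e.g. "C-123". Purely illustrative.
CUSTOMER_ID_PATTERN = re.compile(r"^C-\d+$")

def validate_customer_id(value):
    """Reject anything that is not a well-formed customer-ID string."""
    if not isinstance(value, str) or not CUSTOMER_ID_PATTERN.match(value):
        raise ValueError(f"expected a customer ID like 'C-123', got {value!r}")
    return value

validate_customer_id("C-123")        # accepted
try:
    validate_customer_id("gold")     # the Nova Pro-style mistake: a tier label
except ValueError as e:
    print(e)
```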
Example: Product Recall Query
The task required a three‑step workflow (search for the product ID, search orders, return affected customers). Nova Pro and Mistral Medium jumped directly to step 2 and passed a product name to the product_id parameter, violating its expected input type.
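The correct three-step workflow can be sketched as below. The tool names, product name, and stub data are hypothetical; the essential move is resolving the product name to an ID first, so the ID (not the name) is what reaches the product_id parameter.

```python
# Stub tools with illustrative data standing in for the environment's backend.
def search_product(name):
    return {"product_id": "P-42"} if name == "WidgetPro" else None

def search_orders(product_id):
    return [{"order_id": "O-1", "customer_id": "C-9"}] if product_id == "P-42" else []

def recall_workflow(product_name):
    # Step 1: resolve the product name to an ID before anything else.
    product = search_product(product_name)
    if product is None:
        return []
    # Step 2: pass the ID, not the name, to the order search.
    orders = search_orders(product_id=product["product_id"])
    # Step 3: return the affected customers.
    return [o["customer_id"] for o in orders]

print(recall_workflow("WidgetPro"))  # ['C-9']
```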
Adaptability – Updating Plans When Reality Differs
Even when a model can devise a plan, real‑world execution often deviates. Models must detect failures, reinterpret ambiguous tool documentation, and revise the plan.
Gemini 2.5 and Qwen 3 frequently continue a failed sequence without adjustment. Claude Sonnet 4.5, by contrast, retries with alternative search parameters, demonstrating human‑like adaptability.
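The retry behavior attributed to Claude Sonnet 4.5 can be sketched as a loop that detects an empty result and revises the query instead of pressing on. The search function, query strings, and fallback list are all hypothetical.

```python
def search_orders(query):
    # Simulated backend: only matches the exact stored product name.
    return ["O-77"] if query == "Widget Pro 2" else []

def adaptive_search(initial_query, fallbacks):
    """Try the initial query; on failure, revise with alternative parameters."""
    results = search_orders(initial_query)
    if results:
        return results
    for alt in fallbacks:            # detect the failure and adjust the plan
        results = search_orders(alt)
        if results:
            return results
    return []                        # every variant failed; report honestly

print(adaptive_search("WidgetPro2", ["Widget Pro 2", "widget-pro"]))  # ['O-77']
```

A model that continues a failed sequence without adjustment corresponds to calling search_orders once and treating the empty result as final.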
Groundedness – Staying Connected to the Environment
Models must avoid hallucinations and maintain factual consistency. Kimi K2 Turbo, despite strong planning, repeatedly mis‑dates orders (searching 2024 instead of 2025) and later rewrites the date in its final answer.
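A simple groundedness check for the date-drift failure described above might look like this: derive the search year from the environment's current date rather than a memorized default, then verify the reported year against what the tool actually returned. The data and date are illustrative.

```python
from datetime import date

# Illustrative order store; the environment's "true" data.
ORDERS = {"O-5": {"placed": date(2025, 3, 14)}}

def search_orders_by_year(year):
    return [oid for oid, o in ORDERS.items() if o["placed"].year == year]

today = date(2025, 6, 1)                  # fixed environment date for the sketch
hits = search_orders_by_year(today.year)  # ground the query in the environment,
                                          # not in a memorized year like 2024

# Groundedness check: the year in the final answer must match the tool output.
reported_year = ORDERS[hits[0]]["placed"].year
assert reported_year == today.year
print("answer grounded in", reported_year)
```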
Common‑Sense Reasoning – The Final Frontier
Even when tool use, planning, and adaptability are reliable, models still falter on pure common‑sense inference. The article cites GPT‑5’s failure to re‑classify a support ticket as a “return” despite clear textual clues, and its inability to infer that a customer who mentions a refund is likely requesting a return.
Another GPT‑5 example shows it missing the link between a “gaming” context and product relevance, leading to an inefficient exhaustive search over a month’s orders.
Conclusions
The hierarchy demonstrates that mastering low‑level capabilities (tool use, planning) is necessary but not sufficient for human‑level agents. The biggest gap lies in common‑sense reasoning, where even the most advanced models fall short. The authors argue that bridging this gap will define the next stage of AI development, but the timeline remains uncertain.
AI Tech Publishing