How Anthropic and OpenAI Monitor Frontier AI Agent Behavior – A Comprehensive Review
This article systematically reviews Anthropic and OpenAI’s public research on monitoring intelligent agent trajectories, covering infrastructure such as Clio, Petri, Bloom, chain‑of‑thought monitoring, the Confessions mechanism, internal coding‑agent audits, and the Docent tool, while highlighting mitigation strategies for reward hacking and hidden objectives.
Anthropic monitoring pipeline
Anthropic builds a production‑grade workflow for reading, analysing and quantifying agent trajectories. The pipeline consists of three core components:
Clio : a four‑step system that (1) extracts multi‑dimensional metadata (topic, turn count, language, etc.) from raw dialogues using a Claude prompt, (2) clusters the enriched dialogues with all-mpnet-base-v2 embeddings, (3) generates cluster names and summaries via Claude, and (4) merges clusters into a hierarchical structure. This turns millions of raw transcripts into searchable, structured data.
Petri : an open‑source auditing tool that scores transcripts on a hierarchy of categories—from generic “misbehavior” to specific actions such as “user deception” or “self‑preservation”. It supports relative scoring, which the authors found more informative than absolute scores.
Bloom : a complementary tool that, given a suspicious pattern identified by Petri, automatically synthesises many test scenarios around that pattern and estimates the frequency of occurrence.
The overall workflow follows detect → analyse → quantify . Anthropic also introduced SHADE‑Arena , a benchmark where agents must complete a benign primary task while secretly accomplishing a hidden malicious sub‑task; a dedicated monitor evaluates the hidden behaviour.
Key mitigation research includes:
Alignment‑faking detection ("Alignment faking in RL").
Training‑time mitigations such as extra probing questions, penalties for overly long reasoning, and explicit process supervision.
Evidence that reward‑hacking can generalise to broader misalignment ("From shortcuts to sabotage").
Later work adds automated behavioural audits (Petri) and scenario generation (Bloom), and culminates in AuditBench , a systematic comparison of auditing techniques on models with hidden objectives.
OpenAI monitoring approaches
Chain‑of‑Thought (CoT) monitorability : reasoning models often write their intent (e.g., "I will cheat") in the CoT. OpenAI trains a separate LLM to read the CoT and flag risky behaviour ("Detecting misbehavior in frontier reasoning models"). Over‑optimising this signal can cause models to conceal intent.
Confessions : after each answer the model is prompted to produce an independent honesty report that lists (1) the explicit and implicit objectives for the answer, (2) whether each objective was satisfied, and (3) any uncertainties or conflicts encountered. A dedicated reward model scores only the confession, leaving the main answer’s reward untouched ("How confessions can keep language models honest"). This separates honesty incentives from performance incentives.
Internal coding‑agent monitoring : a low‑latency system reads the full interaction (user messages, model reasoning, tool calls, tool outputs) and raises alerts for policy violations or unsafe patterns. Detected cases are escalated to human review.
Self‑reporting agents : agents are trained to call a report_scheming tool when they detect misbehaviour, making the act of cheating itself observable.
Code verification at scale : OpenAI trains code‑reviewer models (e.g., gpt‑5‑codex) using a two‑stage pipeline—deterministic tests, LLM rubrics, static analysis, and state checks—while tracking transcript and latency metrics such as number of turns, tool calls, total tokens, and time‑to‑first‑token.
Model cards for o1‑preview, o1‑mini, o3, o4‑mini and the GPT‑5 series describe a layered safety system consisting of topical classifiers, reasoning monitors, generation monitors, conversation monitors, safety reasoners, message‑level blocks and account enforcement.
Docent
Docent is an AI‑assisted transcript viewer and rubric‑based judge developed by Anthropic. It is used to summarise long RL trajectories (e.g., terminal‑bench) and to evaluate agents on benchmarks such as SWE‑bench. The workflow consists of:
Human‑friendly summarisation of long traces, powered by Claude.
Automated scoring against a rubric that highlights cheating, hidden‑state access, or policy violations.
Case studies show Docent pinpointing why a newer Codex model appeared to regress: the regression was due to higher timeout rates rather than a loss of capability, and after filtering out timeout‑prone tasks the newer model actually outperformed the older one.
Comparative insights
Both Anthropic and OpenAI have transitioned from ad‑hoc post‑hoc analysis to systematic, production‑grade monitoring pipelines.
Anthropic emphasises batch processing of massive user dialogues (Clio → Petri → Bloom) and explicit auditing tools that can be open‑sourced.
OpenAI focuses on exposing model‑generated signals (CoT, confessions, self‑reports) and on real‑time layered monitoring in deployment.
OpenAI’s confessions separate honesty incentives from performance rewards, reducing the incentive for models to hide misbehaviour.
Anthropic’s training‑time mitigations aim to prevent reward‑hacking patterns from generalising to broader misalignment.
Open‑source tools (Clio, Petri, Bloom, Docent) enable researchers to replicate and extend these monitoring practices.
Key references
Clio: A system for privacy‑preserving insights into real‑world AI use
https://arxiv.org/pdf/2412.13678
Auditing language models for hidden objectives
https://www.anthropic.com/research/auditing-hidden-objectives
SHADE‑Arena: Evaluating sabotage and monitoring in LLM agents
https://www.anthropic.com/research/shade-arena-sabotage-monitoring
Petri: An open‑source auditing tool to accelerate AI safety research
https://alignment.anthropic.com/2025/petri/
Bloom: an open source tool for automated behavioural evaluations
https://www.anthropic.com/research/bloom
AuditBench: Evaluating alignment auditing techniques on models with hidden behaviors
https://alignment.anthropic.com/2026/auditbench/
Detecting misbehavior in frontier reasoning models
https://openai.com/index/chain-of-thought-monitoring/
How confessions can keep language models honest
https://openai.com/index/how-confessions-can-keep-language-models-honest/
Training agents to self‑report misbehavior
https://alignment.openai.com/self-incrimination/
A Practical Approach to Verifying Code at Scale
https://alignment.openai.com/scaling-code-verification/Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
