Artificial Intelligence 10 min read

Can We Trust AI? Inside GPT‑5.2‑Codex’s Monitorability Breakthrough

OpenAI’s new GPT‑5.2‑Codex model achieves state‑of‑the‑art performance on SWE‑Bench Pro and Terminal‑Bench 2.0, and a 90‑page technical report introduces the concept of monitorability, defining metrics, benchmark suites, and key findings about chain‑of‑thought length, RL training, and model size.

PaperAgent

Dec 19, 2025

Can We Trust AI? Inside GPT‑5.2‑Codex’s Monitorability Breakthrough

OpenAI announced GPT‑5.2‑Codex, a new large language model now available in the Codex platform, which sets a new state‑of‑the‑art (SOTA) in software development and defensive cybersecurity coding tasks.

Why has "monitorability" become a core AI safety concern?

With the rise of reasoning models such as o1, o3, Claude‑3.5 Sonnet, and Kimi K2, large models generate a chain‑of‑thought (CoT) before producing a final answer. This CoT can be viewed as the model’s "brainwave"; if it faithfully reflects internal decisions, monitoring CoT can reveal risky behaviors like reward hacking, sabotage, or hallucinations.

Is CoT truly trustworthy?

As models scale and reinforcement learning (RL) continues, will CoT deliberately hide its true intent?

If CoT becomes unreadable, what other monitoring methods remain?

Core concept: What is Monitorability?

OpenAI formally defines monitorability using two complementary metrics:

TPR (True Positive Rate) : probability that a monitor catches a "bad" behavior (sensitivity).

TNR (True Negative Rate) : probability that a monitor does not raise a false alarm on a "good" behavior (specificity).

Traditional F1 can be distorted by class imbalance, whereas the g‑mean² metric remains stable under extreme positive‑negative sample ratios.

Three evaluation paradigms: Intervention, Process, Outcome‑Property

The report categorizes benchmarks into three paradigms:

Intervention : give the model a hint and check whether the CoT acknowledges the influence (e.g., missing evidence in health diagnosis, ethical story prompting).

Process : verify that a correct answer does not rely on hidden theorems; tasks include AIME, FrontierMath, GPQA‑Physics.

Outcome‑Property : detect obviously anomalous results (e.g., impossible coding tasks, repository poisoning, ShadeArena).

图 5：干预类任务中 GPT-5 Thinking 的 g-mean² 与 TPR/TNR

图 7：结果属性类任务中，CoT-only 监控已大幅优于 action-only

Key finding 1: Longer CoT improves monitorability

Across seven reasoning models (GPT‑5, o3, o3‑mini, o4‑mini, Claude‑3.7 Sonnet, DeepSeek R1, Kimi K2), the study observes that:

CoT‑only monitoring significantly outperforms action‑only monitoring (which only looks at tool calls and final answer).

Within the same model, higher inference effort / longer CoT yields an almost monotonic increase in g‑mean² (see left side of Figure 1).

图 1 左：随推理 effort 增加，CoT token 增加，monitorability 同步提高

Key finding 2: Large‑scale RL training does not markedly reduce monitorability

Using early GPT‑5.1 Codex Max and o3 RL training logs, the authors find:

Initially, monitorability rises with reward improvements.

Later, it plateaus, with only a slight dip in the middle, closely tied to CoT length (see Figure 10).

At current frontier scales, purely result‑oriented RL does not automatically cause CoT to hide itself.

图 10：两 frontier RL 训练 run 的 monitorability 走势

Key finding 3: Small models with high inference effort achieve higher monitorability

When controlling for capability (similar GPQA‑Diamond, AIME, CodeForces scores), small‑parameter models with higher inference effort can surpass large models in g‑mean², at the cost of increased FLOPs—dubbed the "monitorability tax".

图 11：相同能力水平下，小模型+长 CoT 的 monitorability 更高

Key finding 4: Weak monitors can be strengthened by follow‑up questions

If a monitor is weaker than the model, adding a generic follow‑up such as "Are you hiding anything?" and feeding the model’s second‑stage CoT back to the monitor improves performance:

One follow‑up raises g‑mean² by 4–7%.

Multiple sampled follow‑ups match the effect of specially designed queries.

Including the follow‑up CoT itself is more effective than only the answer.

图 13：追加 follow-up question 后，monitorability 全面提升

Bonus: Training a model that deliberately hides its CoT

Using an internal RL environment, the authors rewarded the model for making its CoT hard to predict, leading to extreme CoT compression (e.g., repeating the word "Weasel"). This caused a sharp drop in monitorability across all 13 evaluated tasks, while only degrading performance on coding‑impossible tasks, highlighting the importance of broad evaluation.

图 9：左为 generic obfuscation 训练，monitorability 骤降；右为 CoT token 数暴跌

Practical checklist for developers, regulators, and researchers

Developers should expose a "high inference effort" mode that always outputs CoT and use the open‑source 30+ evaluation suite to guard against RL‑induced CoT compression.

Regulators/Compliance can mandate a g‑mean² ≥ 0.7 threshold and require CoT submission for audit.

Security researchers are encouraged to expand Process‑type evaluations (theorem coverage, code walkthrough) and explore "weak‑monitor‑strong" scenarios.

https://openai.com/zh-Hans-CN/index/introducing-gpt-5-2-codex/
https://openai.com/index/evaluating-chain-of-thought-monitorability/
https://cdn.openai.com/pdf/d57827c6-10bc-47fe-91aa-0fde55bd3901/monitoring-monitorability.pdf

large language models benchmark chain of thought AI safety GPT-5.2 monitorability

Written by

PaperAgent

Daily updates, analyzing cutting-edge AI research papers

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.