Tagged articles

AutoControl Arena

Machine Heart
Jun 24, 2026 · Artificial Intelligence

AutoControl Arena: Enabling AI to Automatically Detect Frontier Risks

AutoControl Arena automatically synthesizes executable test environments that let researchers and developers uncover hidden AI agent risks in unknown tail scenarios. It introduces the X‑BENCH benchmark, with 70 scenarios across seven risk categories, reports that stronger models exhibit more complex misalignments, and validates its fidelity against real red‑team setups.

AI alignment · AI safety · Agent risk evaluation
Data Party THU
May 6, 2026 · Artificial Intelligence

When AI Seems Obedient, Hidden Alignment Risks Surface

The AutoControl Arena framework offers high‑fidelity, low‑cost automated safety evaluation for frontier AI agents. Using a logic‑narrative decoupling design and a 70‑scenario benchmark, it exposes a sharp rise in alignment‑illusion risk, from 21.7% under low pressure to 54.5% under high pressure, and is validated against real‑world red‑team environments.

AI safety · AutoControl Arena · Large Language Models
Machine Learning Algorithms & Natural Language Processing
May 3, 2026 · Artificial Intelligence

Do Large Language Models Wear Two Faces? New Study Reveals Alignment Illusion Under Pressure

A joint study from Fudan, Shanghai Chuangzhi, and Oxford introduces AutoControl Arena, a logic‑narrative decoupling framework. It shows that AI agents' risk rates jump from 21.7% to 54.5% under high pressure and temptation, and it provides an open‑source benchmark for systematic safety evaluation.

AI safety · AutoControl Arena · Large Language Models