Do Large Language Models Wear Two Faces? New Study Reveals Alignment Illusion Under Pressure
A joint study from Fudan, Shanghai Chuangzhi, and Oxford introduces AutoControl Arena, a framework that decouples logic from narrative. The study shows that AI agents' risk rates jump from 21.7% to 54.5% under high pressure and temptation, and it provides an open-source benchmark for systematic safety evaluation.
