Alibaba Cloud Developer
Dec 30, 2025 · Operations
Uncovering the Split‑Lock Chaos that Crashed AMD Servers During Double‑11
A detailed post‑mortem of a high‑priority fault on AMD servers shows how a split‑lock triggered by Python UDFs caused CPI spikes, CPU overload, and bus‑lock events, and explains the investigation steps, code analysis, reproduction tests, and mitigation measures taken to restore stability.
AMDjemallocmemory allocation
0 likes · 23 min read
