Uncovering the Split‑Lock Chaos that Crashed AMD Servers During Double‑11
A detailed post-mortem of a high-priority fault on AMD servers shows how split locks triggered by Python UDF jobs caused bus-lock events, CPI spikes, and CPU overload, and walks through the investigation steps, code analysis, reproduction tests, and mitigation measures taken to restore stability.
Problem Discovery
At the end of August, AMD‑based servers in the Alibaba group began showing abnormal CPI spikes: the cycles‑per‑instruction metric jumped from below 1 to values of 3‑4, causing CPU utilization of online containers to increase three‑to‑fourfold and degrading both online and offline workloads during the Double‑11 promotion period.
Phenomenon Observation
Monitoring revealed a universal rise in CPI and CPU usage across all online pods, while offline Kata containers were throttled by the elevated online load. Per-core graphs showed every core's CPI surging simultaneously, and the issue was isolated to the newest generation of AMD hardware.
Root‑Cause Investigation
Top-down microarchitecture analysis pointed at the front end: the instruction-fetch (L1I) miss rate was extremely high and instruction dispatch was stalled, so cores executed slowly while L3 and memory accesses actually dropped. Counting bus-lock events with perf stat -e ls_locks.bus_lock showed locked operations escalating to bus locks, and bpftrace on the futex system call revealed that the problematic thread was passing a bogus address (0xffffffff), confirming a split-lock scenario.
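Given an address captured this way, a quick sanity check is whether a locked 4-byte access at that address would straddle a 64-byte cache line, which is the precondition for a split lock. A minimal sketch of that check (the line size, operand width, and sample address are illustrative assumptions):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* True if an access of `size` bytes at `addr` crosses a 64-byte cache line --
   the condition under which a LOCK-prefixed instruction degenerates into a
   bus lock (split lock). */
static bool crosses_cache_line(uintptr_t addr, size_t size) {
    const uintptr_t line = 64;
    return (addr / line) != ((addr + size - 1) / line);
}

int main(void) {
    uintptr_t uaddr = 0x7f3a2c00103eULL;   /* example address from tracing */
    printf("split-lock candidate: %s\n",
           crosses_cache_line(uaddr, sizeof(uint32_t)) ? "yes" : "no");
    return 0;
}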
Code Path Analysis
The problematic thread entered __lll_lock_wait_private, which ultimately called the futex system call. The assembly of this function was re‑implemented in C to illustrate its logic:
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

void __lll_lock_wait_private(int *lock, int val) {
    int expected = 2;   /* contended state */
    if (val == expected) {
        /* Lock already marked contended: sleep until it is released. */
        syscall(SYS_futex, lock, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, expected, NULL, NULL, 0);
    }
    /* Atomic CAS on *lock: when the lock address straddles a cache line,
       this locked instruction is the one that escalates into a bus lock. */
    while (__sync_val_compare_and_swap(lock, 0, expected) != 0) {
        syscall(SYS_futex, lock, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, expected, NULL, NULL, 0);
    }
}

Further inspection of the glibc __libc_free path showed that the allocator derives an arena pointer for the freed chunk (arena_for_chunk) from metadata adjacent to the freed address. Because that metadata was corrupt, the derived pointer became 0xffffffff, and the lock subsequently taken on it fed the split lock.
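For context, this is roughly how glibc's free path maps a chunk back to its owning arena before locking it. The sketch below paraphrases the arena_for_chunk / heap_for_ptr logic; the structure layouts, constants, and names are simplified approximations, not the literal glibc source:

#include <stddef.h>
#include <stdio.h>

#define HEAP_MAX_SIZE   (64UL * 1024 * 1024)  /* non-main heaps are aligned to this */
#define NON_MAIN_ARENA  0x4UL                 /* flag bit kept in a chunk's size field */

typedef struct malloc_state { int mutex; /* ... */ } mstate;
typedef struct heap_info    { mstate *ar_ptr; /* ... */ } heap_info;
typedef struct malloc_chunk { size_t prev_size; size_t size; } malloc_chunk;

static mstate main_arena;

/* Chunks from non-main arenas live in mmapped heaps aligned to HEAP_MAX_SIZE,
   so glibc finds the heap header simply by masking the chunk address... */
static heap_info *heap_for_ptr(void *ptr) {
    return (heap_info *)((unsigned long)ptr & ~(HEAP_MAX_SIZE - 1));
}

/* ...and reads the arena pointer out of that header. For a pointer glibc never
   allocated, the size field is foreign data: if the NON_MAIN_ARENA bit happens
   to be set, ar_ptr is fetched from an unrelated page, and __libc_free then
   locks the arena mutex at a bogus, possibly misaligned, address. */
static mstate *arena_for_chunk(malloc_chunk *p) {
    return (p->size & NON_MAIN_ARENA) ? heap_for_ptr(p)->ar_ptr : &main_arena;
}

int main(void) {
    malloc_chunk foreign = { 0, NON_MAIN_ARENA };  /* fake chunk with the flag set */
    /* Only show where glibc would look; dereferencing it, as free() would,
       is what produced the garbage arena pointer in the dump. */
    printf("heap header would be read at %p\n", (void *)heap_for_ptr(&foreign));

    malloc_chunk ordinary = { 0, 0 };              /* flag clear: main arena */
    printf("arena for ordinary chunk: %p (main_arena at %p)\n",
           (void *)arena_for_chunk(&ordinary), (void *)&main_arena);
    return 0;
}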
Python UDF Involvement
GDB analysis of the core dump revealed that the crash occurred inside the Python interpreter while freeing a PyListObject. The list’s 17th element was a PyString allocated via malloc, later freed by glibc’s free. Because the process mixed jemalloc (used by C++ code) and glibc’s ptmalloc (used by Python), the free operation crossed allocator boundaries.
static void list_dealloc(PyListObject *op) {
    Py_ssize_t i;
    PyObject_GC_UnTrack(op);
    Py_TRASHCAN_SAFE_BEGIN(op);
    if (op->ob_item != NULL) {
        i = Py_SIZE(op);
        while (--i >= 0) {
            Py_XDECREF(op->ob_item[i]); // crash here
        }
        PyMem_FREE(op->ob_item);
    }
    Py_TRASHCAN_SAFE_END(op);
}
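To make the cross-allocator free concrete: if the bytes being freed were originally handed out by jemalloc, glibc's free has no valid chunk header to read. A hypothetical minimal example (assuming jemalloc built with --with-jemalloc-prefix=je_ so both allocators are visible by name in one binary; in the real job the mixing happened implicitly through the linked C++ libraries and the Python runtime):

#include <stdlib.h>

/* Provided by jemalloc when it is built with --with-jemalloc-prefix=je_. */
void *je_malloc(size_t size);

int main(void) {
    char *p = je_malloc(64);  /* allocated by jemalloc */
    free(p);                  /* released through glibc ptmalloc: undefined behavior --
                                 glibc interprets jemalloc's bookkeeping as its own
                                 chunk header and can derive a garbage arena pointer */
    return 0;
}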
Split-Lock Mechanics
A locked (atomic) operation whose operand straddles two cache lines cannot be handled by ordinary cache-line locking, so the CPU falls back to asserting a bus lock; Intel calls this a split lock. While the bus lock is held, memory access is serialized for every core, so a single misbehaving thread can slow the whole machine. At the time, the Linux kernel's split-lock detection covered only Intel CPUs; with no detection on the AMD fleet, the problem surfaced purely as severe, machine-wide performance degradation.
Reproduction
A minimal test program allocates a cache-line-aligned buffer, forms an int pointer that straddles the 64-byte boundary, and calls __lll_lock_wait_private on it. On AMD hardware the whole system's CPI rose to roughly 4, while on Intel CPUs only the affected logical core's CPI increased.
#include <stdlib.h>

/* Matches the two-argument re-implementation shown earlier; link that snippet
   in when building. The call below never returns: it spins issuing
   split-locked CAS operations on the misaligned address. */
void __lll_lock_wait_private(int *lock, int val);

int main() {
    char *a = aligned_alloc(64, 128);   /* cache-line-aligned buffer */
    int *x = (int *)(a + 62);           /* 4-byte int straddles the 64-byte boundary */
    *x = 2;                             /* mark the "lock" as contended */
    __lll_lock_wait_private(x, *x);
    return 0;
}
Mitigation and Future Work
The ODPS team enabled isolation mode for the problematic jobs and forced all of their processes onto a single allocator (tcmalloc), eliminating the jemalloc/ptmalloc mix. After roughly half a month of gray rollout the issue largely disappeared. Further measures include avoiding malloc_trim so that __free_hook stays intact, and tracking the AMD kernel patch that adds split-lock detection.
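On the __free_hook point: glibc (before 2.34, which removed the malloc hooks) lets a process interpose on every free through this hook, so tooling that depends on it breaks if the hook is clobbered. A minimal sketch of installing such a hook, purely for illustration on an older glibc:

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

/* Works only on glibc versions that still expose the (deprecated) malloc hooks. */
static void (*prev_free_hook)(void *, const void *);

static void logging_free_hook(void *ptr, const void *caller) {
    __free_hook = prev_free_hook;   /* restore so the real free runs un-hooked */
    fprintf(stderr, "free(%p) called from %p\n", ptr, caller);
    free(ptr);
    prev_free_hook = __free_hook;   /* re-install ourselves */
    __free_hook = logging_free_hook;
}

int main(void) {
    prev_free_hook = __free_hook;
    __free_hook = logging_free_hook;
    free(malloc(32));               /* this free is observed by the hook */
    return 0;
}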
References
Deep Dive into SplitLocks – Volcengine
TotalView Memory Debugging Requirements
AMD Split‑Lock Detection Patch
Best Practices for Avoiding Split‑Lock on Alibaba Cloud
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.