
Root Cause Analysis of Linux Kernel Hard Lockup on CPU 51

This article walks through a real Linux kernel hard lockup case. It explains what a hard lockup is, analyzes the stack traces and register values, identifies contention on a per-CPU runqueue spinlock, and shows how an inappropriate GFP flag caused interrupts to be enabled at the wrong time, leading to a deadlock. It closes with the fix.

Tencent Architect

Background

The business team reported a machine crash that generated a vmcore but showed no hardware anomalies. dmesg showed a hard LOCKUP on CPU 51, which triggered the panic.

[4664383.183725] NMI watchdog: Watchdog detected hard LOCKUP on cpu 51
[4664383.183750] Call Trace:
[4664383.183750]  _raw_spin_lock+0x1f/0x30
[4664383.183750]  raw_spin_rq_lock_nested+0x13/0x20
[4664383.183750]  online_fair_sched_group+0x45/0x120
[4664383.183750]  sched_online_group+0xec/0x110
[4664383.183751]  sched_autogroup_create_attach+0xc2/0x1d0
[4664383.183751]  ksys_setsid+0xe9/0x110
[4664383.183751]  __ia32_sys_setsid+0xe/0x20
[4664383.183751]  do_syscall_64+0x47/0x140
[4664383.183752]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[4664383.183752] RIP: 0033:0x7f89d3b8afdb
[4664383.183754] Kernel panic - not syncing: Hard LOCKUP

Q1: What is hard lockup?

Linux defines two lockup states. A soft lockup means the CPU stays in kernel mode too long; it is detected by a watchdog thread that checks timestamps. A hard lockup means the CPU stops responding to high-resolution timer (hrtimer) interrupts, typically because it has been running with interrupts disabled for too long; it is detected by periodic non-maskable interrupts (NMI) that verify the hrtimer interrupt counter is still increasing.

Q2: Why does CPU 51 stop responding to hrtimer interrupts?

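The stack on CPU 51 shows it was in online_fair_sched_group with interrupts disabled, spinning in raw_spin_rq_lock while waiting to acquire a runqueue lock. Because the lock never became available, the CPU could not service hrtimer interrupts, which eventually triggered the hard lockup.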

crash> bt
PID: 1000877  TASK: ff1101f9fda14000  CPU: 51  COMMAND: "crond"
#0 [fffffe0000b66960] machine_kexec at ffffffff810625ff
#1 [fffffe0000b669b8] __crash_kexec at ffffffff8113bb72
#2 [fffffe0000b66a88] panic at ffffffff81c51a2b
#12 [fffffe0000b66ef0] end_repeat_nmi at ffffffff81e01400
   [exception RIP: native_queued_spin_lock_slowpath+330]
   RIP: ffffffff810f080a  RSP: ffa000003b62fe28  RFLAGS: 00000046   <=== interrupts disabled

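RFLAGS 0x46 shows the interrupt flag IF (bit 9, mask 0x200) is cleared, so maskable interrupts such as the hrtimer interrupt cannot be delivered on this CPU.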

Q3: What lock is involved?

Register analysis shows RDI = ff1101fb83a2ee40, which is the address of rq->__lock, the spinlock protecting the per‑CPU runqueue structure struct rq.

crash> struct rq -o
struct rq {
    [0] raw_spinlock_t __lock;   // first member
    [4] unsigned int nr_running;
    [8] unsigned int bt_nr_running;
    ...
}

The lock belongs to CPU 56’s runqueue, as confirmed by:

crash> p runqueues |grep ff1101fb83a2ee40
[56]: ff1101fb83a2ee40    <=== CPU 51 is waiting on CPU 56’s runqueue

Q4: Who holds the lock?

The stack trace of CPU 56 shows it in the timer interrupt path, which interrupted a context switch: the CPU was inside finish_task_switch, in perf_event_task_sched_in, which eventually calls kmem_cache_alloc with GFP_KERNEL. That allocation enables interrupts (via local_irq_enable()) while the runqueue lock is still held.

static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
    if (gfpflags_allow_blocking(flags))
        local_irq_enable();   // enables interrupts
    ...
    if (gfpflags_allow_blocking(flags))
        local_irq_disable();  // disables again
}

A timer interrupt that arrives right after local_irq_enable() runs on CPU 56 and tries to acquire the same rq->__lock that CPU 56 already holds, causing a deadlock.

Q5: Why is interrupt enabled in the context of a context switch?

The path

finish_task_switch → perf_event_task_sched_in → intel_pmu_lbr_add → kmem_cache_alloc(GFP_KERNEL)

passes GFP_KERNEL, which includes __GFP_DIRECT_RECLAIM. gfpflags_allow_blocking() therefore returns true, and the allocator enables interrupts before the runqueue lock is released.

Q6: How does this lead to a hard lockup?

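Once interrupts are enabled while the runqueue spinlock is still held, an incoming timer interrupt on CPU 56 attempts to acquire the same lock. The handler spins on a lock held by the very code it interrupted, so the lock is never released. CPU 51, spinning on that lock with interrupts disabled, can no longer service hrtimer interrupts, and the NMI watchdog reports a hard lockup.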

Fix

Changing the allocation flag from GFP_KERNEL to GFP_ATOMIC prevents interrupts from being enabled in this critical section: GFP_ATOMIC does not include __GFP_DIRECT_RECLAIM, so the allocator never calls local_irq_enable().

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
@@ -700,7 +700,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
-    cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
+    cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_ATOMIC);
 }

The upstream fix instead moves the allocation out of the context switch path.

In summary, an inappropriate memory‑allocation flag caused interrupts to be enabled while a per‑CPU runqueue spinlock was held, leading to a deadlock and hard lockup. Adjusting the flag or restructuring the code eliminates the issue.
