
Root Cause Analysis of Linux Kernel Hard Lockup on CPU 51

This article walks through a real Linux kernel hard lockup case, explaining what hard lockup is, analyzing stack traces and register values, identifying a spinlock contention on a per‑CPU runqueue, and showing how an inappropriate GFP flag caused interrupts to be enabled at the wrong time, leading to a deadlock and the eventual fix.

Tencent Architect

Background

The business team reported a machine crash that produced a vmcore but showed no hardware anomalies. dmesg indicated a hard LOCKUP on CPU 51 that triggered a panic.

<code>[4664383.183725] NMI watchdog: Watchdog detected hard LOCKUP on cpu 51
[4664383.183750] Call Trace:
[4664383.183750]  _raw_spin_lock+0x1f/0x30
[4664383.183750]  raw_spin_rq_lock_nested+0x13/0x20
[4664383.183750]  online_fair_sched_group+0x45/0x120
[4664383.183750]  sched_online_group+0xec/0x110
[4664383.183751]  sched_autogroup_create_attach+0xc2/0x1d0
[4664383.183751]  ksys_setsid+0xe9/0x110
[4664383.183751]  __ia32_sys_setsid+0xe/0x20
[4664383.183751]  do_syscall_64+0x47/0x140
[4664383.183752]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[4664383.183752] RIP: 0033:0x7f89d3b8afdb
[4664383.183754] Kernel panic - not syncing: Hard LOCKUP</code>

Q1: What is hard lockup?

Linux defines two lockup states:

soft lockup – the CPU stays in kernel mode too long without scheduling, detected by a per-CPU watchdog thread that checks timestamps; and

hard lockup – the CPU stops responding to high-resolution timer (hrtimer) interrupts, detected by periodic non-maskable interrupts (NMIs) that verify the timer interrupt counter is still increasing.

Q2: Why does CPU 51 stop responding to hrtimer interrupts?

Examining the stack on CPU 51 shows it was in online_fair_sched_group with interrupts disabled, spinning in raw_spin_rq_lock waiting to acquire the lock, which eventually triggered the hard lockup.

<code>crash> bt
PID: 1000877  TASK: ff1101f9fda14000  CPU: 51  COMMAND: "crond"
#0 [fffffe0000b66960] machine_kexec at ffffffff810625ff
#1 [fffffe0000b669b8] __crash_kexec at ffffffff8113bb72
#2 [fffffe0000b66a88] panic at ffffffff81c51a2b
#12 [fffffe0000b66ef0] end_repeat_nmi at ffffffff81e01400
   [exception RIP: native_queued_spin_lock_slowpath+330]
   RIP: ffffffff810f080a  RSP: ffa000003b62fe28  RFLAGS: 00000046   <=== interrupts disabled</code>

RFLAGS 0x46 shows the interrupt flag (bit 9) is cleared.

Q3: What lock is involved?

Register analysis shows RDI = ff1101fb83a2ee40, which is the address of rq->__lock, the spinlock protecting the per-CPU runqueue structure struct rq.

<code>crash> struct rq -o
struct rq {
    [0] raw_spinlock_t __lock;   // first member
    [4] unsigned int nr_running;
    [8] unsigned int bt_nr_running;
    ...
}</code>

The lock belongs to CPU 56’s runqueue, as confirmed by:

<code>crash> p runqueues |grep ff1101fb83a2ee40
[56]: ff1101fb83a2ee40    <=== CPU 51 is waiting on CPU 56’s runqueue</code>

Q4: Who holds the lock?

The stack trace of CPU 56 shows it in the context-switch path: finish_task_switch calls perf_event_task_sched_in, which eventually calls kmem_cache_alloc with GFP_KERNEL. This allocation enables interrupts (via local_irq_enable()) while the runqueue lock is still held.

<code>static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
    if (gfpflags_allow_blocking(flags))
        local_irq_enable();   // enables interrupts
    ...
    if (gfpflags_allow_blocking(flags))
        local_irq_disable();  // disables again
}</code>

A timer interrupt arriving right after local_irq_enable() tries to acquire the same rq->__lock on CPU 56, causing a self-deadlock.

Q5: Why are interrupts enabled in the middle of a context switch?

The path finish_task_switch → perf_event_task_sched_in → intel_pmu_lbr_add → kmem_cache_alloc(GFP_KERNEL) passes the GFP_KERNEL flag, which includes __GFP_DIRECT_RECLAIM. The allocator therefore enables interrupts before the runqueue lock is released.

Q6: How does this lead to a hard lockup?

When interrupts are enabled while the runqueue spinlock is still held, an incoming timer interrupt on the same CPU attempts to acquire that same lock. The interrupt handler spins forever on a lock its own CPU holds, so CPU 56 never makes progress, and CPU 51, waiting for the same lock with interrupts disabled, is eventually flagged by the NMI watchdog as a hard lockup.

Fix

Changing the allocation flag from GFP_KERNEL to GFP_ATOMIC prevents interrupts from being enabled inside this critical section.

<code>diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
@@ -700,7 +700,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
-    cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
+    cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_ATOMIC);
 }</code>

The upstream fix instead moves the allocation out of the context switch path.

In summary, an inappropriate memory‑allocation flag caused interrupts to be enabled while a per‑CPU runqueue spinlock was held, leading to a deadlock and hard lockup. Adjusting the flag or restructuring the code eliminates the issue.
