Root Cause Analysis of Linux Kernel Hard Lockup on CPU 51
This article walks through a real Linux kernel hard lockup case. It explains what a hard lockup is, analyzes the stack traces and register values, identifies contention on a per‑CPU runqueue spinlock, and shows how an inappropriate GFP flag enabled interrupts at the wrong time and caused a deadlock. It ends with the fix.
Background
A business team reported a machine crash. The crash produced a vmcore, and no hardware anomalies were found. dmesg showed that a hard LOCKUP on CPU 51 triggered the panic.
[4664383.183725] NMI watchdog: Watchdog detected hard LOCKUP on cpu 51
[4664383.183750] Call Trace:
[4664383.183750] _raw_spin_lock+0x1f/0x30
[4664383.183750] raw_spin_rq_lock_nested+0x13/0x20
[4664383.183750] online_fair_sched_group+0x45/0x120
[4664383.183750] sched_online_group+0xec/0x110
[4664383.183751] sched_autogroup_create_attach+0xc2/0x1d0
[4664383.183751] ksys_setsid+0xe9/0x110
[4664383.183751] __ia32_sys_setsid+0xe/0x20
[4664383.183751] do_syscall_64+0x47/0x140
[4664383.183752] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[4664383.183752] RIP: 0033:0x7f89d3b8afdb
[4664383.183754] Kernel panic - not syncing: Hard LOCKUP
Q1: What is hard lockup?
Linux distinguishes two lockup states. A soft lockup means a CPU stays in kernel mode too long without scheduling; it is detected by a per‑CPU watchdog thread whose timestamp stops being updated. A hard lockup means a CPU stops responding to high‑resolution timer (hrtimer) interrupts, typically because it is stuck with interrupts disabled; it is detected by a periodic non‑maskable interrupt (NMI) that checks whether the CPU's hrtimer interrupt counter is still increasing.
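The counter check described above can be modeled in a few lines of userspace C. This is a minimal sketch, not the kernel's implementation: the struct and function names are hypothetical, and the real detector keeps this state per CPU and runs the check from NMI context.

```c
#include <stdbool.h>

/* Hypothetical per-CPU watchdog state (names are illustrative). */
struct cpu_watchdog {
    unsigned long hrtimer_interrupts;       /* bumped by the hrtimer interrupt */
    unsigned long hrtimer_interrupts_saved; /* value seen at the previous NMI check */
};

/* Runs from the maskable hrtimer interrupt: proves interrupts still get through. */
static void watchdog_timer_tick(struct cpu_watchdog *wd)
{
    wd->hrtimer_interrupts++;
}

/* Runs from the NMI: true means no hrtimer interrupt ran since the last check. */
static bool watchdog_nmi_check(struct cpu_watchdog *wd)
{
    if (wd->hrtimer_interrupts == wd->hrtimer_interrupts_saved)
        return true;  /* counter is stuck: report a hard lockup */
    wd->hrtimer_interrupts_saved = wd->hrtimer_interrupts;
    return false;
}
```

A CPU that spins with interrupts disabled never runs watchdog_timer_tick(), so two consecutive calls to watchdog_nmi_check() see the same counter value and the second one reports the lockup.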
Q2: Why does CPU 51 stop responding to hrtimer interrupts?
The stack on CPU 51 shows it was in online_fair_sched_group with interrupts disabled, spinning in raw_spin_rq_lock to acquire a runqueue lock. Because the lock was never released, the CPU could not service hrtimer interrupts, and the NMI watchdog eventually reported the hard lockup.
crash> bt
PID: 1000877 TASK: ff1101f9fda14000 CPU: 51 COMMAND: "crond"
#0 [fffffe0000b66960] machine_kexec at ffffffff810625ff
#1 [fffffe0000b669b8] __crash_kexec at ffffffff8113bb72
#2 [fffffe0000b66a88] panic at ffffffff81c51a2b
#12 [fffffe0000b66ef0] end_repeat_nmi at ffffffff81e01400
[exception RIP: native_queued_spin_lock_slowpath+330]
RIP: ffffffff810f080a RSP: ffa000003b62fe28 RFLAGS: 00000046 <=== interrupts disabled
RFLAGS is 0x46, so the interrupt flag IF (bit 9, mask 0x200) is clear: interrupts are disabled on CPU 51.
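The bit test behind that reading is small enough to write out. The helper name below is hypothetical; the bit position is the architectural IF flag on x86.

```c
#include <stdbool.h>
#include <stdint.h>

/* IF, the interrupt enable flag, is bit 9 of RFLAGS on x86. */
#define RFLAGS_IF (UINT64_C(1) << 9)

/* Hypothetical helper: true if the saved RFLAGS value had interrupts enabled. */
static bool rflags_irqs_enabled(uint64_t rflags)
{
    return (rflags & RFLAGS_IF) != 0;
}
```

For the value in the vmcore, rflags_irqs_enabled(0x46) is false, whereas a value such as 0x246 would have bit 9 set.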
Q3: What lock is involved?
Register analysis shows RDI = ff1101fb83a2ee40, which is the address of rq->__lock, the spinlock protecting the per‑CPU runqueue structure struct rq.
crash> struct rq -o
struct rq {
[0] raw_spinlock_t __lock; // first member
[4] unsigned int nr_running;
[8] unsigned int bt_nr_running;
...
}
Because __lock sits at offset 0, the lock address is also the address of its struct rq. The lock belongs to CPU 56’s runqueue, as confirmed by:
crash> p runqueues |grep ff1101fb83a2ee40
[56]: ff1101fb83a2ee40 <=== CPU 51 is waiting on CPU 56’s runqueue
Q4: Who holds the lock?
The stack trace of CPU 56 shows it in the timer interrupt path, entered while the CPU was inside finish_task_switch and perf_event_task_sched_in, which eventually calls kmem_cache_alloc with GFP_KERNEL. For allocations that may block, the slab allocator enables interrupts (via local_irq_enable()), and here it does so while CPU 56 still holds its runqueue lock.
static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
    if (gfpflags_allow_blocking(flags))
        local_irq_enable();   // interrupts enabled for a blocking allocation
    ...
    if (gfpflags_allow_blocking(flags))
        local_irq_disable();  // interrupts disabled again afterwards
}
A timer interrupt that arrives right after local_irq_enable() tries to acquire the same rq->__lock, which CPU 56 already holds, so CPU 56 deadlocks against itself.
Q5: Why are interrupts enabled during a context switch?
The path
finish_task_switch → perf_event_task_sched_in → intel_pmu_lbr_add → kmem_cache_alloc(GFP_KERNEL)
passes GFP_KERNEL, which includes __GFP_DIRECT_RECLAIM. gfpflags_allow_blocking() therefore returns true, and the allocator enables interrupts before the runqueue lock is released.
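The flag test can be illustrated with a small userspace model. This is a sketch under stated assumptions: the MODEL_ names are hypothetical, the bit values are illustrative (the real ones depend on the kernel version), and the flag compositions are simplified. The only property it is meant to show is that the blocking check looks at the direct-reclaim bit, which GFP_KERNEL carries and GFP_ATOMIC does not.

```c
#include <stdbool.h>

/* Illustrative bit values only; real kernels define these per version. */
#define MODEL_GFP_HIGH            0x20u
#define MODEL_GFP_IO              0x40u
#define MODEL_GFP_FS              0x80u
#define MODEL_GFP_DIRECT_RECLAIM  0x400u
#define MODEL_GFP_KSWAPD_RECLAIM  0x800u

/* Simplified compositions: GFP_KERNEL may sleep and reclaim directly,
 * GFP_ATOMIC may only wake the background reclaimer. */
#define MODEL_GFP_KERNEL (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM | \
                          MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_ATOMIC (MODEL_GFP_HIGH | MODEL_GFP_KSWAPD_RECLAIM)

/* Model of the blocking check: an allocation may block only if it is
 * allowed to perform direct reclaim. */
static bool model_allow_blocking(unsigned int flags)
{
    return (flags & MODEL_GFP_DIRECT_RECLAIM) != 0;
}
```

In this model, model_allow_blocking(MODEL_GFP_KERNEL) is true and model_allow_blocking(MODEL_GFP_ATOMIC) is false, which matches why only the GFP_KERNEL allocation reaches the local_irq_enable() branch.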
Q6: How does this lead to a hard lockup?
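Once interrupts are enabled while CPU 56 still holds its runqueue spinlock, the incoming timer interrupt on CPU 56 tries to take the same lock and spins forever, because the lock holder is the interrupted code underneath it. The lock is never released, so CPU 51, which is spinning on that lock with interrupts disabled, stops servicing hrtimer interrupts and the NMI watchdog reports a hard lockup.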
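The self-deadlock can be reproduced in miniature with a userspace spinlock. This is a hedged model, not kernel code: all names are hypothetical, the "interrupt handler" is an ordinary function call, and the spin is bounded so the example terminates instead of hanging.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a raw spinlock. */
struct model_spinlock {
    atomic_flag locked;
};

static bool model_trylock(struct model_spinlock *l)
{
    /* test_and_set returns the previous value: false means we took the lock. */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

static void model_unlock(struct model_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Stand-in for a timer interrupt handler that needs the runqueue lock.
 * A real handler would spin without limit; this one gives up after
 * max_spins attempts and returns false if it never got the lock. */
static bool model_timer_irq(struct model_spinlock *rq_lock, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (model_trylock(rq_lock)) {
            model_unlock(rq_lock);
            return true;
        }
    }
    return false;
}
```

If the caller takes the lock and then invokes model_timer_irq() on the same lock before unlocking, the handler can never succeed, however many times it spins. The interrupted code cannot resume to release the lock until the handler returns, which is the circular wait seen on CPU 56.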
Fix
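Changing the allocation flag from GFP_KERNEL to GFP_ATOMIC keeps interrupts disabled in this critical section: GFP_ATOMIC does not include __GFP_DIRECT_RECLAIM, so gfpflags_allow_blocking() returns false and allocate_slab() never calls local_irq_enable().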
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
@@ -700,7 +700,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
- cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_KERNEL);
+ cpuc->lbr_xsave = kmem_cache_alloc(kmem_cache, GFP_ATOMIC);
}
The upstream fix instead moves the allocation out of the context switch path.
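The idea of moving the allocation out of the context switch path can be sketched as a reserve-then-use pattern. This is an illustration only: the function and struct names are hypothetical, malloc stands in for the kernel allocator, and the actual upstream patch is not reproduced here.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-CPU state holding the buffer pointer. */
struct model_cpu_state {
    void *lbr_xsave;
};

/* Called from a context that is allowed to block (no scheduler lock held):
 * allocate the buffer ahead of time. */
static bool model_reserve_lbr_buffer(struct model_cpu_state *c, size_t size)
{
    if (!c->lbr_xsave)
        c->lbr_xsave = malloc(size);
    return c->lbr_xsave != NULL;
}

/* Called from the context switch path: never allocates, only checks
 * whether a buffer was reserved earlier. */
static bool model_lbr_add(const struct model_cpu_state *c)
{
    return c->lbr_xsave != NULL;
}

/* Called when the buffer is no longer needed. */
static void model_release_lbr_buffer(struct model_cpu_state *c)
{
    free(c->lbr_xsave);
    c->lbr_xsave = NULL;
}
```

With this split, the code that runs under the runqueue lock performs no allocation at all, so the choice of GFP flag no longer matters on that path.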
In summary, an inappropriate memory‑allocation flag caused interrupts to be enabled while a per‑CPU runqueue spinlock was held, leading to a deadlock and hard lockup. Adjusting the flag or restructuring the code eliminates the issue.
Tencent Architect
We share technical insights on storage, computing, and access, and explore industry-leading product technologies together.
