How to Diagnose Kernel Panic: A Step‑by‑Step Guide to Finding the Root Cause
This article explains what a Linux kernel panic is, enumerates common hardware and driver causes, walks through the panic() function internals, and provides a practical troubleshooting workflow with log analysis, debugging tools, and a concrete driver example to help operators quickly locate and fix the underlying fault.
1. Understanding Kernel Panic
A kernel panic is the Linux kernel's protective response to an unrecoverable error such as hardware failure, corrupted core data structures, or a buggy driver. When triggered, the kernel stops all normal operation, prints detailed diagnostics (error type, CPU registers, call stack, error code), and then either halts or reboots, depending on the configured panic timeout.
Less severe errors behave differently. An Oops records a kernel exception that is not necessarily fatal: the offending task is usually killed and the system keeps running, although possibly in a degraded state. A user-space crash only terminates the offending application. Only a panic stops the entire system, requiring a manual or automatic restart.
2. Common Panic Triggers
2.1 Hardware Issues
Memory errors: physical defects or ECC failures cause corrupted reads/writes, leading the kernel to abort.
CPU overheating or defects: excessive temperature or silicon faults produce unstable instruction execution.
Disk I/O failures: bad sectors prevent reading critical system files, forcing a panic.
2.2 Driver Problems
Invalid pointer dereferences are a frequent driver bug. The following driver snippet deliberately writes to a NULL pointer, illustrating the fault:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

static int __init my_driver_init(void)
{
	int *ptr = NULL;	/* define a null pointer */

	*ptr = 10;	/* illegal write: raises an Oops, or a panic if panic_on_oops is set */
	return 0;
}

static void __exit my_driver_exit(void)
{
	printk(KERN_INFO "My driver removed\n");
}

module_init(my_driver_init);
module_exit(my_driver_exit);
MODULE_LICENSE("GPL");
Similar issues arise from interrupt-handler mistakes, memory leaks, or out-of-bounds accesses, all of which can corrupt kernel state and cause a panic.
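The NULL-pointer and out-of-bounds cases above share one defence: validate the pointer and the index before touching memory. The following is a minimal user-space sketch of that pattern, not driver code from the article. The helper name `checked_store` and its error codes are assumptions chosen for illustration; they only mirror the kernel convention of returning a negative errno value.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the article): store `val` at buf[idx]
 * only after the pointer and the index have been validated.
 * Returns 0 on success or a negative errno value on failure.
 */
int checked_store(int *buf, size_t len, size_t idx, int val)
{
    if (buf == NULL)
        return -EINVAL;   /* would have been a NULL dereference */
    if (idx >= len)
        return -ERANGE;   /* would have been an out-of-bounds write */

    buf[idx] = val;
    return 0;
}
```

A caller checks the return value instead of assuming the write happened, so a bad pointer or index becomes an ordinary error path rather than a fault.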
2.3 Kernel Logic Errors
Deadlocks: multiple kernel threads hold locks while waiting for each other, halting progress.
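Hung tasks: a task stuck in uninterruptible sleep for too long is reported by the hung-task detector, and a CPU spinning in an infinite loop inside the kernel is caught by the lockup watchdogs. Either condition escalates to a panic only when the matching option (for example hung_task_panic) is enabled.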
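The deadlock described above is a cycle in a "waits-for" relation: thread A waits for a lock held by B while B waits for a lock held by A. The sketch below is a user-space model of that idea only. It is not how the kernel's own lock validator works, and the names `has_deadlock` and `NO_WAIT` are assumptions made for the example. Each thread is assumed to wait on at most one other thread.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_WAIT (-1)

/*
 * waits_for[i] is the index of the thread holding the lock that thread i
 * is blocked on, or NO_WAIT if thread i is not blocked.
 * Returns true if some thread, by following the chain of waits,
 * ends up waiting on itself.
 */
bool has_deadlock(const int *waits_for, size_t n)
{
    for (size_t start = 0; start < n; start++) {
        int cur = waits_for[start];

        /* A chain that returns to `start` does so within n hops. */
        for (size_t hops = 0; hops < n; hops++) {
            if (cur < 0 || (size_t)cur >= n)
                break;          /* chain ends: not blocked, or bad index */
            if ((size_t)cur == start)
                return true;    /* came back to where we started */
            cur = waits_for[cur];
        }
    }
    return false;
}
```

With two threads, `{1, 0}` models the classic AB/BA lock inversion and is reported as a deadlock, while `{1, NO_WAIT}` is just one thread waiting on another that can still make progress.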
3. Deep Dive into Panic Mechanics
3.1 Panic Trigger Path
When a fatal exception occurs (e.g., a NULL pointer dereference), the kernel generates an Oops and records the fault. If the context is unrecoverable (such as interrupt context), or if the panic_on_oops option is set, the Oops escalates to a panic.
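The escalation decision can be summarised as a small truth table. The sketch below is a simplified user-space model, not kernel source: the type and function names are assumptions, and real architectures add further conditions. It covers three commonly seen cases: a fault in interrupt context, the panic_on_oops setting, and a fault in the init process (which cannot simply be killed).

```c
#include <stdbool.h>

/* Hypothetical model types, not kernel definitions. */
enum oops_outcome { OOPS_KILL_TASK, OOPS_PANIC };

struct oops_context {
    bool in_interrupt;   /* fault happened in interrupt context */
    bool panic_on_oops;  /* administrator asked for panic on any Oops */
    bool is_init;        /* faulting task is the init process */
};

/* Decide whether an Oops stays an Oops or becomes a panic. */
enum oops_outcome oops_escalation(const struct oops_context *ctx)
{
    if (ctx->in_interrupt)
        return OOPS_PANIC;      /* no task to kill safely */
    if (ctx->panic_on_oops)
        return OOPS_PANIC;      /* policy: treat every Oops as fatal */
    if (ctx->is_init)
        return OOPS_PANIC;      /* the system cannot run without init */
    return OOPS_KILL_TASK;      /* ordinary Oops: kill the offending task */
}
```

Reading a panic log with this table in mind helps decide whether the interesting evidence is the Oops that preceded the panic or the panic message itself.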
3.2 panic() Function Flow
Disable local interrupts – prevents new interrupts from interfering with panic handling.
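Acquire panic lock – ensures only one CPU processes the panic. In older kernels, if spin_trylock(&panic_lock) fails, the CPU calls panic_smp_self_stop() to halt itself; newer kernels make the same decision with an atomic compare-and-exchange on a panic_cpu variable.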
Print error information – formats the panic message and outputs it via pr_emerg().
Stop other CPUs – smp_send_stop() signals all other CPUs to cease execution.
Notify registered modules – the kernel walks the panic_notifier_list and calls each notifier’s callback.
Dump kernel logs – kmsg_dump(KMSG_DUMP_PANIC) saves the log buffer for post‑mortem analysis.
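Trigger kdump (if configured) – crash_kexec() boots a small capture kernel to write a vmcore file. By default this is attempted early, before the other CPUs are stopped and the notifiers run; it happens at this later point only when crash_kexec_post_notifiers is set.
Keyboard blink and reboot – if panic_timeout > 0, the kernel waits that many seconds, flashing the keyboard LEDs where a blink handler is available, and then calls emergency_restart(). If panic_timeout is 0, the machine stays halted until it is reset manually.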
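Two decisions in the flow above are easy to model outside the kernel: "only the first CPU to arrive handles the panic" and "what happens at the end depends on panic_timeout". The following is a user-space sketch under stated assumptions, not kernel code. The names `panic_try_claim` and `panic_final_action` are invented for the example, and a C11 `atomic_flag` stands in for the kernel's own locking.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the panic lock: set once, never cleared. */
static atomic_flag panic_claimed = ATOMIC_FLAG_INIT;

/*
 * Returns true only for the first caller. Later callers get false,
 * which models the CPUs that must stop themselves instead of
 * running the panic path a second time.
 */
bool panic_try_claim(void)
{
    return !atomic_flag_test_and_set(&panic_claimed);
}

enum panic_final {
    PANIC_HALT_FOREVER,        /* timeout == 0: wait for manual reset */
    PANIC_REBOOT_AFTER_DELAY,  /* timeout  > 0: wait, then restart */
    PANIC_REBOOT_NOW           /* timeout  < 0: restart immediately */
};

/* Map the panic_timeout setting to the final action of the panic path. */
enum panic_final panic_final_action(int panic_timeout)
{
    if (panic_timeout > 0)
        return PANIC_REBOOT_AFTER_DELAY;
    if (panic_timeout < 0)
        return PANIC_REBOOT_NOW;
    return PANIC_HALT_FOREVER;
}
```

On a running system the timeout is exposed as the kernel.panic sysctl, so a machine that "just hangs" after a panic is often simply configured with a timeout of 0.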
4. Practical Troubleshooting Workflow
4.1 Identify Panic Symptoms
Screen shows a panic message such as "Kernel panic - not syncing: Fatal exception in interrupt".
System becomes completely unresponsive.
Automatic reboot may occur.
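After the reboot, the previous kernel log may be available in /proc/last_kmsg (on some older Android kernels) or as files under /sys/fs/pstore (when pstore/ramoops is configured).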
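The panic banner in the first symptom has a fixed prefix, "Kernel panic - not syncing: ", followed by the reason string. A small sketch of pulling that reason out of a saved log line is shown below. The function name `extract_panic_reason` is an assumption for this example, and the code makes no attempt to handle multi-line messages.

```c
#include <stddef.h>
#include <string.h>

#define PANIC_PREFIX "Kernel panic - not syncing: "

/*
 * If `line` contains the panic banner, copy the text that follows it
 * into `out` (truncated to fit, always NUL-terminated, without the
 * trailing newline) and return 1. Otherwise return 0.
 */
int extract_panic_reason(const char *line, char *out, size_t out_len)
{
    const char *hit;
    size_t n;

    if (line == NULL || out == NULL || out_len == 0)
        return 0;

    hit = strstr(line, PANIC_PREFIX);
    if (hit == NULL)
        return 0;

    hit += strlen(PANIC_PREFIX);
    n = strcspn(hit, "\r\n");       /* stop at the end of the line */
    if (n >= out_len)
        n = out_len - 1;            /* leave room for the terminator */

    memcpy(out, hit, n);
    out[n] = '\0';
    return 1;
}
```

Feeding each line of a recovered log through this function gives a quick answer to "did the previous boot end in a panic, and why", before reading the full call trace.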
4.2 Retrieve and Analyse Logs
cat /proc/last_kmsg
cat /sys/fs/pstore/console-ramoops-0
Search for keywords:
grep -i "panic" /proc/last_kmsg
grep -i "Oops" /proc/last_kmsg
4.3 Use Debugging Tools
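addr2line -e vmlinux -f 0xffffffff810a1b2c – map an address to its function, source file, and line (vmlinux must be built with debug information).
objdump -D -M intel my_module.ko – disassemble a kernel module.
gdb vmlinux – inspect symbols and source offline, or set breakpoints when attached to a live kernel through kgdb or a virtual machine's debug stub.
crash vmlinux vmcore – analyze a kdump crash dump (commands: bt, log, ps).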
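Call traces in a panic log usually name each frame as symbol+0xoffset/0xsize, and that symbol and offset are what the tools above need. The sketch below splits such an entry into its parts. The struct and function names are assumptions for this example, and the sample frame used in the usage note is hypothetical rather than taken from a real log.

```c
#include <stdio.h>

/* Parsed form of a "symbol+0xoffset/0xsize" call-trace entry. */
struct sym_ref {
    char name[64];          /* function name */
    unsigned long offset;   /* offset of the faulting instruction */
    unsigned long size;     /* total size of the function */
};

/*
 * Parse `text` into `out`. Returns 1 when all three fields were found,
 * 0 otherwise. Anything after the size (such as a "[module]" suffix)
 * is ignored.
 */
int parse_sym_ref(const char *text, struct sym_ref *out)
{
    if (text == NULL || out == NULL)
        return 0;

    /* %63[^+ ] reads the name up to '+' or a space, bounded to fit. */
    return sscanf(text, " %63[^+ ]+0x%lx/0x%lx",
                  out->name, &out->offset, &out->size) == 3;
}
```

For a hypothetical frame such as `my_buggy_function+0x1c/0x40 [my_driver]`, the parsed name and offset can be handed to gdb as `list *(my_buggy_function+0x1c)` to show the source line, provided the module was built with debug information.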
5. Case Study: Real‑World Panic
A high‑traffic web server in a data center experienced repeated panics under load. Logs showed a NULL‑pointer dereference in my_buggy_function on CPU 0, PID 1.
[123.456789] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[...]
[123.456803] Kernel panic - not syncing: Fatal exception
Using addr2line, the address resolved to /usr/src/linux/drivers/misc/my_driver.c:42, confirming the buggy driver code shown earlier.
5.1 Fix Implementation
The fix replaces the NULL pointer with a properly allocated buffer and adds a NULL check:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/slab.h>

static int __init my_driver_init(void)
{
	int *ptr = kmalloc(sizeof(*ptr), GFP_KERNEL);

	if (!ptr)
		return -ENOMEM;	/* allocation failed: report it instead of dereferencing */

	*ptr = 10;
	kfree(ptr);
	return 0;
}

static void __exit my_driver_exit(void)
{
	printk(KERN_INFO "My driver removed\n");
}

module_init(my_driver_init);
module_exit(my_driver_exit);
MODULE_LICENSE("GPL");
Uses kmalloc to obtain valid memory.
Checks the allocation result before dereferencing.
Frees the memory with kfree to avoid leaks.
5.2 Preventive Measures
Adopt rigorous code review focusing on null‑pointer checks, memory‑access safety, and boundary validation; add extensive kernel‑level logging; keep the kernel up‑to‑date with security patches; and regularly monitor hardware health (memory, disks, CPU) to reduce recurrence.
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
