
Understanding and Implementing Hungtask Detection in the Linux Kernel

This article explains Linux hung-task detection. It covers two approaches: a system-wide poll that flags D-state tasks whose context-switch counts have not changed between checks, and a watchdog scheme for critical processes. It then walks through the kernel implementations, analyzes real-world hang cases, and stresses log analysis and parameter tuning as the way to diagnose and prevent system hangs.

OPPO Kernel Craftsman

The article introduces system hangs in Linux: tasks become unresponsive, stay in uninterruptible sleep (D state), or cause important processes such as Android's system_server to stall. It focuses on detecting tasks that remain in D state for a long time, a mechanism known as hungtask detection, and notes that recovery typically means a full system reboot once an anomaly is found.

Hungtask detection principles and workflow

Two main approaches are described:

A) Periodically poll all tasks in the system. For each task in D state, compare its combined count of voluntary and involuntary context switches with the value recorded at the previous check, and flag the task as hung if the count has not changed. The method can be refined to ignore tasks whose hang is harmless and to concentrate on important processes (e.g., system_server, surfaceflinger) whose hang directly affects the user experience. It can also cover tasks that stay in I/O wait for an extended period.
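The comparison in approach A can be approximated from user space. The sketch below is not the kernel implementation; it is a minimal Python model that assumes the `State`, `voluntary_ctxt_switches`, and `nonvoluntary_ctxt_switches` fields of `/proc/<pid>/status`. The function names `parse_status`, `snapshot`, and `suspected_hung` are invented for illustration, and only thread-group leaders are inspected, whereas the kernel walks every thread.

```python
import os


def parse_status(text):
    """Extract the state letter and total context switches from status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    state = fields.get("State", "?")[:1]
    switches = (int(fields.get("voluntary_ctxt_switches", 0))
                + int(fields.get("nonvoluntary_ctxt_switches", 0)))
    return state, switches


def snapshot():
    """Map pid -> (state, switches) for every process visible in /proc."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                result[int(entry)] = parse_status(f.read())
        except OSError:
            continue  # the process exited between listdir() and open()
    return result


def suspected_hung(previous, current):
    """PIDs in D state in both snapshots with an unchanged switch count."""
    return sorted(
        pid for pid, (state, switches) in current.items()
        if state == "D" and previous.get(pid) == ("D", switches)
    )
```

Taking two `snapshot()` results some seconds apart and passing them to `suspected_hung` yields candidates only; a task that legitimately waits on slow I/O looks the same, which is why the article pairs this check with a timeout.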

B) Monitor only a set of important processes. Each important process performs a watchdog‑like operation within a defined time window; failure to “feed the dog” indicates that the process is hung.
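Approach B can be pictured as a table of deadlines, one per monitored process. The following is a minimal Python model under assumed names (`ProcessWatchdog`, `register`, `kick`, `hung` are all hypothetical); the clock is injectable so the behaviour can be exercised without waiting in real time.

```python
import time


class ProcessWatchdog:
    """Tracks the last kick per monitored process and reports missed windows."""

    def __init__(self, window_secs, clock=time.monotonic):
        self.window_secs = window_secs
        self.clock = clock
        self.last_kick = {}

    def register(self, name):
        """Start monitoring a process; registration counts as the first kick."""
        self.last_kick[name] = self.clock()

    def kick(self, name):
        """Called by the monitored process to show it is still making progress."""
        if name not in self.last_kick:
            raise KeyError(f"{name} is not registered")
        self.last_kick[name] = self.clock()

    def hung(self):
        """Names of processes that have not kicked within the window."""
        now = self.clock()
        return sorted(name for name, last in self.last_kick.items()
                      if now - last > self.window_secs)
```

The design choice here is that the monitored process must act to stay healthy: silence is treated as failure, so a process stuck in D state is caught without the monitor having to inspect its scheduler state.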

The article then details the implementation of both approaches, referencing open-source kernel files such as kernel/hung_task.c (default implementation) and vendor-specific extensions like drivers/soc/qcom/hung_task_enh.c. Configuration options are defined in Kconfig files (e.g., lib/Kconfig.debug, drivers/soc/qcom/Kconfig).

Code analysis – polling all tasks

The analysis explains how panic notifiers, power-management notifiers, and a dedicated kernel thread (khungtaskd, which runs the watchdog function) cooperate to detect hung tasks. Key steps include:

Registering a panic_block notifier to abort detection if the system panics.

Registering a hungtask_pm_notify_nb notifier to track suspend/resume events, so that detection is paused while the system is suspending or hibernating.

Running a kernel thread that periodically checks tasks based on sysctl_hung_task_timeout_secs and sysctl_hung_task_check_interval_secs.

Limiting the number of tasks examined per interval, using RCU lock break to avoid long grace periods, and employing vendor hooks (e.g., qcom_before_check_tasks, qcom_check_tasks_done) to apply GKI-compliant policies.
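The last two steps, choosing how often to scan and bounding the work done per scan, can be sketched as follows. This is a simplified Python model, not kernel code: the parameter names `check_count`, `lock_break_ticks`, and `cost_per_task` are assumptions for illustration, and the simulated tick counter stands in for jiffies. `effective_interval` reflects how mainline kernel/hung_task.c appears to combine the two sysctls (an interval of 0 falls back to the timeout, and the interval never exceeds the timeout).

```python
def effective_interval(timeout_secs, check_interval_secs):
    """Seconds between scans, derived from the two sysctl values."""
    if check_interval_secs == 0:
        return timeout_secs
    return min(check_interval_secs, timeout_secs)


def scan_tasks(tasks, check_count, lock_break_ticks, cost_per_task=1):
    """Return (examined, breaks): tasks visited and how often the lock was dropped."""
    examined = []
    breaks = 0
    now = 0            # simulated tick counter standing in for jiffies
    last_break = now
    for task in tasks:
        if len(examined) >= check_count:
            break      # the per-scan budget is used up
        if now - last_break > lock_break_ticks:
            breaks += 1            # the kernel would drop and re-take the RCU read lock here
            last_break = now
        examined.append(task)
        now += cost_per_task
    return examined, breaks
```

Both limits serve the same purpose: the detector must not itself become a source of long stalls on a system that has many tasks.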

The detection logic skips tasks that are frozen or involved in vfork. It also skips any task whose context-switch counter has changed since the previous check, because a changed counter shows that the task has run and is not truly hung.
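The per-task decision described above can be written as a small state update. The sketch below is a Python model with hypothetical field names (`frozen`, `freezer_skip`, `last_switch_count`, `last_switch_time`); it assumes the vfork case is represented by a "freezer should skip this task" flag and that a task is reported only once its counter has been unchanged for the full timeout.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    frozen: bool
    freezer_skip: bool         # e.g. a parent waiting for a vfork child (assumed mapping)
    switch_count: int          # voluntary + involuntary context switches so far
    last_switch_count: int     # value recorded at the previous check
    last_switch_time: float    # when the counter was last seen to change


def check_task(task, now, timeout_secs):
    """Return True only if the task looks hung; update bookkeeping otherwise."""
    if task.frozen or task.freezer_skip:
        return False                        # expected to sit still, not a hang
    if task.switch_count != task.last_switch_count:
        task.last_switch_count = task.switch_count
        task.last_switch_time = now         # progress was made; restart the timer
        return False
    return now - task.last_switch_time >= timeout_secs
```

Recording the time of the last observed change, rather than only the count, lets the check interval be shorter than the timeout without reporting a task too early.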

Code analysis – focusing on important processes

This part examines the MediaTek hang monitor implementation (drivers/misc/mediatek/monitor_hang/hang_detect.c). It registers a misc device (RT_Monitor) whose write/ioctl interface acts like a watchdog “kick”. Two kernel threads are created:

hang_detect – the main detection thread that decrements a counter every HD_INTER seconds; when the counter reaches zero, it triggers a dump or a BUG‑induced reboot.

hang_detect1 – dumps system state when a hang is detected; a possible hang_detect2 thread handles repeated failures.

The detection flow involves decreasing hang_detect_counter, waking up the dump thread, and eventually forcing a reboot if the system remains hung.
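The countdown flow can be modelled in a few lines. This Python sketch is not the MediaTek driver: the class `HangCountdown` and its methods are invented, and the two-stage response (request a dump on the first expiry, force a reboot if the next pass still sees no kick) is an assumption made to illustrate the "dump, then reboot" sequence described above.

```python
class HangCountdown:
    """Counter reloaded by a userspace kick and decremented by a detection thread."""

    def __init__(self, initial_ticks):
        self.initial_ticks = initial_ticks
        self.counter = initial_ticks
        self.dump_requested = False

    def kick(self, ticks=None):
        """Model of the write/ioctl on the misc device that reloads the counter."""
        self.counter = self.initial_ticks if ticks is None else ticks
        self.dump_requested = False

    def tick(self):
        """One pass of the detection thread, run once per polling period."""
        if self.counter > 0:
            self.counter -= 1
        if self.counter > 0:
            return "ok"
        if not self.dump_requested:
            self.dump_requested = True
            return "dump"      # first expiry: wake the dump thread
        return "reboot"        # still not kicked on a later pass: force a reboot
```

Each call to `tick()` corresponds to one HD_INTER period in the description above, so the hang timeout is the initial counter value multiplied by that period.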

Problem analysis

The article discusses how to interpret kernel logs, task stack traces, and ramdump files to pinpoint the root cause of hung tasks. Common causes include memory shortage, allocation failures, UFS or filesystem errors, lock deadlocks, and interrupt storms. Two real‑world cases are presented:

Random crashes during power‑on/off testing, traced to an audio amplifier interrupt storm that caused many tasks to block on I/O.

Crashes during Monkey testing, where surfaceflinger hung while waiting for a mutex held by kworker/u16:14, ultimately linked to RPM communication failures and memory allocation issues.

In both cases, adjusting kernel parameters (e.g., memory reclamation settings) mitigated the problems.
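Log analysis of this kind usually starts by locating the hung-task reports themselves. The sketch below assumes the report line that kernel/hung_task.c prints, "INFO: task <comm>:<pid> blocked for more than <N> seconds."; the function name `find_hung_reports` is hypothetical, and the example is a starting point for triage rather than a complete log parser.

```python
import re

# The task name may itself contain colons (e.g. kworker/u16:14), so the greedy
# match on comm is resolved by the trailing ":<pid> blocked" part.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_reports(log_text):
    """Return (comm, pid, seconds) for each hung-task report found in the log."""
    reports = []
    for line in log_text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            reports.append((match.group("comm"),
                            int(match.group("pid")),
                            int(match.group("secs"))))
    return reports
```

Feeding it the text of a saved kernel log gives the list of blocked tasks; the stack trace that follows each report in the log is then the place to look for the lock or I/O the task is waiting on.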

Conclusion

The article summarizes hungtask detection techniques, their implementation in the Linux kernel, and typical reasons for hung tasks. It emphasizes the importance of monitoring task states, analyzing kernel logs and ramdumps, and tuning system parameters to resolve hangs.

Tags: Android, Linux, system monitoring, kernel debugging, hungtask, process hang