Why Does My Upgraded Chumby 8 Show 100% CPU Usage? Uncovering a Hidden Kernel Timer Bug
After upgrading a PXA166‑based Chumby 8 from Linux 2.6.28 to 6.x, the top command constantly reported 100% CPU usage. The investigation led the author through profiling, kernel source analysis, and procfs inspection to a timer‑register read sequencing bug, which was finally fixed by adjusting the delay in the timer_read function.
1. Confirming the bug is not introduced by Linux 6.x
The author first rolled back to an older 3.13 kernel, which reproduced the same 100% CPU usage, proving the issue was not specific to the newest kernel version.
2. Understanding how top calculates CPU usage
By enabling CONFIG_PROFILING and using readprofile, the author observed that default_idle_call consumed most of the time, indicating the CPU was actually idle.
Investigation of /proc/stat showed that top reads several fields (user, nice, system, idle, iowait, irq, softirq, etc.) and computes percentages from the differences between successive reads.
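The calculation described above can be sketched in a few lines. This is a minimal illustration, not the actual source of top: the names parse_cpu_line and percentages are hypothetical, and the field order is assumed to follow the aggregate "cpu" line as documented in proc(5).

```python
# Field order of the aggregate "cpu" line in /proc/stat, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line: str) -> dict:
    """Turn a 'cpu ...' line from /proc/stat into a field -> ticks mapping."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:]]
    # Older kernels report fewer fields; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def percentages(before: dict, after: dict) -> dict:
    """Share of elapsed ticks spent in each state between two snapshots."""
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight fields contribute to the total.
    counted = FIELDS[:8]
    deltas = {f: after[f] - before[f] for f in counted}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots are identical or out of order")
    return {f: 100.0 * d / total for f, d in deltas.items()}
```

If the idle counter barely moves between two snapshots, its share of the total collapses toward 0% and every elapsed tick is attributed to the other states, which a tool built on this arithmetic displays as a fully busy CPU.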
Experiments on a desktop PC confirmed that the idle counter of an idle CPU increases by roughly 1000 units per 10 seconds (given a USER_HZ of 100), matching the expected behavior.
Running the same test on the Chumby showed almost no increase in the idle counter, explaining why top displayed 100% usage.
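The comparison between the two machines boils down to a ratio of observed to expected idle ticks. The following is a rough sketch of that arithmetic; the helper names are hypothetical, USER_HZ defaults to the value of 100 mentioned above, and on a real Linux system it can be queried with os.sysconf("SC_CLK_TCK").

```python
import os


def user_hz() -> int:
    """Ticks per second used by the /proc/stat counters (Unix only)."""
    return os.sysconf("SC_CLK_TCK")


def expected_idle_ticks(seconds: float, hz: int = 100, cpus: int = 1) -> float:
    """Idle ticks that fully idle CPUs would accumulate over the interval."""
    return seconds * hz * cpus


def idle_fraction(idle_before: int, idle_after: int, seconds: float,
                  hz: int = 100, cpus: int = 1) -> float:
    """Observed idle ticks as a fraction of the fully idle expectation."""
    return (idle_after - idle_before) / expected_idle_ticks(seconds, hz, cpus)
```

A fraction close to 1.0 corresponds to the healthy desktop measurement, while a fraction close to 0 on a machine that is doing nothing corresponds to the behavior seen on the Chumby.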
3. Applying the OLPC workaround
Research revealed that disabling CONFIG_NO_HZ (or adding nohz=off to the kernel command line) fixed the same problem on OLPC devices. Applying this option on the Chumby immediately brought the CPU usage reported by top down to the correct value.
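When testing a workaround like this, it helps to confirm that the parameter actually reached the running kernel. A small sketch, assuming the standard /proc/cmdline interface; the function names and the example command line in the comment are hypothetical.

```python
def nohz_disabled(cmdline: str) -> bool:
    """True if the kernel command line contains the nohz=off parameter."""
    return "nohz=off" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


# Example with an invented command line:
# nohz_disabled("console=ttyS0,115200 root=/dev/mmcblk0p2 nohz=off")
```

On the device itself, nohz_disabled(read_cmdline()) would report whether the workaround is active for the current boot.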
4. Tracing the root cause to a timer‑register read sequencing issue
The author traced the idle‑time calculation to get_cpu_idle_time_us, which ultimately calls timer_read in arch/arm/mach-mmp/time.c. The original implementation writes 1 to the CVWR register, loops for a fixed delay (100 iterations), then reads the register.
Documentation indicated that the timer value can be metastable, requiring either a double‑read verification or a CVWR capture with sufficient delay.
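The double‑read verification can be illustrated with a small simulation. This is not the kernel code: read_register is a hypothetical stand‑in for a raw hardware register read, and the sample values are invented. The idea is to accept a value only once two consecutive reads agree.

```python
from typing import Callable, Iterable


def read_stable(read_register: Callable[[], int], max_attempts: int = 8) -> int:
    """Re-read until two consecutive samples match, then return that value."""
    previous = read_register()
    for _ in range(max_attempts):
        current = read_register()
        if current == previous:
            return current
        previous = current
    raise RuntimeError("register value never stabilized")


def make_fake_register(samples: Iterable[int]) -> Callable[[], int]:
    """Simulated register that returns the given samples one per read."""
    it = iter(samples)
    return lambda: next(it)


# 999 plays the role of a metastable (corrupted) sample.
fake = make_fake_register([100, 999, 101, 101])
print(read_stable(fake))  # 101
```

This approach assumes the counter advances slowly compared with two back‑to‑back reads; otherwise consecutive samples would rarely match, and a capture register with an adequate delay is the alternative.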
Replacing the delayed read with a direct register read (__raw_readl(mmp_timer_base + TMR_CR(1))) made top report the correct idle time. Increasing the delay to about 300–500 iterations also restored correct behavior, confirming that the original delay was too short.
Further comparison with the 2.6.28 kernel showed it used timer 0 and performed a more robust delay, explaining why the older kernel behaved correctly.
5. The bug has existed since 2009
Historical Git analysis revealed the same buggy code was introduced when MMP support was added in 2009, with a FIXME comment noting the need for a longer delay.
The author submitted a patch in September 2022, which was eventually merged into Linux 6.2 and back‑ported to several 4.x/5.x kernels, eliminating the erroneous CPU‑usage reporting on the Chumby.
Overall, the article demonstrates a systematic approach to kernel debugging: reproducing the issue on older versions, profiling, reading procfs, tracing through kernel call stacks, and finally fixing a subtle hardware‑timer sequencing bug.
