
Why CPU Monitoring Shows 0% or 100% Spikes and How Hot Patches Fixed It

The article investigates intermittent CPU usage spikes on Linux servers caused by a kernel cputime bug, explains the root‑cause analysis, describes a cold patch applied to newer kernels, and details a hot‑patch solution that safely resolves the issue across thousands of production machines.

UCloud Tech

Problem Phenomenon

Operations staff occasionally saw hosts where monitoring showed the Redis process alternating between 0% and 100% CPU usage, and top sometimes reported 300%. The problem was rare, affecting only a few machines out of tens of thousands, which made it hard to catch.

Problem Analysis

Both anomalies trace back to the same data source: the utime and stime fields of /proc/<pid>/stat. On affected processes these fields stopped updating for minutes at a time and then jumped by a large amount, whereas on a healthy process they advance every few seconds.
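To make the symptom concrete, here is a minimal Python sketch of how a monitor might turn those two fields into a CPU percentage. The function names, the sample line, and the default of 100 clock ticks per second are assumptions for illustration; on a real host the tick rate would come from os.sysconf("SC_CLK_TCK").

```python
def parse_cpu_ticks(stat_line: str) -> tuple[int, int]:
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # comm (field 2) may contain spaces or parentheses, so split after the
    # last ')' instead of splitting the whole line on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]), int(rest[12])


def cpu_percent(prev_ticks: int, cur_ticks: int, interval_s: float, hz: int = 100) -> float:
    """CPU usage over one sampling interval, as a monitor would compute it."""
    return 100.0 * (cur_ticks - prev_ticks) / (hz * interval_s)


# Made-up sample line for illustration (pid 1234, utime=1500, stime=700).
sample = "1234 (redis-server) S 1 1234 1234 0 -1 4194560 500 0 0 0 1500 700 0 0 20 0 4 0 100"
utime, stime = parse_cpu_ticks(sample)

# Counters frozen between two samples taken 5 s apart: the monitor sees 0%.
stalled = cpu_percent(utime + stime, utime + stime, 5.0)
# Then 1500 ticks (15 s of CPU time at 100 Hz) land in a single 5 s window.
burst = cpu_percent(utime + stime, utime + stime + 1500, 5.0)
print(stalled, burst)
```

With these made-up numbers the frozen interval reads as 0% and the catch-up interval as 300%, even though the process may have been running at a steady rate throughout.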

After eliminating monitoring logic, I/O load, and call‑path bottlenecks, the root cause was identified as a bug in the 4.1 Linux kernel’s CPU time accounting.

cputime Statistics Logic

In cputime_adjust(), if the utime + stime already reported is greater than or equal to rtime, the function returns without updating either value. With the kernel option CONFIG_VIRT_CPU_ACCOUNTING_GEN, utime and stime each increase monotonically, while rtime (runtime) reflects the CPU time the scheduler actually granted the process. Once utime + stime overshoots rtime, the reported values freeze until rtime catches up, and a monitor sampling during that window sees no change and reports 0%.
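The interaction between the early exit and the independent monotonic updates can be illustrated with a small Python model. This is a sketch of the behaviour described above, not the kernel source: the Prev class, the function name, the proportional scaling step, and the sample numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Prev:
    """Values already exported to userspace (hypothetical stand-in)."""
    utime: int = 0
    stime: int = 0


def adjust_old(prev: Prev, cur_utime: int, cur_stime: int, rtime: int) -> tuple[int, int]:
    """Model of the pre-fix behaviour; returns the (utime, stime) a reader sees."""
    # Early exit: nothing is updated while the exported sum is >= rtime.
    if prev.utime + prev.stime >= rtime:
        return prev.utime, prev.stime
    total = cur_utime + cur_stime
    # Assumed scaling step: split rtime by the tick-based stime/utime ratio.
    stime = rtime * cur_stime // total if total else 0
    utime = rtime - stime
    # Each value only moves forward, independently of the other, so the
    # sum is free to end up above rtime.
    prev.utime = max(prev.utime, utime)
    prev.stime = max(prev.stime, stime)
    return prev.utime, prev.stime


prev = Prev()
# (tick-based utime, tick-based stime, rtime) for four successive reads.
samples = [(10, 0, 100), (10, 10, 200), (10, 30, 300), (20, 30, 320)]
sums = [sum(adjust_old(prev, u, s, r)) for u, s, r in samples]
print(sums)
```

In this run the third read exports a sum of 325 against an rtime of 300, because stime jumps while utime refuses to go backwards. The fourth read then hits the early exit and returns the same frozen values.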

Cold Patch

A patch to kernel/sched/cputime.c (available from kernel 4.3 onward) guarantees that stime + utime = rtime. The patch was back-ported to the 4.1 kernel, which eliminated the 0%/100% swing and produced smooth CPU usage values.
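One way to picture the invariant is the following Python model, which keeps both values monotonic while forcing their sum to equal rtime. It is a simplified sketch under assumptions, not the patch itself: the names, the scaling step, and the sample numbers are illustrative, and the locking that a concurrent kernel implementation needs is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Prev:
    """Values already exported to userspace (hypothetical stand-in)."""
    utime: int = 0
    stime: int = 0


def adjust_fixed(prev: Prev, cur_utime: int, cur_stime: int, rtime: int) -> tuple[int, int]:
    """Model of an update that keeps utime + stime == rtime."""
    if prev.utime + prev.stime >= rtime:
        return prev.utime, prev.stime
    total = cur_utime + cur_stime
    # Assumed scaling step: split rtime by the tick-based stime/utime ratio.
    stime = rtime * cur_stime // total if total else rtime
    # stime must not go backwards; utime is derived so the sum is rtime.
    stime = max(stime, prev.stime)
    utime = rtime - stime
    # utime must not go backwards either; give the remainder to stime.
    if utime < prev.utime:
        utime = prev.utime
        stime = rtime - utime
    prev.utime, prev.stime = utime, stime
    return utime, stime


prev = Prev()
# Same illustrative inputs: (tick-based utime, tick-based stime, rtime).
samples = [(10, 0, 100), (10, 10, 200), (10, 30, 300), (20, 30, 320)]
results = [adjust_fixed(prev, u, s, r) for u, s, r in samples]
print(results)
```

Because rtime is strictly larger than the previous sum whenever an update happens, clamping one value and deriving the other from rtime never pushes either of them backwards.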

Hot Patch

Because many existing servers could not be upgraded, a hot patch was created to change the cputime calculation without rebooting. The hot patch adds a spinlock to the relevant structure, handles allocation of that new member for instances that already exist, and intercepts the code paths that use it.

Extensive load/unload testing (millions of cycles) showed no memory leaks and stable operation.

Verification

Verification involved three steps:

Stability: Deploy the hot‑patch on a few machines, then on 500 important machines for several days.

Correctness: On a problematic machine, compare utime+stime with rtime. When rtime exceeds the sum, the new hot‑patch logic runs, and the values stay synchronized.

Full rollout: Gradually apply the hot‑patch across the production fleet, confirming the issue is fully resolved.
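The article does not say how the correctness comparison was performed. One userspace approach, sketched below in Python, assumes the first field of /proc/<pid>/schedstat (time spent on the CPU, in nanoseconds) is an acceptable stand-in for rtime. The function names and sample values are hypothetical, and the tick conversion here may round differently from the kernel's.

```python
import os


def drift_ticks(stat_line: str, schedstat_line: str, hz: int = 100) -> int:
    """Exported utime + stime minus rtime, all in clock ticks."""
    # Split after the last ')' because comm may contain spaces.
    rest = stat_line.rsplit(")", 1)[1].split()
    reported = int(rest[11]) + int(rest[12])  # fields 14 (utime) and 15 (stime)
    runtime_ns = int(schedstat_line.split()[0])  # on-CPU time in nanoseconds
    rtime = runtime_ns * hz // 1_000_000_000
    return reported - rtime


def read_drift(pid: int) -> int:
    """Read both files for a live process (Linux only)."""
    hz = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as f:
        stat_line = f.read()
    with open(f"/proc/{pid}/schedstat") as f:
        schedstat_line = f.read()
    return drift_ticks(stat_line, schedstat_line, hz)


# Made-up values: utime=1500, stime=700, and 21 s of on-CPU time.
stat = "1234 (redis-server) S 1 1234 1234 0 -1 4194560 500 0 0 0 1500 700 0 0 20 0 4 0 100"
drift = drift_ticks(stat, "21000000000 5000 100")
print(drift)
```

A clearly positive result that persists across samples would indicate the exported sum is ahead of rtime, which is the frozen state described earlier. With the fix in place the difference would be expected to stay near zero.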

Summary

The cputime accounting bug caused misleading CPU usage spikes, potentially leading to unnecessary resource scaling and confusion for developers and operators. By applying a back‑ported cold patch and a carefully engineered hot‑patch, the issue was permanently fixed without downtime, demonstrating a practical approach to kernel‑level troubleshooting and live patching.

Written by

UCloud Tech

UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
