Why Did My Data Platform’s CPU Spike to 98%? A Step‑by‑Step Debugging Guide
This article walks through a real‑world incident in which a data platform’s CPU usage surged to 98%. It shows how to pinpoint the high‑load process, trace the offending Java thread, uncover a bottleneck in a time‑utility method, and apply a concise fix that cut the load roughly thirtyfold.
1. Problem Background
Yesterday afternoon an operations alert reported that a data‑platform server’s CPU utilization had reached 98.94% and had been staying above 70% for a while. Although the system is not a high‑concurrency or CPU‑intensive application, such a high figure suggested a code‑level issue rather than a hardware bottleneck.
2. Investigation Approach
2.1 Identify High‑Load Process (PID)
Log in to the server and run top to observe the current load. Compared against the machine's 8 cores, the load average confirmed that the server was under heavy load, and process ID 682 was taking a large share of the CPU.
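Judging a load average against the core count is simple arithmetic. The following is a minimal Java sketch of that comparison, not part of the original investigation. The class name LoadCheck and the helper loadPerCore are hypothetical; the two JMX calls are standard java.lang.management methods.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, for illustration only.
final class LoadCheck {

    // Load average divided by core count. Values near or above 1.0
    // suggest the CPUs are saturated.
    static double loadPerCore(double loadAverage, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return loadAverage / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative if the platform cannot report it
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("Load average is not available on this platform");
            return;
        }
        System.out.printf("1-minute load %.2f on %d cores = %.2f per core%n",
                load, cores, loadPerCore(load, cores));
    }
}
```

On an 8‑core machine, a load average of 8 corresponds to 1.0 per core, which is roughly the point where runnable work starts to queue.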
2.2 Identify the Abnormal Business Process
Run pwdx with the PID to print the process's working directory, which reveals the responsible team and project. The process turned out to be the data platform’s web service.
2.3 Locate Abnormal Thread and Code Line
The traditional four‑step method is:
1. top // press P to sort by CPU usage and find the busiest process (PID)
2. top -Hp <pid> // list that process's threads and find the busiest thread ID (TID)
3. printf "0x%x\n" <tid> // convert the thread ID to hexadecimal
4. jstack <pid> | vim +/0x... - // search the jstack output for the hex thread ID, which appears as nid=0x...
Because this is time‑critical in production, a script, show-busy-java-threads.sh (originally shared by a colleague), automates these steps.
The investigation revealed that a time‑utility method was consuming excessive CPU.
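For reference, the last two manual steps (hex conversion and searching the thread dump) can be sketched in Java as follows. This is an illustration under stated assumptions, not the show-busy-java-threads.sh script: the class JstackNidLookup and its methods are hypothetical, and the sample dump in main is made up. It relies only on the fact that jstack prints each thread's native ID as a lowercase hex nid=0x... field in the thread's header line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class JstackNidLookup {

    // Step 3: build the "nid=0x..." token jstack uses for a native thread ID.
    static String toNid(long threadId) {
        return "nid=0x" + Long.toHexString(threadId);
    }

    // True if the line contains the token and it is not a prefix of a longer ID
    // (so nid=0x43 does not match nid=0x431).
    static boolean hasNid(String line, String nid) {
        int start = line.indexOf(nid);
        if (start < 0) {
            return false;
        }
        int end = start + nid.length();
        return end == line.length() || !Character.isLetterOrDigit(line.charAt(end));
    }

    // Step 4: return the header and stack frames of the matching thread.
    // Thread entries in a jstack dump are separated by blank lines.
    static List<String> findThreadBlock(String dump, long threadId) {
        String nid = toNid(threadId);
        List<String> block = new ArrayList<>();
        boolean inBlock = false;
        for (String line : dump.split("\\R")) {
            if (!inBlock) {
                if (hasNid(line, nid)) {
                    inBlock = true;
                    block.add(line);
                }
            } else if (line.trim().isEmpty()) {
                break; // end of this thread's entry
            } else {
                block.add(line);
            }
        }
        return block;
    }

    public static void main(String[] args) {
        // Made-up sample dump; real jstack output has more fields per header.
        String dump = "\"worker-1\" #11 prio=5 nid=0x43 waiting\n"
                + "\tat demo.Idle.park(Idle.java:10)\n"
                + "\n"
                + "\"worker-2\" #12 prio=5 nid=0x431 runnable\n"
                + "\tat demo.TimeUtil.format(TimeUtil.java:42)\n";
        findThreadBlock(dump, 1073).forEach(System.out::println);
    }
}
```

Matching on the full nid token rather than the bare hex value avoids false hits on memory addresses that happen to contain the same digits.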
3. Root Cause Analysis
The offending method converts timestamps to formatted date‑time strings. To obtain the number of seconds from midnight to the current time, the code calls it once for every elapsed second and then uses only the size of the resulting collection. In a real‑time reporting query this computation runs n times, so at 10 AM each query performs 10 × 3,600 = 36,000 × n conversions. The count grows linearly through the day toward 86,400 × n just before midnight, wasting a large amount of CPU.
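The original code is not shown, so the following Java sketch is only a hypothetical reconstruction of the pattern described: one formatted string per elapsed second, with nothing but the collection size used afterward. The class name, method name, and date pattern are all assumptions.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of the wasteful pattern, for illustration only.
final class SlowSecondsSinceMidnight {

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Formats every second between midnight and now, then discards
    // everything except the count. Cost grows with the time of day.
    static int countBySeconds(long midnightEpochSec, long nowEpochSec, ZoneId zone) {
        List<String> formatted = new ArrayList<>();
        for (long t = midnightEpochSec; t < nowEpochSec; t++) {
            LocalDateTime time = LocalDateTime.ofInstant(Instant.ofEpochSecond(t), zone);
            formatted.add(time.format(FORMAT)); // one conversion per elapsed second
        }
        return formatted.size();
    }

    public static void main(String[] args) {
        // 10 hours after midnight: 10 * 3600 = 36,000 conversions for a single number.
        long midnight = 0L;
        long tenAm = 10 * 3600L;
        System.out.println(countBySeconds(midnight, tenAm, ZoneOffset.UTC));
    }
}
```

Each call allocates tens of thousands of strings just to return a value that is already known from the two input timestamps, which is why the cost multiplies quickly when a query repeats it n times.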
4. Solution
After pinpointing the issue, the calculation was simplified: instead of converting each timestamp, compute current_seconds - midnight_seconds directly and replace the original method calls. After deployment, CPU load dropped by about 30× and returned to normal levels.
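A minimal Java sketch of the direct subtraction is shown here. The class and method names are hypothetical, and the actual fix in the incident may have been written differently; the java.time calls themselves are standard.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper, for illustration only.
final class SecondsSinceMidnight {

    // current_seconds - midnight_seconds, computed in constant time.
    static long secondsSinceMidnight(ZonedDateTime now) {
        ZonedDateTime midnight = now.toLocalDate().atStartOfDay(now.getZone());
        return now.toEpochSecond() - midnight.toEpochSecond();
    }

    public static void main(String[] args) {
        ZonedDateTime tenAm = ZonedDateTime.of(2024, 1, 15, 10, 0, 0, 0, ZoneOffset.UTC);
        System.out.println(secondsSinceMidnight(tenAm)); // 10 * 3600 seconds
        System.out.println(secondsSinceMidnight(ZonedDateTime.now()));
    }
}
```

The work no longer depends on the time of day: two epoch values and one subtraction replace tens of thousands of conversions per call. In zones with daylight saving time, the result is elapsed seconds since the start of the day, which can differ from the wall‑clock reading on transition days.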
5. Summary
1) Code performance matters as much as functional correctness; efficient, elegant implementations are a hallmark of strong engineers.
2) Conduct thorough code reviews and continuously seek better implementations.
3) Never overlook small details in production incidents; a meticulous, inquisitive mindset drives technical excellence.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
