How to Quickly Diagnose and Fix High CPU Usage in a Data Platform
This guide walks through a real‑world incident in which a data platform’s CPU usage spiked to 98.94%. It shows step by step how to identify the high‑load process, pinpoint the offending Java thread, trace the root cause to the time‑utility code, and apply a performance‑focused fix that cut the load thirtyfold.
1. Problem Background
Yesterday afternoon we received an ops alert showing that CPU utilization on the data platform server had reached 98.94% and had stayed above 70% for some time. The business system is neither high‑concurrency nor CPU‑intensive, so the high usage most likely stems from problematic business code rather than hardware limits.
2. Investigation Steps
2.1 Identify High‑Load Process (pid)
Log in to the server and run top to inspect the current state.
Comparing the load average against the 8‑core baseline confirms that the load is high.
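Load average is easier to judge once it is divided by the core count. The following is a minimal Java sketch of that comparison, not part of the original investigation: the class name LoadCheck and the helper loadPerCore are hypothetical, and treating a per‑core value around 1.0 as "saturated" is only a common rule of thumb.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

class LoadCheck {
    /** Load average divided by core count; values near or above 1.0 suggest CPU saturation. */
    static double loadPerCore(double loadAverage, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return loadAverage / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative when the platform cannot report it
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("Load average is not available on this platform");
        } else {
            System.out.printf("load=%.2f cores=%d perCore=%.2f%n",
                    load, cores, loadPerCore(load, cores));
        }
    }
}
```

On an 8‑core host, a 1‑minute load average of 8.0 gives a per‑core value of 1.0, which is roughly the point where runnable tasks start to queue.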
Process with PID 682 shows a high CPU share.
2.2 Locate the Specific Business Service
Run pwdx <PID> to print the process's working directory, then use that path to identify the owner and project.
The process corresponds to the data platform's web service.
2.3 Find the Abnormal Thread and Code Line
Traditional four‑step method:
1) Find the PID with the highest load, sorted by CPU usage: top -o %CPU (or press P inside top)
2) Find the busiest thread ID inside that process: top -Hp <PID>
3) Convert the thread ID to hex, the form jstack uses for its nid field: printf "0x%x\n" <threadID>
4) Open the thread dump and jump to that thread's stack trace: jstack <PID> | vim +/<hexThreadID> -
To speed up online troubleshooting, the four steps can be wrapped into a script, show-busy-java-threads.sh.
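The hex conversion and dump lookup in the last two steps can also be sketched in Java. This is an illustrative sketch only: the class JstackThreadFinder and its methods are hypothetical, the sample dump is made up, and it assumes the dump prints native thread IDs as nid=0x<hex>, which is what the printf step implies.

```java
import java.util.ArrayList;
import java.util.List;

class JstackThreadFinder {
    /** Formats a decimal OS thread ID the same way printf "0x%x" does. */
    static String toNid(long osThreadId) {
        return "0x" + Long.toHexString(osThreadId);
    }

    /** True if the line contains the marker as a whole token (not as a prefix of a longer ID). */
    static boolean hasNid(String line, String marker) {
        int start = line.indexOf(marker);
        if (start < 0) {
            return false;
        }
        int end = start + marker.length();
        return end == line.length() || Character.isWhitespace(line.charAt(end));
    }

    /** Returns the header and stack lines of the first thread block matching the thread ID. */
    static List<String> findThreadBlock(String dump, long osThreadId) {
        String marker = "nid=" + toNid(osThreadId);
        List<String> block = new ArrayList<>();
        boolean inBlock = false;
        for (String line : dump.split("\\R")) {
            if (!inBlock) {
                if (hasNid(line, marker)) {
                    inBlock = true;
                    block.add(line);
                }
            } else if (line.trim().isEmpty()) {
                break; // a blank line ends the thread's block
            } else {
                block.add(line);
            }
        }
        return block;
    }

    public static void main(String[] args) {
        // Made-up dump fragment for illustration; real jstack output has more fields.
        String dump = "\"worker-1\" #12 prio=5 tid=0x00007f01 nid=0x2a34 runnable\n"
                + "   java.lang.Thread.State: RUNNABLE\n"
                + "\tat com.example.TimeUtil.format(TimeUtil.java:10)\n"
                + "\n"
                + "\"worker-2\" #13 prio=5 tid=0x00007f02 nid=0x2a3 waiting\n";
        for (String line : findThreadBlock(dump, 10804)) { // 10804 decimal is 0x2a34
            System.out.println(line);
        }
    }
}
```

The whole‑token check matters because a short ID such as 0x2a3 is a prefix of 0x2a34, and a plain substring search would jump to the wrong thread.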
The analysis points to a time‑utility method that consumes excessive CPU.
For urgent issues you may skip 2.1 and 2.2 and go directly to 2.3.
3. Root Cause Analysis
The high CPU usage is caused by a time‑utility class that converts timestamps to formatted date strings. The upper‑level code repeatedly calls this method for every second from midnight to the current time, leading to millions of calls per query, especially as the day progresses.
Problematic method logic: converts a timestamp to a specific date‑time format.
Upper‑level call: calculates all seconds from midnight to now, formats each, and stores them in a set.
Logic layer: used by the data platform's real‑time report queries, which invoke the method many times per request.
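Thus a single query at 10 am performs 10 × 60 × 60 × n calls (36 000 × n), where n is the number of times the request invokes the method. The count grows linearly through the day, approaching 24 × 60 × 60 × n (86 400 × n) just before midnight, which exhausts CPU resources.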
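The pattern described above can be reproduced with a small Java sketch. The original code is not shown in this article, so everything here is an assumption made for illustration: the class and method names, the use of java.time, and the date pattern.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

class PerSecondFormatting {
    // Assumed output pattern; the real utility's format is not given in the article.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats every second from midnight up to (but excluding) 'now' and collects the strings. */
    static Set<String> formatAllSecondsSinceMidnight(LocalDateTime now) {
        LocalDateTime midnight = now.toLocalDate().atStartOfDay();
        Set<String> formatted = new HashSet<>();
        for (LocalDateTime t = midnight; t.isBefore(now); t = t.plusSeconds(1)) {
            formatted.add(FMT.format(t)); // one formatting call per elapsed second
        }
        return formatted;
    }

    public static void main(String[] args) {
        LocalDateTime tenAm = LocalDateTime.of(2024, 1, 1, 10, 0, 0);
        // 10 h x 3600 s/h = 36000 formatted strings for a single invocation
        System.out.println(formatAllSecondsSinceMidnight(tenAm).size());
    }
}
```

Each invocation does work proportional to the seconds elapsed since midnight, so a request that invokes it n times pays that cost n times over.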
4. Solution
After locating the issue, we reduced the number of calculations by simplifying the method. Instead of formatting every second, we subtract the midnight timestamp (in seconds) from the current timestamp and use the difference directly. The new implementation replaced the old method, and after deployment CPU load dropped by a factor of 30 and returned to normal.
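A minimal Java sketch of the "current second minus midnight second" idea follows. It is not the deployed code: the class and method names are hypothetical, and using java.time with the caller's time zone is an assumption.

```java
import java.time.ZonedDateTime;

class SecondsSinceMidnight {
    /** Seconds elapsed since local midnight, computed with one subtraction instead of a loop. */
    static long secondsSinceMidnight(ZonedDateTime now) {
        long midnightEpochSecond = now.toLocalDate()
                .atStartOfDay(now.getZone())
                .toEpochSecond();
        return now.toEpochSecond() - midnightEpochSecond;
    }

    public static void main(String[] args) {
        System.out.println(secondsSinceMidnight(ZonedDateTime.now()));
    }
}
```

The cost is constant regardless of the time of day, whereas the per‑second approach grows with every second that passes after midnight.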
5. Summary
Code should be both functional and performance‑optimized; efficient implementations reflect higher engineering competence.
Conduct thorough code reviews and consider better alternatives.
Never ignore small details in production issues; a meticulous mindset drives continuous growth.
MaGe Linux Operations
