
How to Quickly Diagnose and Fix High CPU Usage in a Data Platform

This guide walks through a real‑world incident in which a data platform's CPU utilization spiked to 98.94%. It shows step by step how to identify the high‑load process, pinpoint the offending Java thread, trace the root cause to the time‑utility code, and deploy a fix that cut CPU load by a factor of 30.

MaGe Linux Operations

1. Problem Background

Yesterday afternoon we received an ops alert: CPU utilization on the data platform server had reached 98.94% and had been above 70% for some time. The business system is neither high‑concurrency nor CPU‑intensive, so the high usage most likely stems from problematic business code rather than hardware limits.

2. Investigation Steps

2.1 Identify High‑Load Process (pid)

Log in to the server and run top to see the current load average and per‑process CPU usage.

The server has 8 cores, so a load average approaching or exceeding 8 confirms that the machine is under high load.
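The comparison between load average and core count can be sketched in Java as below. This is only an illustration: the class name LoadCheck and the helper loadPerCore are hypothetical and not part of the original incident. The JDK's OperatingSystemMXBean supplies the 1‑minute load average and the processor count.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper (not from the article): relates load average to core count.
class LoadCheck {
    // Load per core; values near or above 1.0 suggest the CPUs are saturated.
    static double loadPerCore(double loadAverage, int cores) {
        return loadAverage / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative when the platform cannot report it
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("load average not available on this platform");
        } else {
            System.out.printf("load=%.2f cores=%d perCore=%.2f%n",
                    load, cores, loadPerCore(load, cores));
        }
    }
}
```

On an 8‑core server, a load average of 8 gives a per‑core value of 1.0, which is the threshold the check above uses informally.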

The process with PID 682 accounts for a large share of the CPU.

2.2 Locate the Specific Business Service

Run pwdx <PID> to print the process's working directory, then use that path to identify the project and its owner.

The process corresponds to the data platform's web service.

2.3 Find the Abnormal Thread and Code Line

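Traditional four‑step method:

top -o %CPU – sort processes by CPU usage and note the PID with the highest load.

top -Hp <PID> – list that process's threads and note the thread ID with the highest load.

printf "0x%x\n" <threadPID> – convert the thread ID to hexadecimal, the form jstack prints as nid.

jstack <processPID> | vim +/0x<hexPID> - – open the thread dump in vim at the matching stack trace.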
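Steps three and four can also be sketched in Java, as a minimal illustration rather than a replacement for the commands above. The class JstackLookup and its methods are hypothetical names. The sketch assumes the thread dump has already been captured as text and that it uses the hexadecimal nid=0x... format the steps above rely on.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper (not from the article): finds a thread in captured jstack output.
class JstackLookup {
    // top -Hp shows the thread ID in decimal; the dump is assumed to show it as nid=0x<hex>.
    static String toNid(long threadPid) {
        return "nid=0x" + Long.toHexString(threadPid);
    }

    // Returns the first line of the dump that carries the given native thread ID.
    static Optional<String> findThreadHeader(String jstackDump, long threadPid) {
        // The trailing \b keeps 0x2a from matching inside 0x2aa.
        Pattern nid = Pattern.compile(Pattern.quote(toNid(threadPid)) + "\\b");
        return Arrays.stream(jstackDump.split("\\R"))
                .filter(line -> nid.matcher(line).find())
                .findFirst();
    }

    public static void main(String[] args) {
        // Made-up two-line dump used only to demonstrate the lookup.
        String dump = "\"worker-1\" #12 prio=5 os_prio=0 tid=0x00007f0001 nid=0x2aa runnable\n"
                + "\"worker-2\" #13 prio=5 os_prio=0 tid=0x00007f0002 nid=0x2ab runnable\n";
        System.out.println(findThreadHeader(dump, 683).orElse("thread not found"));
    }
}
```

In a full dump, the stack frames for the thread follow directly beneath the header line this lookup returns.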

To speed up online troubleshooting, the four steps can be wrapped into a script such as show-busy-java-threads.sh.

The analysis points to a time‑utility method that consumes excessive CPU.

For urgent issues you may skip 2.1 and 2.2 and go directly to 2.3.

3. Root Cause Analysis

The high CPU usage is caused by a time‑utility class that converts timestamps to formatted date strings. The upper‑level code repeatedly calls this method for every second from midnight to the current time, leading to millions of calls per query, especially as the day progresses.

Problematic method logic: converts a timestamp to a specific date‑time format.

Upper‑level call: calculates all seconds from midnight to now, formats each, and stores them in a set.

Logic layer: used by the data platform's real‑time report queries, which invoke the method many times per request.

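Thus a single query at 10 am performs 10 × 60 × 60 × n = 36,000 × n calls, where n is the number of times the query invokes the method. The count grows linearly through the day, reaching 86,400 × n just before midnight, which exhausts CPU resources.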
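The pattern described above might look roughly like the following. This is a hypothetical reconstruction for illustration only: the original code is not shown in the article, and the class name SlowTimeUtil, the method name, the date pattern, and the use of java.time are all assumptions.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical reconstruction of the expensive pattern; not the original code.
class SlowTimeUtil {
    // Assumed output format; the article does not state the real one.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Formats every second from midnight (inclusive) up to `now` (exclusive).
    static Set<String> formattedSecondsSinceMidnight(LocalDateTime now) {
        LocalDateTime cursor = now.toLocalDate().atStartOfDay();
        Set<String> formatted = new LinkedHashSet<>();
        while (cursor.isBefore(now)) {
            formatted.add(cursor.format(FORMAT)); // one formatting call per elapsed second
            cursor = cursor.plusSeconds(1);
        }
        return formatted;
    }

    public static void main(String[] args) {
        LocalDateTime tenAm = LocalDateTime.of(2024, 1, 1, 10, 0, 0);
        // At 10:00:00 the loop performs 10 * 60 * 60 formatting calls.
        System.out.println(formattedSecondsSinceMidnight(tenAm).size());
    }
}
```

Each invocation does work proportional to the seconds elapsed since midnight, so a request that calls it n times pays that cost n times over.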

4. Solution

After locating the issue, we simplified the method to cut the number of calculations. Instead of formatting every second, the new implementation subtracts the midnight timestamp (in seconds) from the current timestamp (in seconds) and uses that value directly. Once it replaced the old method and was deployed, CPU load dropped by a factor of 30 and returned to normal.
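A minimal sketch of the subtraction approach, assuming the caller only needs the number of seconds elapsed since midnight. The class FastTimeUtil and its method name are hypothetical, and the real fix may differ in detail.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical sketch of the constant-time replacement; not the original code.
class FastTimeUtil {
    // Current epoch second minus the epoch second of local midnight: one subtraction, no loop.
    static long secondsSinceMidnight(ZonedDateTime now) {
        long midnight = now.toLocalDate().atStartOfDay(now.getZone()).toEpochSecond();
        return now.toEpochSecond() - midnight;
    }

    public static void main(String[] args) {
        ZonedDateTime tenAm = ZonedDateTime.of(2024, 1, 1, 10, 0, 0, 0, ZoneOffset.UTC);
        System.out.println(secondsSinceMidnight(tenAm)); // 10 * 60 * 60 seconds
    }
}
```

The cost no longer depends on the time of day, so the per‑request work stays flat from midnight to midnight.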

5. Summary

Code should be both functional and performance‑optimized; efficient implementations reflect higher engineering competence.

Conduct thorough code reviews and consider better alternatives.

Never ignore small details in production issues; a meticulous mindset drives continuous growth.
