
Why Did My Data Platform’s CPU Spike to 98%? A Step‑by‑Step Debugging Guide

This article walks through a real‑world incident in which a data platform’s CPU usage surged to 98%. It shows how to pinpoint the high‑load process, trace the offending Java thread, uncover a bottleneck in a time‑utility method, and apply a concise fix that cut CPU load roughly thirtyfold.

Efficient Ops

Mar 12, 2018


1. Problem Background

Yesterday afternoon an operations alert reported that a data‑platform server’s CPU utilization had reached 98.94% and had been staying above 70% for a while. Although the system is not a high‑concurrency or CPU‑intensive application, such a high figure suggested a code‑level issue rather than a hardware bottleneck.

2. Investigation Approach

2.1 Identify High‑Load Process (PID)

Log in to the server and run top to observe the current load. Judged against the machine’s 8 cores, the load average confirmed that the server was under heavy load, and process ID 682 was taking a large share of the CPU.
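To make the "load average versus core count" comparison concrete, here is a minimal sketch that divides the load average by the number of cores. The class and method names are hypothetical, and the rule of thumb that a per-core value near or above 1.0 indicates saturation is an assumption of this sketch, not a figure from the incident.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class LoadCheck {
    /** Load average divided by core count; values near or above 1.0 suggest saturation. */
    static double loadPerCore(double loadAverage, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return loadAverage / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative when the platform does not report it
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("Load average is not available on this platform");
            return;
        }
        System.out.printf("load=%.2f cores=%d perCore=%.2f%n", load, cores, loadPerCore(load, cores));
    }
}
```

For example, a 1‑minute load average of 8.0 on an 8‑core machine gives a per‑core value of 1.0.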

2.2 Identify the Abnormal Business Process

Run pwdx with the PID to print the process’s working directory, which reveals the responsible team and project. The process was identified as the data platform’s web service.
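On Linux the same information is exposed as the /proc/<pid>/cwd symlink, so the lookup can also be sketched in Java. This is an illustrative equivalent rather than part of the original investigation; the class name is hypothetical, and the method simply returns an empty result when the PID does not exist, permissions are insufficient, or the platform has no /proc.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

final class ProcessDirLookup {
    /** Resolves /proc/<pid>/cwd, the symlink to a process's working directory on Linux. */
    static Optional<Path> workingDirectory(long pid) {
        Path link = Paths.get("/proc", Long.toString(pid), "cwd");
        try {
            return Optional.of(Files.readSymbolicLink(link));
        } catch (IOException | UnsupportedOperationException | SecurityException e) {
            // Missing PID, insufficient permissions, or a platform without /proc.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Defaults to the current JVM's PID when no argument is given.
        long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
        String dir = workingDirectory(pid).map(Path::toString).orElse("(unavailable)");
        System.out.println(pid + ": " + dir);
    }
}
```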

2.3 Locate Abnormal Thread and Code Line

The traditional four‑step method is:

1. top -o %CPU // sort processes by CPU usage to find the busiest one, maxLoad(pid)

2. top -Hp <pid> // find the heavy thread PID

3. printf "0x%x\n" <thread_pid> // convert thread PID to hex

4. jstack <pid> | vim +/0x... - // open the thread dump in vim (the trailing - reads from stdin) and jump to the hex thread PID, which jstack prints as nid=0x...
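Steps 3 and 4 can be sketched in Java as follows: convert the decimal thread PID to the hex form that jstack prints in its nid field, then find the matching thread header in the dump. The class name, the thread names, and the sample dump lines are invented for illustration; only the nid=0x... convention comes from jstack itself.

```java
import java.util.List;
import java.util.Optional;

final class JstackLookup {
    /** Formats a decimal OS thread ID the way jstack prints it in the nid field. */
    static String toNid(long osThreadId) {
        return "nid=0x" + Long.toHexString(osThreadId);
    }

    /** Returns the first thread header line whose nid field matches the given OS thread ID. */
    static Optional<String> findThread(List<String> dumpLines, long osThreadId) {
        // The trailing space avoids prefix matches such as 0x2ab against 0x2abc.
        String needle = toNid(osThreadId) + " ";
        return dumpLines.stream().filter(line -> line.contains(needle)).findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical jstack header lines, invented for illustration only.
        List<String> dump = List.of(
            "\"main\" #1 prio=5 os_prio=0 tid=0x00007f0000000001 nid=0x2abc waiting on condition",
            "\"report-worker\" #21 prio=5 os_prio=0 tid=0x00007f0000000002 nid=0x2ab runnable");
        System.out.println(toNid(683)); // prints nid=0x2ab
        System.out.println(findThread(dump, 683).orElse("not found"));
    }
}
```

The stack frames listed under the matching header show which code the busy thread is executing.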

Because this is time‑critical in production, a script show-busy-java-threads.sh (originally shared by a colleague) automates these steps.

The investigation revealed that a time‑utility method was consuming excessive CPU.

3. Root Cause Analysis

The offending method converts timestamps to formatted date‑time strings. To obtain the number of seconds elapsed since midnight, it formats every second from midnight to the current time and collects the results, yet the caller uses only the size of that result set. A real‑time reporting query invokes the method n times, so at 10 AM (10 × 3,600 = 36,000 seconds after midnight) each query performs 36,000 × n conversions. The count grows linearly through the day until midnight, wasting a large amount of CPU.
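The original code is not shown in this article, so the following is only a hypothetical reconstruction of the pattern described: one formatted string per second since midnight, collected into a set whose size is the only thing the caller needs. The class name, method name, and date format are assumptions.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

final class SlowSecondsSinceMidnight {
    // Assumed format; the article does not state which pattern the real method used.
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats every second from midnight up to (but excluding) now and collects the strings. */
    static Set<String> formattedSecondsSinceMidnight(LocalDateTime now) {
        LocalDateTime midnight = now.toLocalDate().atStartOfDay();
        Set<String> result = new HashSet<>();
        for (LocalDateTime t = midnight; t.isBefore(now); t = t.plusSeconds(1)) {
            result.add(FMT.format(t)); // one string conversion per elapsed second
        }
        return result;
    }

    public static void main(String[] args) {
        LocalDateTime tenAm = LocalDateTime.of(2018, 3, 12, 10, 0, 0);
        // The caller only needs the count, yet 36,000 strings are built to get it.
        System.out.println(formattedSecondsSinceMidnight(tenAm).size()); // prints 36000
    }
}
```

Each call allocates and formats tens of thousands of strings just to produce a single number, and the work is repeated for every one of the n invocations in a query.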

4. Solution

After pinpointing the issue, the calculation was simplified: instead of converting each timestamp, compute current_seconds - midnight_seconds directly and replace the original method calls. After deployment, CPU load dropped by about 30× and returned to normal levels.
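A minimal sketch of the simplified calculation, assuming java.time is available; the class and method names are hypothetical and the actual fix may have been written differently. It subtracts the epoch second of local midnight from the current epoch second, which takes constant time regardless of the hour.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

final class SecondsSinceMidnight {
    /** Computes current_seconds - midnight_seconds directly, with no per-second formatting. */
    static long secondsSinceMidnight(ZonedDateTime now) {
        long currentSeconds = now.toEpochSecond();
        long midnightSeconds = now.toLocalDate().atStartOfDay(now.getZone()).toEpochSecond();
        return currentSeconds - midnightSeconds;
    }

    public static void main(String[] args) {
        // UTC is used here only to keep the example deterministic.
        ZonedDateTime tenAm = ZonedDateTime.of(2018, 3, 12, 10, 0, 0, 0, ZoneOffset.UTC);
        System.out.println(secondsSinceMidnight(tenAm)); // prints 36000
    }
}
```

The cost no longer grows as the day progresses: the computation is two epoch‑second lookups and one subtraction whether it runs at 10 AM or just before midnight.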

5. Summary

1) Code performance matters as much as functional correctness; efficient, elegant implementations are a hallmark of strong engineers.

2) Conduct thorough code reviews and continuously seek better implementations.

3) Never overlook small details in production incidents; a meticulous, inquisitive mindset drives technical excellence.


