How to Diagnose and Fix Linux CPU 100% Issues with a Handy Shell Script
This article walks you through a systematic approach to diagnosing a Linux server whose CPU is pinned at 100%: pinpointing the high-load process with top, tracing it to the responsible business code, and using a custom shell script that streamlines thread analysis and resolves the overload.
When a Linux server’s CPU usage spikes to 100%, a structured investigation can quickly reveal the underlying problem and prevent unnecessary hardware scaling.
1. Identify the High‑Load Process (PID)
Log into the server and run top to observe the overall load average and each process's CPU percentage. On an 8-core machine, a sustained load average above 8 (the core count) indicates the machine is overloaded. In the example, process ID 682 shows a significant CPU share.
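This first step can also be done non-interactively: with procps-ng, `top -b` takes a batch snapshot, and `/proc/loadavg` plus `nproc` give the load-to-core comparison directly (a sketch assuming a GNU/procps system):

```shell
#!/bin/bash
# One-shot snapshot of the busiest processes, sorted by CPU (no interactive UI)
top -b -n 1 -o %CPU | head -n 15

# Load average vs. core count: sustained load above the core count means saturation
read -r load1 _ < /proc/loadavg
cores=$(nproc)
echo "1-min load: $load1 on $cores cores"
```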
2. Locate the Specific Business Component
Use pwdx <PID> to retrieve the working directory of the process, which points to the responsible service (e.g., the data platform’s web service).
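`pwdx` ships with procps; where it is missing, reading the `/proc/<PID>/cwd` symlink gives the same answer. A sketch using the current shell's PID as a stand-in for the suspect process:

```shell
#!/bin/bash
pid=$$                                          # stand-in for the suspect PID (e.g. 682)
pwdx "$pid" 2>/dev/null \
    || echo "$pid: $(readlink /proc/$pid/cwd)"  # fallback without procps
```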
3. Pinpoint the Abnormal Thread and Code Line
The traditional four‑step method is:
Sort processes by CPU with top -o %CPU to get the max‑load PID.
List threads of that PID using top -Hp <PID>.
Convert the thread ID to hexadecimal: printf "0x%x" <TID>.
Run jstack <PID> | vim - +/0x<hex‑tid> to open the stack trace with the cursor on that thread (the trailing - makes vim read from stdin).
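The four steps above can be chained into one non-interactive sketch (procps `ps` assumed; step 4 needs a JDK on the PATH, so it is left as a comment):

```shell
#!/bin/bash
# 1. PID of the process consuming the most CPU
pid=$(ps -eo pid,pcpu --no-headers --sort=-pcpu | awk 'NR==1{print $1}')
# 2. LWP (kernel thread id) of that process's busiest thread
tid=$(ps -p "$pid" -Lo lwp,pcpu --no-headers --sort=-pcpu | awk 'NR==1{print $1}')
# 3. jstack reports native thread ids in hex as "nid=0x..."
hex_tid=$(printf '0x%x' "$tid")
echo "pid=$pid tid=$tid nid=$hex_tid"
# 4. jstack "$pid" | grep -A 20 "nid=$hex_tid "
```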
Because this troubleshooting process is time‑critical on a loaded production box, a faster tool show-busy-java-threads.sh was created to automate these steps and directly display the hottest Java threads.
Root‑Cause Analysis
The investigation revealed that a time‑utility method converting timestamps to formatted dates was invoked excessively. The method’s logic:
Abnormal method: Converts a timestamp to a specific date‑time format.
Upper‑level call: Calculates every second from midnight to the current time, formats it, and stores it in a set.
Logic layer: Real‑time report queries repeatedly call this method many times per request, causing massive CPU consumption.
For a 10 AM query, the method runs 10 × 60 × 60 × n times (36,000 × n), and the count grows linearly as the day progresses, leading to severe CPU waste.
Solution
After locating the hotspot, the method was simplified to compute current_seconds - midnight_seconds directly, eliminating the heavy formatting loop. The revised implementation reduced CPU usage by a factor of 30, restoring normal server performance.
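The fix collapses the whole formatting loop into a single subtraction. The original fix was in Java; this shell sketch, using the GNU `date -d` extension, only illustrates the arithmetic:

```shell
#!/bin/bash
# Seconds elapsed since local midnight: one subtraction instead of
# formatting every second of the day so far
now=$(date +%s)
midnight=$(date -d 'today 00:00:00' +%s)   # GNU date extension
echo $(( now - midnight ))
```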
Utility Script: show-busy-java-threads.sh
#!/bin/bash
# @Function
# Find out the highest cpu consumed threads of java, and print the stack of these threads.
# @Usage
# $ ./show-busy-java-threads.sh
# Options:
# -p, --pid find out the highest cpu consumed threads from the specified java process; default is all java processes.
# -c, --count set the thread count to show, default is 5
# -h, --help display this help and exit
readonly PROG=`basename $0`
readonly -a COMMAND_LINE=("$0" "$@")
usage() {
cat <<EOF
Usage: ${PROG} [OPTION]...
Find out the highest cpu consumed threads of java, and print the stack of these threads.
Example: ${PROG} -c 10
Options:
-p, --pid find out the highest cpu consumed threads from the specified java process,
default is all java processes.
-c, --count set the thread count to show, default is 5
-h, --help display this help and exit
EOF
exit ${1:-0}
}
# assign first, then mark readonly: `readonly ARGS=...` would mask getopt's exit status
ARGS=`getopt -n "${PROG}" -a -o c:p:h -l count:,pid:,help -- "$@"`
[ $? -ne 0 ] && usage 1
readonly ARGS
eval set -- "${ARGS}"
while true; do
case "$1" in
-c|--count)
count="$2"
shift 2
;;
-p|--pid)
pid="$2"
shift 2
;;
-h|--help)
usage
;;
--)
shift
break
;;
esac
done
count=${count:-5}
redEcho() { [ -c /dev/stdout ] && { echo -ne "\033[1;31m"; echo -n "$@"; echo -e "\033[0m"; } || echo "$@"; }
yellowEcho() { [ -c /dev/stdout ] && { echo -ne "\033[1;33m"; echo -n "$@"; echo -e "\033[0m"; } || echo "$@"; }
blueEcho() { [ -c /dev/stdout ] && { echo -ne "\033[1;36m"; echo -n "$@"; echo -e "\033[0m"; } || echo "$@"; }
# Ensure jstack is available
if ! which jstack &>/dev/null; then
[ -z "$JAVA_HOME" ] && { redEcho "Error: jstack not found on PATH!"; exit 1; }
[ ! -f "$JAVA_HOME/bin/jstack" ] && { redEcho "Error: jstack not found in JAVA_HOME!"; exit 1; }
[ ! -x "$JAVA_HOME/bin/jstack" ] && { redEcho "Error: jstack is not executable!"; exit 1; }
export PATH="$JAVA_HOME/bin:$PATH"
fi
readonly uuid=`date +%s`_${RANDOM}_$$
cleanupWhenExit() { rm /tmp/${uuid}_* &>/dev/null; }
trap "cleanupWhenExit" EXIT
printStackOfThreads() {
local line count=1
while IFS=" " read -a line; do
local pid=${line[0]}
local threadId=${line[1]}
local threadId0x="0x$(printf %x $threadId)"
local user=${line[2]}
local pcpu=${line[4]}
local jstackFile=/tmp/${uuid}_${pid}
[ ! -f "$jstackFile" ] && {
if [ "$user" == "$USER" ]; then
jstack $pid > $jstackFile
else
if [ $UID -eq 0 ]; then
sudo -u $user jstack $pid > $jstackFile
else
redEcho "[${count}] Fail to jstack Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process($pid) under user($user)."
redEcho "User of java process($user) is not current user($USER), need sudo to run again:"
yellowEcho " sudo ${COMMAND_LINE[@]}"
((count++))
continue
fi
fi
}
blueEcho "[${count}] Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process($pid) under user($user):"
sed -n "/nid=${threadId0x} /,/^$/p" $jstackFile
((count++))
done
}
ps -Leo pid,lwp,user,comm,pcpu --no-headers | {
[ -z "$pid" ] && awk '$4=="java"{print $0}' || awk -v pid=$pid '$1==pid && $4=="java"{print $0}'
} | sort -k5 -r -n | head -n $count | printStackOfThreads

By applying the above steps and the script, the CPU overload caused by the time‑utility method was eliminated, and server performance returned to normal.
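Note that the script's option handling depends on GNU getopt (the `-l` long-option form is not in BSD getopt). A minimal stand-alone reproduction of that parsing step, with hypothetical arguments:

```shell
#!/bin/bash
# Sketch of the script's GNU-getopt option parsing, fed sample arguments
ARGS=$(getopt -n demo -o c:p:h -l count:,pid:,help -- -c 10 --pid 682)
eval set -- "$ARGS"
while true; do
    case "$1" in
        -c|--count) count="$2"; shift 2 ;;
        -p|--pid)   pid="$2";   shift 2 ;;
        -h|--help)  shift ;;
        --) shift; break ;;
    esac
done
echo "count=${count:-5} pid=${pid:-all}"   # prints: count=10 pid=682
```

The real script is invoked the same way, e.g. `./show-busy-java-threads.sh -c 3` or `./show-busy-java-threads.sh -p 682`.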
Linux Cloud Computing Practice
Welcome to Linux Cloud Computing Practice. We offer high-quality articles on Linux, cloud computing, DevOps, networking and related topics. Dive in and start your Linux cloud computing journey!