How to Diagnose and Fix Linux CPU 100% Issues with a Handy Shell Script
This article walks you through a systematic approach to diagnosing a Linux server whose CPU is pinned at 100%: pinpointing the high-load process with top, tracing it to the responsible business code, and using a custom shell script that streamlines thread analysis and resolves the overload.
When a Linux server’s CPU usage spikes to 100%, a structured investigation can quickly reveal the underlying problem and prevent unnecessary hardware scaling.
1. Identify the High‑Load Process (PID)
Log into the server and run top to observe the overall load average and each process's CPU percentage. On an 8-core machine, a sustained load average above 8 (the core count) indicates the machine is overloaded. In the example, process ID 682 shows a significant CPU share.
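This first step can also be done non-interactively: with procps-ng, `top -b` takes a batch snapshot, and `/proc/loadavg` plus `nproc` give the load-to-core comparison directly (a sketch assuming a GNU/procps system):

```shell
#!/bin/bash
# One-shot snapshot of the busiest processes, sorted by CPU (no interactive UI)
top -b -n 1 -o %CPU | head -n 15

# Load average vs. core count: sustained load above the core count means saturation
read -r load1 _ < /proc/loadavg
cores=$(nproc)
echo "1-min load: $load1 on $cores cores"
```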
2. Locate the Specific Business Component
Use pwdx <PID> to retrieve the working directory of the process, which points to the responsible service (e.g., the data platform’s web service).
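`pwdx` ships with procps; where it is missing, reading the `/proc/<PID>/cwd` symlink gives the same answer. A sketch using the current shell's PID as a stand-in for the suspect process:

```shell
#!/bin/bash
pid=$$                                          # stand-in for the suspect PID (e.g. 682)
pwdx "$pid" 2>/dev/null \
    || echo "$pid: $(readlink /proc/$pid/cwd)"  # fallback without procps
```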
3. Pinpoint the Abnormal Thread and Code Line
The traditional four‑step method is:
Sort processes by CPU with top -o %CPU to get the max‑load PID.
List threads of that PID using top -Hp <PID>.
Convert the thread ID to hexadecimal: printf "0x%x" <TID>.
Run jstack <PID> | vim - +/0x<hex‑tid> to open the stack trace with the cursor on that thread (the trailing - makes vim read from stdin).
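The four steps above can be chained into one non-interactive sketch (procps `ps` assumed; step 4 needs a JDK on the PATH, so it is left as a comment):

```shell
#!/bin/bash
# 1. PID of the process consuming the most CPU
pid=$(ps -eo pid,pcpu --no-headers --sort=-pcpu | awk 'NR==1{print $1}')
# 2. LWP (kernel thread id) of that process's busiest thread
tid=$(ps -p "$pid" -Lo lwp,pcpu --no-headers --sort=-pcpu | awk 'NR==1{print $1}')
# 3. jstack reports native thread ids in hex as "nid=0x..."
hex_tid=$(printf '0x%x' "$tid")
echo "pid=$pid tid=$tid nid=$hex_tid"
# 4. jstack "$pid" | grep -A 20 "nid=$hex_tid "
```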
Because this troubleshooting process is time‑critical on a loaded production box, a faster tool show-busy-java-threads.sh was created to automate these steps and directly display the hottest Java threads.
Root‑Cause Analysis
The investigation revealed that a time‑utility method converting timestamps to formatted dates was invoked excessively. The method’s logic:
Abnormal method: Converts a timestamp to a specific date‑time format.
Upper‑level call: Calculates every second from midnight to the current time, formats it, and stores it in a set.
Logic layer: Real‑time report queries repeatedly call this method many times per request, causing massive CPU consumption.
For a 10 AM query, the method runs 10 × 60 × 60 × n times (36,000 × n), and the count grows linearly as the day progresses, leading to severe CPU waste.
Solution
After locating the hotspot, the method was simplified to compute current_seconds - midnight_seconds directly, eliminating the heavy formatting loop. The revised implementation reduced CPU usage by a factor of 30, restoring normal server performance.
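The fix collapses the whole formatting loop into a single subtraction. The original fix was in Java; this shell sketch, using the GNU `date -d` extension, only illustrates the arithmetic:

```shell
#!/bin/bash
# Seconds elapsed since local midnight: one subtraction instead of
# formatting every second of the day so far
now=$(date +%s)
midnight=$(date -d 'today 00:00:00' +%s)   # GNU date extension
echo $(( now - midnight ))
```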
Utility Script: show-busy-java-threads.sh
#!/bin/bash
# @Function
# Find out the highest cpu consumed threads of java, and print the stack of these threads.
# @Usage
# $ ./show-busy-java-threads.sh
# Options:
# -p, --pid find out the highest cpu consumed threads from the specified java process; default is all java processes.
# -c, --count set the thread count to show, default is 5
# -h, --help display this help and exit
readonly PROG=`basename $0`
readonly -a COMMAND_LINE=("$0" "$@")
usage() {
cat <<EOF
Usage: ${PROG} [OPTION]...
Find out the highest cpu consumed threads of java, and print the stack of these threads.
Example: ${PROG} -c 10
Options:
-p, --pid find out the highest cpu consumed threads from the specified java process,
default is all java processes.
-c, --count set the thread count to show, default is 5
-h, --help display this help and exit
EOF
exit ${1:-0}
}
# assign first, then mark readonly: `readonly ARGS=...` would mask getopt's exit status
ARGS=`getopt -n "${PROG}" -a -o c:p:h -l count:,pid:,help -- "$@"`
[ $? -ne 0 ] && usage 1
readonly ARGS
eval set -- "${ARGS}"
while true; do
case "$1" in
-c|--count)
count="$2"
shift 2
;;
-p|--pid)
pid="$2"
shift 2
;;
-h|--help)
usage
;;
--)
shift
break
;;
esac
done
count=${count:-5}
redEcho() { [ -c /dev/stdout ] && { echo -ne "\033[1;31m"; echo -n "$@"; echo -e "\033[0m"; } || echo "$@"; }
yellowEcho() { [ -c /dev/stdout ] && { echo -ne "\033[1;33m"; echo -n "$@"; echo -e "\033[0m"; } || echo "$@"; }
blueEcho() { [ -c /dev/stdout ] && { echo -ne "\033[1;36m"; echo -n "$@"; echo -e "\033[0m"; } || echo "$@"; }
# Ensure jstack is available
if ! which jstack &>/dev/null; then
[ -z "$JAVA_HOME" ] && { redEcho "Error: jstack not found on PATH!"; exit 1; }
[ ! -f "$JAVA_HOME/bin/jstack" ] && { redEcho "Error: jstack not found in JAVA_HOME!"; exit 1; }
[ ! -x "$JAVA_HOME/bin/jstack" ] && { redEcho "Error: jstack is not executable!"; exit 1; }
export PATH="$JAVA_HOME/bin:$PATH"
fi
readonly uuid=`date +%s`_${RANDOM}_$$
cleanupWhenExit() { rm /tmp/${uuid}_* &>/dev/null; }
trap "cleanupWhenExit" EXIT
printStackOfThreads() {
local line count=1
while IFS=" " read -a line; do
local pid=${line[0]}
local threadId=${line[1]}
local threadId0x="0x$(printf %x $threadId)"
local user=${line[2]}
local pcpu=${line[4]}
local jstackFile=/tmp/${uuid}_${pid}
[ ! -f "$jstackFile" ] && {
if [ "$user" == "$USER" ]; then
jstack $pid > $jstackFile
else
if [ $UID -eq 0 ]; then
sudo -u $user jstack $pid > $jstackFile
else
redEcho "[${count}] Fail to jstack Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process($pid) under user($user)."
redEcho "User of java process($user) is not current user($USER), need sudo to run again:"
yellowEcho " sudo ${COMMAND_LINE[@]}"
((count++))
continue
fi
fi
}
blueEcho "[${count}] Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process($pid) under user($user):"
sed -n "/nid=${threadId0x} /,/^$/p" $jstackFile
((count++))
done
}
ps -Leo pid,lwp,user,comm,pcpu --no-headers | {
[ -z "$pid" ] && awk '$4=="java"{print $0}' || awk -v pid=$pid '$1==pid && $4=="java"{print $0}'
} | sort -k5 -r -n | head -n $count | printStackOfThreads

By applying the above steps and the script, the CPU overload caused by the time‑utility method was eliminated, and server performance returned to normal.
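Note that the script's option handling depends on GNU getopt (the `-l` long-option form is not in BSD getopt). A minimal stand-alone reproduction of that parsing step, with hypothetical arguments:

```shell
#!/bin/bash
# Sketch of the script's GNU-getopt option parsing, fed sample arguments
ARGS=$(getopt -n demo -o c:p:h -l count:,pid:,help -- -c 10 --pid 682)
eval set -- "$ARGS"
while true; do
    case "$1" in
        -c|--count) count="$2"; shift 2 ;;
        -p|--pid)   pid="$2";   shift 2 ;;
        -h|--help)  shift ;;
        --) shift; break ;;
    esac
done
echo "count=${count:-5} pid=${pid:-all}"   # prints: count=10 pid=682
```

The real script is invoked the same way, e.g. `./show-busy-java-threads.sh -c 3` or `./show-busy-java-threads.sh -p 682`.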
Linux Cloud Computing Practice
Welcome to Linux Cloud Computing Practice. We offer high-quality articles on Linux, cloud computing, DevOps, networking and related topics. Dive in and start your Linux cloud computing journey!