
Diagnosing and Resolving High CPU Usage in a Linux Gateway Process

This article walks through a real‑world remote debugging session in which a high‑CPU issue in a gateway service was reproduced, analyzed with top, gstack, gcore, strace and gdb, traced to an undersized buffer that sent the code into an infinite loop once it filled up, and then fixed.

ITPUB

A client reported that after a product upgrade the system became slow and CPU usage spiked dramatically. The issue was urgent, so the engineer connected remotely via GoToMeeting, reproduced the high‑CPU condition, and began collecting diagnostic data using Wireshark, gcore, gstack, strace and top.

Root Cause Identification

Analysis of the collected logs showed that a 10 KB buffer allocated in the gateway code was too small for a rare edge case. Once the buffer filled up, the program entered an infinite loop, and top showed the gateway process at 891% CPU (top sums the usage of all threads, with 100% equal to one fully busy core, so this is roughly nine cores' worth). The fix was straightforward: increase the buffer size.
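The article does not show the gateway's source, so the exact loop is unknown. The following hypothetical Python sketch (the function drain, the pipe and the iteration cap are all assumptions made for illustration) shows one common way a full fixed-size buffer turns a poll loop into a busy loop: with no free space left, the read requests zero bytes, nothing is consumed, and poll immediately reports the descriptor as readable again.

```python
import os
import select


def drain(fd, buf_size, max_iterations=1000):
    """Poll fd and copy readable data into a buffer of at most buf_size bytes.

    Returns (bytes_buffered, loop_iterations). max_iterations stands in for
    "forever" so the sketch terminates instead of spinning indefinitely.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    buf = bytearray()
    iterations = 0
    while iterations < max_iterations:
        iterations += 1
        # Timeout 0 keeps the sketch non-blocking; real code would wait here.
        if not poller.poll(0):
            break                     # nothing left to read: the loop is done
        free = buf_size - len(buf)
        chunk = os.read(fd, free)     # once the buffer is full, free == 0 ...
        if not chunk:
            continue                  # ... nothing is drained, so poll fires again
        buf += chunk
    return len(buf), iterations
```

With 32 bytes waiting in a pipe, drain(r, 16) stops only at the iteration cap and returns (16, 1000), while drain(r, 64) on a fresh pipe reads everything and returns (32, 2), which loosely mirrors why enlarging the buffer made the loop disappear. A sturdier variant would treat a full buffer as an error or grow it rather than retrying.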

Step‑by‑Step Diagnostic Commands

Identify the offending process with top:

PID USER   PR  NI   VIRT   RES   SHR S %CPU %MEM    TIME+ COMMAND
14094 root   15   0   315m   10m 7308 S 891%  2.2   1:49.01 gateway
20642 root   17   0   17784 4148 2220 S  0.5  0.8   2:39.96 microdasys
...
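To pick the busiest process from a saved top -b -n 1 snapshot rather than by eye, a minimal Python sketch could parse the rows, assuming the default column order shown above (%CPU in the ninth column, COMMAND last). The helper name hottest is hypothetical.

```python
def hottest(top_lines, limit=5):
    """Return up to `limit` (pid, cpu_percent, command) tuples, highest %CPU first."""
    rows = []
    for line in top_lines:
        fields = line.split(None, 11)
        # Skip header, summary and blank lines: data rows start with a numeric PID.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        cpu = float(fields[8].rstrip("%"))  # tolerate values printed as "891%"
        rows.append((int(fields[0]), cpu, fields[11].strip()))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]
```

For example, after top -b -n 1 > top.txt, calling hottest(open("top.txt").read().splitlines()) lists the heaviest processes, with gateway first for a snapshot like the one above.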

Inspect per‑thread CPU usage:

# top -H -p 14094

The output lists 107 threads, nine of which consume most of the CPU. Thread 14086 stands out as a primary culprit.

PID USER   PR  NI   VIRT   RES   SHR S %CPU %MEM    TIME+ COMMAND
14086 root   25   0   922m  914m 538m R 101 10.0  21:35.46 gateway
14087 root   25   0   922m  914m 538m R 101 10.0  10:50.22 gateway
...
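top -H reports the per-thread CPU accounting that the Linux kernel exposes under /proc/<pid>/task/. As a hedged sketch of the same idea (not how top itself is implemented), the following samples utime + stime for every thread twice and ranks threads by the CPU time consumed in between. The function names and the one-second default interval are assumptions, and Linux procfs is required.

```python
import os
import time


def _cpu_ticks(pid):
    """Map each thread ID of pid to its accumulated utime + stime, in clock ticks."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # the thread exited between listdir() and open()
        # comm (field 2) may contain spaces, so split only after its closing ')'.
        rest = stat.rsplit(")", 1)[1].split()
        utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15 of stat
        ticks[int(tid)] = utime + stime
    return ticks


def thread_cpu(pid, interval=1.0):
    """Return (tid, cpu_percent) pairs for pid's threads, busiest first."""
    hz = os.sysconf("SC_CLK_TCK")
    before = _cpu_ticks(pid)
    time.sleep(interval)
    after = _cpu_ticks(pid)
    usage = [
        (tid, 100.0 * (ticks - before[tid]) / hz / interval)
        for tid, ticks in after.items()
        if tid in before  # skip threads that started during the sample
    ]
    return sorted(usage, key=lambda item: item[1], reverse=True)
```

Calling thread_cpu(14094) would give figures comparable to the %CPU column of top -H -p 14094, where 100 means one fully busy core.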

Capture the stacks of all threads with gstack:

# gstack 14094 > gstack.log

In gstack.log, the entry for thread 14086 (gdb thread 37) shows only two frames:

Thread 37 (Thread 0x4696ab90 (LWP 14086)):
#0  0x40000410 in __kernel_vsyscall ()
#1  0x40241f33 in poll () from /lib/i686/nosegneg/libc.so.6

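Dump the process memory with gcore:

# gcore 14094

This writes core.14094, a core file equivalent to the one produced when a process crashes, while the gateway keeps running afterwards. The dump can then be analyzed offline with gdb.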

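Measure system calls and the time spent in them with strace:

# strace -T -r -c -p 14094

With -c, strace suppresses the per-call trace and prints only a summary table when it is interrupted (for example with Ctrl-C), so -T and -r, which annotate that per-call trace, have no visible effect here. Note also that -p without -f attaches only to the thread whose ID equals the PID; adding -f (strace -c -f -p 14094) traces every thread of the process. The summary shows poll accounting for 99.99% of the time (22.68 seconds over 6,702 calls), which points to the loop that keeps calling poll.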

% time   seconds   usecs/call   calls   errors   syscall
99.99   22.683879      3385     6702            poll
...
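The usecs/call column is simply seconds divided by calls: 22.683879 s / 6702 calls ≈ 3384.6 µs, displayed as 3385. A minimal Python sketch that parses strace -c summary rows and recomputes that figure, assuming the column order shown above (% time, seconds, usecs/call, calls, optional errors, syscall); the name parse_strace_summary is hypothetical.

```python
def parse_strace_summary(text):
    """Map syscall name -> (percent_time, seconds, calls, usecs_per_call)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields, or 6 when the errors column is filled in.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            percent, seconds = float(fields[0]), float(fields[1])
            calls = int(fields[3])
        except ValueError:
            continue  # header or separator line
        if calls == 0:
            continue
        usecs = seconds / calls * 1e6  # recompute usecs/call from the raw columns
        stats[fields[-1]] = (percent, seconds, calls, round(usecs))
    return stats
```

Applied to the row above, it yields (99.99, 22.683879, 6702, 3385) for poll, matching the table.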

Debug the core file with gdb and switch to the problematic thread:

# gdb gateway core.14094
(gdb) thread 37
(gdb) where
#0  0x40000410 in __kernel_vsyscall ()
#1  0x40241f33 in poll () from /lib/i686/nosegneg/libc.so.6

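With the thread selected, move to the caller frames (for example with up or frame N), inspect their variables (info locals, print), and correlate them with the source code to understand why the loop around poll consumes so much CPU. Local variables are only visible if the binary was built with debug symbols.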
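The interactive session can also be scripted, which helps when the core file has to be examined on another machine. A hedged Python sketch using gdb's standard -batch and -ex options; the helper names are hypothetical, and bt full only shows local variables when debug symbols are available.

```python
import subprocess


def gdb_batch_args(executable, core, thread):
    """Build a non-interactive gdb command that prints one thread's full backtrace."""
    return [
        "gdb", "-batch",
        "-ex", f"thread {thread}",  # gdb's thread number (37 above), not the LWP
        "-ex", "bt full",           # backtrace plus the locals of every frame
        executable, core,
    ]


def run_gdb(executable, core, thread):
    """Run gdb and return its output; requires gdb on PATH."""
    result = subprocess.run(
        gdb_batch_args(executable, core, thread),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For the incident above, run_gdb("gateway", "core.14094", 37) would print the same stack as the interactive where command, plus local variables where symbols allow.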

Analysis Workflow

The reproducible workflow is: Process ID → Thread ID → Thread stack → System‑call timing statistics → Source‑code inspection. This systematic approach can be reused for similar performance incidents.
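The workflow can be kept as a checklist of commands. A minimal Python sketch that only builds the command lines in order (it does not execute them); the function name diagnosis_plan is hypothetical, and it adds strace's -f so that every thread is traced rather than only the main one.

```python
def diagnosis_plan(pid, executable, thread_number=None):
    """Return (step, argv) pairs: process -> threads -> stacks -> syscalls -> source."""
    pid = str(pid)
    plan = [
        ("process", ["top", "-b", "-n", "1"]),
        ("threads", ["top", "-H", "-b", "-n", "1", "-p", pid]),
        ("stacks", ["gstack", pid]),
        ("core", ["gcore", pid]),                    # writes core.<pid>
        ("syscalls", ["strace", "-c", "-f", "-p", pid]),  # stop with Ctrl-C
    ]
    gdb = ["gdb", "-batch"]
    if thread_number is not None:
        gdb += ["-ex", f"thread {thread_number}"]
    gdb += ["-ex", "bt full", executable, f"core.{pid}"]
    plan.append(("source", gdb))
    return plan
```

Each argv can be run by hand or passed to subprocess.run once the previous step has identified the process, thread and gdb thread number.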

After increasing the buffer size, the high‑CPU loop disappeared, the client’s complaint was resolved, and the fix was delivered promptly.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
