Diagnosing and Resolving High CPU Usage in a Linux Gateway Process
This article walks through a real‑world remote debugging session in which a high‑CPU issue in a gateway service was reproduced, analyzed with top, gstack, gcore, strace and gdb, traced to an undersized buffer that caused an infinite loop, and then fixed.
A client reported that after a product upgrade the system became slow and CPU usage spiked dramatically. The issue was urgent, so the engineer connected remotely via GoToMeeting, reproduced the high‑CPU condition, and began collecting diagnostic data using Wireshark, gcore, gstack, strace and top.
Root Cause Identification
Analysis of the collected data revealed that a fixed 10 KB buffer in the code was too small for a rare edge case. When the buffer filled, the program entered an infinite loop, driving the process's CPU usage to 891% across its threads. The fix was straightforward: increase the buffer size.
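The article does not show the faulty code, so the following is only a sketch of one plausible way a full fixed-size buffer can turn a poll loop into a busy loop. The names drain, demo and BUF_SIZE are hypothetical, the buffer is shrunk to 16 bytes for the demo, and a max_iterations cap is added so the example terminates.

```python
import select
import socket

BUF_SIZE = 16  # stand-in for the article's 10 KB buffer (size chosen for the demo)


def drain(sock, buf, max_iterations=1000):
    """Read from sock into a fixed-size buffer; return the number of poll() wake-ups."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    wakeups = 0
    while wakeups < max_iterations:      # safety cap so the demo terminates
        if not poller.poll(100):         # 100 ms timeout: nothing left to read
            break
        wakeups += 1
        room = BUF_SIZE - len(buf)
        if room > 0:
            buf.extend(sock.recv(room))
        # BUG (hypothetical): when room == 0 nothing is read, so the socket
        # stays readable and poll() returns immediately on every iteration.
    return wakeups


def demo(payload_size):
    """Send payload_size bytes through a socket pair and count poll() wake-ups."""
    writer, reader = socket.socketpair()
    try:
        writer.sendall(b"x" * payload_size)
        return drain(reader, bytearray())
    finally:
        writer.close()
        reader.close()
```

With a payload that fits in the buffer, the loop wakes once and then idles in poll. With a larger payload, the loop spins until it hits the cap, which is the pattern that would show up as a thread pinned near 100% CPU with poll at the top of its stack.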
Step‑by‑Step Diagnostic Commands
Identify the offending process with top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14094 root 15 0 315m 10m 7308 S 891% 2.2 1:49.01 gateway
20642 root 17 0 17784 4148 2220 S 0.5 0.8 2:39.96 microdasys
...
Inspect per‑thread CPU usage:
# top -H -p 14094
The output shows 107 threads, nine of which consume most of the CPU. Thread 14086 stands out as one of the main culprits.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14086 root 25 0 922m 914m 538m R 101 10.0 21:35.46 gateway
14087 root 25 0 922m 914m 538m R 101 10.0 10:50.22 gateway
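When there are over a hundred threads, picking out the busy ones by eye is error-prone. The sketch below filters saved top -H output for threads above a CPU threshold. It assumes the 12-column layout shown above, with %CPU in the ninth column; busiest_threads and min_cpu are illustrative names, not part of any tool.

```python
def busiest_threads(top_output, min_cpu=50.0):
    """Return (tid, cpu) pairs from `top -H` output, busiest first."""
    threads = []
    for line in top_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric thread ID; skip headers and blanks.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        cpu = float(fields[8].rstrip("%"))  # %CPU is the 9th column
        if cpu >= min_cpu:
            threads.append((int(fields[0]), cpu))
    return sorted(threads, key=lambda t: t[1], reverse=True)
```

The input can be captured non-interactively with top's batch mode, for example top -b -n 1 -H -p 14094 > top.log, and the file contents passed to the function.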
...
Obtain the stack traces of the process's threads with gstack:
# gstack 14094 > gstack.log
In gstack.log, the stack for thread 14086 (listed as Thread 37) shows only two frames:
Thread 37 (Thread 0x4696ab90 (LWP 14086)):
#0 0x40000410 in __kernel_vsyscall ()
#1 0x40241f33 in poll () from /lib/i686/nosegneg/libc.so.6
Dump the process memory with gcore:
# gcore 14094
This creates core.14094, the same kind of core file a crash would produce, without terminating the running process.
Analyze system calls and their time consumption using strace:
# strace -T -r -c -p 14094
The summary shows that poll accounts for 99.99% of the measured time (22.68 seconds over 6,702 calls), which points to a loop that keeps calling poll.
% time seconds usecs/call calls errors syscall
99.99 22.683879 3385 6702 poll
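If the summary is saved to a file, the dominant system call can be extracted programmatically. The sketch below assumes the column order shown above (% time, seconds, usecs/call, calls, optional errors, syscall); parse_strace_summary and dominant_syscall are illustrative names.

```python
def parse_strace_summary(text):
    """Parse `strace -c` data rows into dicts, skipping headers and separators."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 columns, or 6 when the errors column is filled in.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            pct, seconds = float(fields[0]), float(fields[1])
            usecs, calls = int(fields[2]), int(fields[3])
        except ValueError:
            continue  # separator or otherwise non-numeric line
        rows.append({"syscall": fields[-1], "pct": pct, "seconds": seconds,
                     "usecs_per_call": usecs, "calls": calls})
    return rows


def dominant_syscall(text):
    """Return the row with the largest share of time, or None if no rows parse."""
    rows = parse_strace_summary(text)
    return max(rows, key=lambda r: r["pct"]) if rows else None
```

As a sanity check on the figures above, 22.683879 seconds divided by 6,702 calls is about 3,385 microseconds per call, which matches the usecs/call column.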
...
Debug the core file with gdb and switch to the problematic thread:
# gdb gateway core.14094
(gdb) thread 37
(gdb) where
#0 0x40000410 in __kernel_vsyscall ()
#1 0x40241f33 in poll () from /lib/i686/nosegneg/libc.so.6
Using the detailed stack, variables can be inspected and correlated with the source code to understand why the thread calling poll is consuming excessive CPU.
Analysis Workflow
The reproducible workflow is: Process ID → Thread ID → Thread stack → System‑call timing statistics → Source‑code inspection. This systematic approach can be reused for similar performance incidents.
After increasing the buffer size, the high‑CPU loop disappeared, the client’s complaint was resolved, and the fix was delivered promptly.
ITPUB
