Master Linux Server Performance Troubleshooting: A Complete Step‑by‑Step Guide

This comprehensive guide walks Linux system administrators through a systematic performance‑troubleshooting workflow, covering CPU, memory, disk I/O, and network analysis with concrete commands, metrics, common bottleneck causes, real‑world case studies, and practical optimization recommendations.

Foundations of Linux Performance Investigation

Performance problems fall into four categories:

CPU‑intensive: high load, low I/O wait, e.g., heavy calculations, encryption, regex backtracking.

Memory‑intensive: decreasing free memory, rising swap, e.g., memory leaks, oversized JVM heap.

I/O‑intensive: massive disk reads/writes, high I/O wait, e.g., log flooding, large DB operations.

Network‑intensive: bandwidth saturation or abnormal connection counts, e.g., DDoS, aggressive crawlers.

The recommended investigation order is “overall first, then local; CPU first, then memory; I/O and network together”. The steps are:

Quick load overview: top or uptime. Load average > CPU core count indicates contention.

System‑wide metrics: vmstat 2 5 (focus on r, b, wa, si/so).

Disk I/O check: iostat -x 2 5.

Memory check: free -h (pay attention to the available column).

Network check: netstat -s or ss -s.

After locating the bottleneck, drill down with top, ps, etc., to identify the offending process.

Load average should be compared with the number of CPU cores (e.g., on an 8‑core machine, load 2.85 is light, 8 is fully utilized, 12 means four tasks are waiting).
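
As a quick sanity check, the comparison can be scripted; a minimal sketch (assumes the standard /proc/loadavg layout and coreutils nproc):

cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
echo "1-minute load: $load1, CPU cores: $cores"
# Flag contention when the 1-minute load exceeds the core count
awk -v l="$load1" -v c="$cores" 'BEGIN { if (l > c) print "WARNING: load exceeds core count" }'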

CPU Performance Investigation

Quick check – the first three lines of top show load, task count and CPU usage:

%Cpu(s): 25.0 us, 5.0 sy, 0.0 ni, 68.0 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st

us – user space.

sy – kernel space.

id – idle.

wa – I/O wait.

hi/si – hardware / software interrupts.

st – stolen time (virtualised environments).

Typical interpretations:

us high, sy low : normal compute‑bound workload.

us high, sy high : many system calls or context switches.

wa high : processes blocked on I/O.

hi/si high : heavy interrupt handling (network or storage drivers).

Locate top CPU consumers: ps aux --sort=-%cpu | head -20. Look for processes whose %CPU is near or above 100 % of a single core.

Thread view for multithreaded processes (Java, Go):

ps -eLf | grep <pid>
top -H -p <pid>
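
For Java processes specifically, a common follow‑up (a sketch; the thread ID 12345 is illustrative) is to convert the busiest thread ID reported by top -H into hexadecimal and locate it in a jstack thread dump:

printf '%x\n' 12345                        # hot thread ID from top -H, e.g. prints 3039
jstack <pid> | grep -A 20 'nid=0x3039'     # stack trace of that thread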

Common high‑CPU causes :

Infinite loops or heavy calculations.

Frequent Java GC.

Catastrophic regex backtracking.

Encryption/compression.

Remediation workflow :

Identify the offending process ( ps aux --sort=-%cpu | head -10).

Inspect details ( ps -ef | grep <pid>).

Check open files/sockets ( lsof -p <pid>).

Review application logs.

Take action based on root cause – kill abnormal process, restart/roll back, adjust configuration, or scale resources.

System‑level CPU optimisations :

Adjust priority: nice -n 10 /path/to/app or renice -n 10 -p <pid>.

Bind to specific cores: taskset -pc 0-3 <pid>.

Memory Performance Investigation

Quick view :

free -h
              total        used        free      shared  buff/cache   available
Mem:          31Gi       28Gi       1.5Gi       200Mi       1.5Gi       2.5Gi
Swap:          8Gi          0B        8Gi

Focus on the available column – Linux repurposes idle memory as cache, which can be reclaimed.

Detailed memory info: cat /proc/meminfo. Key fields: MemTotal, MemFree, MemAvailable, Buffers, Cached, SwapCached, Active(anon), Inactive(anon), SwapTotal, SwapFree. Persistent non‑zero swap usage signals memory pressure.

Process memory usage (sorted): ps aux --sort=-%mem | head -20. Leak detection – monitor a suspect process over time:

for i in {1..20}; do
  ps -o pid,rss,vsz,comm -p <pid>
  sleep 5
done

Or view memory maps: pmap -x <pid> | sort -k3 -n -r | head -20.

Typical high‑memory causes:

Application memory leak (common in Java).

Oversized caches (Nginx, Redis).

JVM heap larger than half of physical RAM.

Too many processes.

Swap investigation :

swapon -s
free -h
vmstat 2 5

Identify top swap users with a script that scans /proc/*/status; one possible sketch is shown below.
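
A minimal sketch of such a script (VmSwap is only reported when swap accounting is enabled; the output format is an assumption):

#!/bin/bash
# Rank processes by swap usage (kB), largest first
for status in /proc/[0-9]*/status; do
    pid=${status#/proc/}; pid=${pid%/status}
    name=$(awk '/^Name:/ {print $2}' "$status" 2>/dev/null)
    swap=$(awk '/^VmSwap:/ {print $2}' "$status" 2>/dev/null)
    [ -n "$swap" ] && [ "$swap" -gt 0 ] && echo "$swap kB   $pid   $name"
done | sort -rn | head -20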

Memory remediation workflow :

Confirm shortage with free -h and vmstat.

Find top consumers ( ps aux --sort=-%mem | head -10).

Analyse the process type – kernel threads such as kswapd (memory reclaim) or jbd2 (filesystem journaling) point to memory or I/O pressure; a business process requires log/config review; many high‑memory processes suggest overall RAM insufficiency.

Take measures – adjust JVM heap, clean unnecessary processes, add RAM, optimise application memory usage.

Disk I/O Performance Investigation

Key concepts:

IOPS – operations per second (random performance).

Throughput – MB/s or GB/s (sequential performance).

Latency – per‑operation response time (ms).

Utilisation – % of time the device handles I/O; near 100 % means saturation.

Device types differ dramatically: HDD (tens to hundreds of IOPS, millisecond latency), SATA SSD (tens of thousands of IOPS, sub‑millisecond), NVMe SSD (hundreds of thousands of IOPS, microseconds).

iostat usage (install sysstat first): iostat -x 2 5. Important columns: r/s, w/s, rkB/s, wkB/s, await, aqu-sz, %util. Typical I/O bottleneck signals:

%util ≈ 100 % → device saturated.

await far above normal (HDD 10‑20 ms, SSD < 1 ms) → high latency.

aqu-sz > 1 (single disk) → queue buildup.
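
For orientation, a saturated spinning disk might show a line like the following (values are purely illustrative, and the exact column set varies with the sysstat version):

Device    r/s     w/s    rkB/s     wkB/s   await  aqu-sz   %util
sda       5.0   320.0     80.0   48000.0    85.3    12.4    99.8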

Per‑process I/O with iotop (requires root):

iotop -o   # show only active I/O processes
iotop -p <pid>

If iotop is unavailable, fall back to pidstat -d 2 5.

Common I/O culprits :

Excessive log writes.

Database I/O pressure.

Frequent temporary file usage.

Swap activity caused by memory shortage.

I/O remediation workflow :

Confirm I/O issue with iostat -x (watch %util and await).

Determine read vs. write dominance ( r/s vs w/s).

Locate offending process via iotop.

Identify problematic files or mount points with lsof -p <pid> or df -h.

Mitigate – reduce unnecessary I/O, optimise application I/O (async, batch), upgrade to faster storage, increase RAM to lower swap, clean unused files.

Network Performance Investigation

Quick overview: netstat -s or ss -s.

Active connections: ss -tan state established | head -20.

Bandwidth check: cat /proc/net/dev, or install iftop and run iftop -i eth0.

Connection‑state analysis (counts of each TCP state):

netstat -an | awk '/^tcp/{print $NF}' | sort | uniq -c
ss -tan state time-wait | wc -l
ss -tan state syn-recv | wc -l

Typical anomalies:

Many TIME_WAIT – frequent open/close cycles.

Many SYN_RECV – possible SYN flood.

Many ESTABLISHED without traffic – connection leak.

Latency checks :

ping -c 10 <target>
time nslookup example.com
telnet <host> <port>

NIC queues and interrupts :

cat /proc/interrupts | grep eth
ethtool -l eth0   # view queues
ethtool -L eth0 combined 8   # adjust queues

Network remediation workflow :

Confirm network issue with ping to gateway and external IP.

Analyse connection states; high SYN_RECV suggests SYN flood.

Inspect firewall rules (iptables -L -n and iptables -t nat -L -n).

Mitigate – rate‑limit with iptables or cloud DDoS protection, fix firewall/routing misconfigurations, upgrade bandwidth or optimise traffic.
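
As a sketch of the iptables rate‑limiting option above (the thresholds are illustrative and must be tuned to real traffic, since the rule also throttles legitimate clients):

# Accept new TCP connections at a bounded rate, drop the excess
iptables -A INPUT -p tcp --syn -m limit --limit 25/second --limit-burst 50 -j ACCEPT
iptables -A INPUT -p tcp --syn -j DROP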

Integrated Case Studies

Case 1 – High Load, Low CPU Usage

Symptom : uptime shows high load, but top reports low CPU utilisation.

Analysis : Load is high because processes are waiting on I/O.

Steps :

Run vmstat 2 5 – focus on r (runnable) and b (blocked).

Run iostat -x 2 5 – look for devices with high %util.

Run iotop – identify the process generating most I/O.

Root cause : Excessive log writes or database I/O.

Case 2 – Memory Continuously Grows

Symptom : free shows rising memory usage; application logs show no errors.

Analysis : Possible memory leak or cache growth.

Steps :

List processes by memory: ps aux --sort=-%mem | head -20.

Monitor a suspect process:

for i in {1..30}; do
  ps -o pid,rss,vsz,comm -p <pid>
  sleep 10
done

Check swap usage with free -h and vmstat 2 5.

Root cause : Application memory leak or oversized JVM heap.

Case 3 – High CPU, No High‑CPU Process

Symptom : top shows high CPU, but ps cannot find a process with high CPU.

Analysis : CPU consumed by system calls or interrupts.

Steps :

Inspect top for sy (system) usage.

Check interrupt counts: cat /proc/interrupts.

Check soft interrupts: cat /proc/softirqs.

Trace a suspect process: strace -p <pid> -c.

Root cause : Excessive system calls, soft interrupts, or network interrupts.
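
To see which cores are absorbing the interrupt or system‑call load, mpstat from the sysstat package breaks CPU time down per core (a quick sketch):

mpstat -P ALL 2 5     # watch the %sys, %irq and %soft columns per CPU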

Performance Optimisation Recommendations

System Parameter Tuning

File‑descriptor limits (high‑concurrency services):

# Current limit
ulimit -n
# Temporary increase
ulimit -n 65535
# Permanent change – edit /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535

Kernel parameters (add to /etc/sysctl.conf and apply with sysctl -p):

# TCP backlog
net.ipv4.tcp_max_syn_backlog = 65535
# Socket buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# TIME_WAIT reuse
net.ipv4.tcp_tw_reuse = 1
# File‑descriptor limit
fs.file-max = 65535
# Memory behaviour
vm.swappiness = 10
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5

Application Configuration Tuning

JVM example:

-Xms4g -Xmx4g          # heap size
-XX:+UseG1GC          # G1 GC
-XX:MaxGCPauseMillis=200

Nginx example:

worker_processes auto;
worker_rlimit_nofile 65535;
events {
    worker_connections 65535;
    use epoll;
}

Monitoring and Alerting

Key alert thresholds (example):

Load > 0.8 × CPU cores.

Memory usage > 85 %.

Disk utilisation > 80 %.

CPU iowait > 20 %.

Sudden spike in network connections.

Implement with Prometheus + Grafana or cloud monitoring services.
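
If a full monitoring stack is not yet in place, even a small cron‑driven check against the first two thresholds helps; a minimal sketch (the alert action is a placeholder echo):

#!/bin/bash
# Alert when the 1-minute load exceeds 0.8 x cores or memory usage exceeds 85 %
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
mem_pct=$(free | awk '/^Mem:/ {printf "%d", ($2 - $7) * 100 / $2}')

awk -v l="$load1" -v c="$cores" 'BEGIN { if (l > 0.8 * c) print "ALERT: high load " l " on " c " cores" }'
[ "$mem_pct" -gt 85 ] && echo "ALERT: memory usage at ${mem_pct}%"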

Production Real‑World Cases

Java Full GC Causing Service Hang

Background : During a sales event, order API latency rose from 200 ms to 5‑10 s.

Investigation :

Load average 12.85 while CPU idle (checked with uptime and top).

Frequent Full GC observed with jstat -gc 12345 1000 10 (each > 2 s).

Heap inspection (jmap -heap 12345) showed an 8 GB heap with the old generation > 95 % used.

GC logs confirmed allocation rate outpacing reclamation.

Root cause : Oversized JVM heap (8 GB) caused long Full GC pauses.

Fix :

Reduce heap to 4 GB.

Switch to G1 GC with tuned pause target.

Optimise application code to lower allocation rate.

Post‑fix jstat showed far fewer Full GCs and response time returned to < 200 ms.

Nginx Connection Surge Leading to 502 Errors

Background : Site returned 502 Bad Gateway.

Investigation :

Nginx running, but netstat -an | grep :80 | wc -l showed > 60 000 connections.

Connection‑state breakdown revealed many TIME_WAIT sockets.

PHP‑FPM had only pm.max_children = 50 while connections exceeded 10 000.

Root cause : PHP‑FPM max children too low for traffic.

Fix (applied and reloaded):

# /etc/php-fpm.d/www.conf
pm.max_children = 200
pm.start_servers = 50
pm.min_spare_servers = 50
pm.max_spare_servers = 200
pm.max_requests = 500

systemctl restart php-fpm

# Nginx keep‑alive and timeout tuning
proxy_connect_timeout 60s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
upstream backend {
    server 127.0.0.1:9000;
    keepalive 200;
}

MySQL Slow Query Causing Site Slowness

Background : Forum homepage load > 10 s.

Investigation :

Web server metrics normal; MySQL process list showed queries > 30 s.

Enabled slow‑query log; identified unindexed JOIN on posts table. EXPLAIN revealed full table scan; missing composite index on user_id, created_at.
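
For reference, the full scan and the effect of the new index can be verified with EXPLAIN through the mysql client (a sketch using the same query as in the fix below):

mysql -u root -p mydb -e "EXPLAIN SELECT id, title, created_at FROM posts WHERE user_id = 123\G"
# before the index: type: ALL (full table scan); after: type: ref, key: idx_user_created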

Fix (performed during low traffic):

# Backup
mysqldump -u root -p mydb posts > /backup/posts.sql
# Add index
ALTER TABLE posts ADD INDEX idx_user_created (user_id, created_at);
# Optimise query
SELECT id, title, created_at FROM posts WHERE user_id = 123;

DDoS SYN Flood Attack

Background : Bandwidth saturated, SSH unreachable.

Investigation :

High inbound bytes in /proc/net/dev.

Top source IPs identified via netstat -an aggregation. ss -tan state syn-recv | wc -l showed > 50 000 SYN_RECV sockets – classic SYN flood.
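
One way to do that aggregation (a sketch; field positions match typical netstat -an output for IPv4):

# Count half-open connections per source IP, busiest first
netstat -an | awk '$6 == "SYN_RECV" {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn | head -20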

Mitigation :

Block offending IPs: iptables -I INPUT -s 1.2.3.4 -j DROP (or /24).

Enable SYN‑cookies and enlarge backlog:

# /etc/sysctl.conf additions
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
sysctl -p

Deploy fail2ban for automatic banning.

For large‑scale attacks, engage cloud DDoS protection services.

Frequently Used Tuning Parameters Summary

Kernel Settings (add to /etc/sysctl.conf)

net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_max_tw_buckets = 5000
net.core.somaxconn = 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
fs.file-max = 65535
fs.inotify.max_user_watches = 524288
vm.swappiness = 10
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5

Apply with sysctl -p.

Limits.conf

* soft nofile 65535
* hard nofile 65535
* soft nproc 65535
* hard nproc 65535

Simple Monitoring Script (run daily via cron)

#!/bin/bash
LOG_FILE="/var/log/performance_$(date +%Y%m%d).log"

echo "=== $(date) ===" >> $LOG_FILE

echo "=== System Load ===" >> $LOG_FILE
uptime >> $LOG_FILE

echo "=== Memory Usage ===" >> $LOG_FILE
free -h >> $LOG_FILE

echo "=== Top 10 CPU Processes ===" >> $LOG_FILE
ps aux --sort=-%cpu | head -11 >> $LOG_FILE

echo "=== Top 10 Memory Processes ===" >> $LOG_FILE
ps aux --sort=-%mem | head -11 >> $LOG_FILE

echo "=== Disk I/O ===" >> $LOG_FILE
iostat -x 1 1 >> $LOG_FILE

echo "=== Network Connections ===" >> $LOG_FILE
ss -s >> $LOG_FILE

echo "=== Done ===" >> $LOG_FILE

Add to crontab -e:

0 9 * * * /bin/bash /opt/scripts/performance_monitor.sh

Final Takeaways

Adopt a systematic order – overall view first, then drill down.

Start with CPU, then memory, then I/O and network.

Correlate multiple metrics; a single indicator can be misleading.

Identify the root cause before taking remedial actions.

Performance tuning is continuous: build robust monitoring, perform regular health checks, test changes before production rollout, and document incidents to grow institutional knowledge.
