Master Linux Ops: 20 Advanced Commands, 5 Performance Tweaks, and Real-World Case Studies
From midnight alerts of service timeouts to hidden CPU hogs and memory leaks, this guide walks Linux operators through 20 advanced commands, five performance metrics, real‑world troubleshooting scripts, and a comprehensive optimization checklist, enabling proactive system health management.
Introduction
Many operators have been woken up at 2 a.m. by alerts such as “online service response timeout, massive user complaints.” Where do you start troubleshooting, which commands do you reach for, and how do you locate the problem quickly? This guide answers those questions.
What You Will Gain
🔧 Advanced usage of 20 high‑frequency operations commands (e.g., top -Hp, strace -c).
📊 Optimization methods for five key system performance indicators (CPU, memory, disk I/O, network, process management).
🚀 Three real production‑environment case studies with complete investigation steps and reusable scripts.
💡 A ready‑to‑use system‑optimization checklist for daily inspection and fault prevention.
1. Advanced Basic Commands – From “Can Use” to “Use Well”
1.1 CPU Investigation Trio: top, htop, pidstat
Most users only look at overall CPU usage with top. Experts first find the high‑CPU process PID, then inspect thread‑level usage:
# Find the PID of the high‑CPU process (e.g., 12345)
top -c
# Show CPU usage of each thread of that process
top -Hp 12345
# Convert thread ID to hexadecimal for jstack matching
printf "%x
" 12356Pitfall: Only checking process‑level CPU missed a Java GC thread that consumed CPU for three hours. Correct practice: Always use the -Hp option to view thread‑level details.
1.2 Real‑Time Resource Monitoring for a Specific Process
# Refresh every 2 seconds for PID 12345
pidstat -u -r -d -t -p 12345 2
# -u: CPU, -r: memory, -d: disk I/O, -t: thread view
1.3 Memory Investigation Beyond free -h
# Show memory map and highlight growing segments
pmap -x 12345 | tail -5
# Continuously monitor memory growth every 5 seconds
while true; do
date >> mem_monitor.log
ps aux | grep 12345 | grep -v grep >> mem_monitor.log
sleep 5
done
Pitfall: Using only free -h showed total memory dropping but not which process was leaking. Solution: smem -rs swap -p, which sorts processes by swap usage, identified the culprit instantly.
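If smem is not installed, a rough per‑process swap ranking can be pulled straight from /proc; a minimal sketch (top 10 by VmSwap, values in kB):
# Rank processes by swapped-out memory using /proc/<pid>/status
for d in /proc/[0-9]*; do
  awk -v pid="${d##*/}" '/^Name/ {n=$2} /^VmSwap/ {print $2 " kB", pid, n}' "$d/status" 2>/dev/null
done | sort -rn | head -10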
1.4 Deep Disk I/O Analysis with iotop
# Show only processes with I/O activity, refresh every 2 seconds
iotop -oP -d 2
# To inspect a specific process’s I/O details
cat /proc/12345/io
Real‑world tip: A MySQL slowdown was traced to excessive log writes; changing innodb_flush_log_at_trx_commit from 1 to 2 improved write performance fivefold.
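A hedged sketch of that change (assumes mysql client credentials are configured; the config path is an example, and a value of 2 relaxes durability, so roughly the last second of transactions can be lost if the OS crashes):
# Check the current setting
mysql -e "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';"
# Apply at runtime without a restart (lost on restart)
mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2;"
# Persist it in /etc/my.cnf under the [mysqld] section:
#   innodb_flush_log_at_trx_commit = 2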
2. System Optimization in Practice – From Theory to Implementation
2.1 CPU Optimization – Beyond Nice Values
Solution 1: CPU affinity binding
# Bind Nginx workers to specific CPU cores (nginx.conf)
worker_processes 4;
worker_cpu_affinity 0001 0010 0100 1000;
# Verify binding for each worker process (taskset takes one PID at a time)
for pid in $(pgrep nginx); do taskset -cp "$pid"; done
Binding reduces cache misses and can boost performance by ~15%.
Solution 2: Interrupt load balancing
# View current interrupt distribution
cat /proc/interrupts
# Bind network‑card interrupt (IRQ 24) to a CPU
echo 2 > /proc/irq/24/smp_affinity
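For multi‑queue NICs, each queue's IRQ can be pinned to its own core. A sketch, assuming the interface is eth0 and IRQs 24-26 belong to its queues; note that a running irqbalance daemon may overwrite manual settings:
# Find the IRQ numbers used by the NIC queues
grep eth0 /proc/interrupts
# Spread them across cores (bitmask: 1 = CPU0, 2 = CPU1, 4 = CPU2)
echo 1 > /proc/irq/24/smp_affinity
echo 2 > /proc/irq/25/smp_affinity
echo 4 > /proc/irq/26/smp_affinity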
2.2 Memory Optimization – Proper Swappiness Settings
# Check current value
cat /proc/sys/vm/swappiness
# Temporary change (lost after reboot)
echo 10 > /proc/sys/vm/swappiness
# Permanent change
echo "vm.swappiness = 10" >> /etc/sysctl.conf
sysctl -p
Recommended values differ by role: database servers 1‑10, application servers 30‑60, desktops 60. Setting swappiness to 0 can cause OOM kills under memory pressure.
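On distributions that load /etc/sysctl.d/ (most modern systemd‑based systems), a drop‑in file keeps the change separate from the stock config; a minimal sketch:
# Drop-in file, applied on boot and by sysctl --system
echo "vm.swappiness = 10" > /etc/sysctl.d/99-swappiness.conf
sysctl --system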
2.3 Network Optimization – TCP Parameter Tuning
# Increase TCP connection queue
echo 'net.core.somaxconn = 65535' >> /etc/sysctl.conf
# Raise SYN backlog
echo 'net.ipv4.tcp_max_syn_backlog = 8192' >> /etc/sysctl.conf
# Reuse TIME_WAIT sockets (safe), do NOT enable tcp_tw_recycle in production
echo 'net.ipv4.tcp_tw_reuse = 1' >> /etc/sysctl.conf
sysctl -p
Note: tcp_tw_recycle breaks connections for clients behind NAT and must never be enabled in production; it was removed from the kernel entirely in Linux 4.12.
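A quick way to confirm the new values are active and to watch the listen‑queue overflows they are meant to prevent (a sketch; counter wording varies between kernels):
# Confirm the running values
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_tw_reuse
# Socket summary and listen-queue overflow counters
ss -s
netstat -s | grep -i -E "overflow|listen"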
3. Real‑World Cases – Full Fault‑Investigation Process
Case 1: CPU 100 % with No Visible Process
Symptom: Monitoring shows 100 % CPU, but top shows no high‑CPU process.
# Step 1: Look for hidden processes
ps aux --sort=-%cpu | head -10
# Step 2: Check kernel threads
ps aux | grep "\[.*\]"
# Step 3: Examine I/O wait (watch the %wa value in top's CPU summary line)
top
# Step 4: Identify I/O bottleneck
iotop -oP
# Result: rsync backup caused high I/O wait, making the CPU appear busy
Resolution:
Switch rsync to incremental backup: rsync -avz --delete
Limit bandwidth: --bwlimit=10240 (10 MB/s)
Schedule backups during low‑traffic periods (a cron sketch follows).
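A sketch combining all three fixes in a single cron entry (paths, host, and schedule are placeholders):
# /etc/cron.d/data-backup: bandwidth-limited incremental backup at 03:00
0 3 * * * root rsync -avz --delete --bwlimit=10240 /data/ backup-host:/backup/data/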
Case 2: Memory Leak Cascade
A reusable Bash script detects processes with significant memory growth over a minute.
#!/bin/bash
# check_memory_leak.sh – detect possible memory leaks
echo "=== Memory Leak Detection Script ==="
echo "Monitoring memory usage for 60 seconds..."
tmpfile=$(mktemp)
# First sample
ps aux --sort=-%mem | awk 'NR>1 {print $2,$4,$11}' | head -20
sleep 60
# Second sample
ps aux --sort=-%mem | awk 'NR>1 {print $2,$4,$11}' | head -20
echo -e "
=== Processes with significant memory growth ==="
echo "PID MEM_BEFORE MEM_AFTER GROWTH COMMAND"
while read pid mem1 cmd1; do
mem2=$(grep "^${pid} " ${tmpfile}.2 | awk '{print $2}')
if [ -n "$mem2" ]; then
growth=$(echo "$mem2 - $mem1" | bc)
if (( $(echo "$growth > 0.5" | bc -l) )); then
printf "%-6s %-10s %-10s %-7s %s
" "$pid" "$mem1%" "$mem2%" "+$growth%" "$cmd1"
fi
fi
done < ${tmpfile}.1
rm -f ${tmpfile}*
echo -e "
=== Recommendation ==="
echo "For suspicious processes, use: pmap -x PID"
echo "Or check memory maps: cat /proc/PID/smaps"Usage:
chmod +x check_memory_leak.sh
./check_memory_leak.sh
4. System‑Optimization Checklist (Daily / Weekly / Monthly)
Daily Checks
CPU load: uptime – ensure 1/5/15‑minute load < 0.7 × CPU cores.
Memory: free -h – keep free memory > 20 %.
Disk space: df -h – all partitions < 80 % usage.
Critical services: systemctl status nginx mysql redis – verify they are running (a combined check is sketched after this list).
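A minimal sketch bundling the daily checks into one script (the 80 % threshold and service names are examples):
#!/bin/bash
# daily_check.sh: quick morning health check
uptime
free -h
# Warn on any partition above 80 % usage
df -h | awk 'NR>1 && $5+0 > 80 {print "WARN: " $6 " at " $5}'
# Warn on any critical service that is not active
for svc in nginx mysql redis; do
  systemctl is-active --quiet "$svc" || echo "WARN: $svc is not running"
done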
Weekly Optimizations
Clean old logs: find /var/log -name "*.log" -mtime +30 -exec rm {} \;
Detect zombie processes: ps aux | grep defunct
Analyze MySQL slow queries (collected as one‑liners in the sketch after this list).
Apply security updates: yum update --security or apt-get upgrade.
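The weekly items above as ready‑to‑paste one‑liners (the slow‑log path is an example; check slow_query_log_file on your server):
# Remove logs older than 30 days (-delete is safer than -exec rm)
find /var/log -name "*.log" -mtime +30 -delete
# List zombie processes with their parent PIDs
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'
# Top 10 slowest queries from the MySQL slow log
mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log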
Monthly Deep‑Dive
Defragment ext4 (if needed): e4defrag /dev/sda1
TCP connection analysis: ss -s
Review kernel parameters in /etc/sysctl.conf against best practices.
Conclusion
By mastering the combination of advanced commands, concrete optimization techniques, and systematic troubleshooting scripts, operators can shift from reactive “fire‑fighting” to proactive system management, detecting warning signs early and restoring services swiftly.