
Master Linux Ops: 20 Advanced Commands, 5 Performance Tweaks, and Real-World Case Studies

From midnight alerts of service timeouts to hidden CPU hogs and memory leaks, this guide walks Linux operators through 20 advanced commands, five key performance indicators, real‑world troubleshooting scripts, and a comprehensive optimization checklist, enabling proactive system health management.

Raymond Ops

Introduction

Many operators have been woken at 2 a.m. by an alert like “online service response timeout, massive user complaints.” Where do you start troubleshooting? Which commands do you reach for? How do you locate the problem quickly? This guide walks through the answers.

What You Will Gain

🔧 Advanced usage of 20 high‑frequency operations commands (e.g., top -Hp, strace -c).

📊 Optimization methods for five key system performance indicators (CPU, memory, disk I/O, network, process management).

🚀 Three real production‑environment case studies with complete investigation steps and reusable scripts.

💡 A ready‑to‑use system‑optimization checklist for daily inspection and fault prevention.

1. Advanced Basic Commands – From “Can Use” to “Use Well”

1.1 CPU Investigation Trio: top, htop, pidstat

Most users only look at overall CPU usage with top. Experts first find the high‑CPU process PID, then inspect thread‑level usage:

# Find the PID of the high‑CPU process (e.g., 12345)
top -c

# Show CPU usage of each thread of that process
top -Hp 12345

# Convert thread ID to hexadecimal for jstack matching
printf "%x\n" 12356

Pitfall: Checking only process‑level CPU once hid a Java GC thread that had been burning CPU for three hours. Correct practice: always use the -Hp option to drill down to thread‑level detail.
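For a Java service, the hex thread ID from the printf step can be matched against a thread dump. A minimal sketch, using the hypothetical PID 12345 and thread ID 12356 from the example above:

```shell
# Hypothetical PID/TID taken from the top -Hp example above
pid=12345
tid=12356
tid_hex=$(printf '%x' "$tid")              # 12356 -> 3044
# jstack labels each thread with nid=0x<hex TID>; show the matching stack
jstack "$pid" | grep -A 10 "nid=0x${tid_hex}"
```

This pinpoints exactly which Java thread (GC, compiler, or application) owns the hot CPU time.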

1.2 Real‑Time Resource Monitoring for a Specific Process

# Refresh every 2 seconds for PID 12345
pidstat -u -r -d -t -p 12345 2
# -u: CPU, -r: memory, -d: disk I/O, -t: thread view

1.3 Memory Investigation Beyond free -h

# Show the tail of the memory map (the last line is the total; watch it grow)
pmap -x 12345 | tail -5

# Continuously monitor memory growth every 5 seconds
while true; do
  date >> mem_monitor.log
  ps -o pid=,rss=,vsz=,comm= -p 12345 >> mem_monitor.log   # avoids fragile grep matching
  sleep 5
done

Pitfall: Using only free -h showed total memory dropping but not which process was leaking. Solution: sort processes by swap usage with smem -rs swap -p, which identified the culprit instantly.
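If smem is not installed, a rough equivalent can be assembled from /proc. A sketch that ranks processes by their VmSwap counter (kernel threads without a status file are skipped):

```shell
# Sum per-process swap usage from /proc/<pid>/status and rank descending
for d in /proc/[0-9]*; do
  swap=$(awk '/^VmSwap:/ {print $2}' "$d/status" 2>/dev/null)
  if [ -n "$swap" ] && [ "$swap" -gt 0 ]; then
    printf '%8s kB  %s\n' "$swap" "$(cat "$d/comm")"
  fi
done | sort -rn | head -10
```

The top entries are the processes most heavily pushed into swap, which is usually where a leak shows up first.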

1.4 Deep Disk I/O Analysis with iotop

# Show only processes with I/O activity, refresh every 2 seconds
iotop -oP -d 2
# To inspect a specific process’s I/O details
cat /proc/12345/io

Real‑world tip: A MySQL slowdown was traced to excessive log writes; changing innodb_flush_log_at_trx_commit from 1 to 2 improved write performance fivefold.
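The fix from that incident looks like this in my.cnf (a sketch; the trade‑off is that value 2 only fsyncs the redo log once per second, so up to one second of committed transactions can be lost on a crash):

```ini
[mysqld]
# 1 = write + fsync redo log on every commit (safest, slowest)
# 2 = write on commit, fsync once per second (faster, may lose <= 1 s on crash)
innodb_flush_log_at_trx_commit = 2
```

Only use 2 where a one‑second window of data loss is acceptable, e.g. for log or analytics tables.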

2. System Optimization in Practice – From Theory to Implementation

2.1 CPU Optimization – Beyond Nice Values

Solution 1: CPU affinity binding

# Bind Nginx workers to specific CPU cores (nginx.conf)
worker_processes 4;
worker_cpu_affinity 0001 0010 0100 1000;
# Verify the binding of each worker process (pgrep may return several PIDs)
for pid in $(pgrep nginx); do taskset -cp "$pid"; done

Binding reduces cache misses and can boost performance by ~15%.

Solution 2: Interrupt load balancing

# View current interrupt distribution
cat /proc/interrupts
# Bind the network‑card interrupt (IRQ 24) to CPU 1 (bitmask 0x2)
echo 2 > /proc/irq/24/smp_affinity
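smp_affinity takes a hex CPU bitmask, which is easy to get wrong by hand. A small sketch that computes the mask for a given CPU (IRQ number 24 is carried over from the example above):

```shell
# Compute the hex affinity bitmask for a target CPU (bit N = CPU N)
cpu=1
mask=$(printf '%x' $((1 << cpu)))          # CPU 1 -> mask 2, CPU 3 -> mask 8
echo "$mask" > /proc/irq/24/smp_affinity   # requires root
```

For multi-CPU masks, OR the individual bits together before converting to hex.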

2.2 Memory Optimization – Proper Swappiness Settings

# Check current value
cat /proc/sys/vm/swappiness
# Temporary change (lost after reboot)
echo 10 > /proc/sys/vm/swappiness
# Permanent change
echo "vm.swappiness = 10" >> /etc/sysctl.conf
sysctl -p

Recommended values differ by role: database servers 1‑10, application servers 30‑60, desktop 60. Setting swappiness to 0 can cause OOM kills under memory pressure.

2.3 Network Optimization – TCP Parameter Tuning

# Increase TCP connection queue
echo 'net.core.somaxconn = 65535' >> /etc/sysctl.conf
# Raise SYN backlog
echo 'net.ipv4.tcp_max_syn_backlog = 8192' >> /etc/sysctl.conf
# Reuse TIME_WAIT sockets (safe), do NOT enable tcp_tw_recycle in production
echo 'net.ipv4.tcp_tw_reuse = 1' >> /etc/sysctl.conf
sysctl -p

Note: tcp_tw_recycle breaks connections behind NAT and must never be enabled in production.
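After sysctl -p, it is worth confirming the kernel actually picked the new values up. A quick read‑back via /proc (standard paths on Linux):

```shell
# Print the effective value of each tuned parameter
for f in net/core/somaxconn net/ipv4/tcp_max_syn_backlog net/ipv4/tcp_tw_reuse; do
  printf '%-32s %s\n' "$f" "$(cat /proc/sys/$f)"
done
```

If a value did not change, check for a conflicting setting in /etc/sysctl.d/ that is applied after sysctl.conf.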

3. Real‑World Cases – Full Fault‑Investigation Process

Case 1: CPU 100 % with No Visible Process

Symptom: Monitoring shows 100 % CPU, but top shows no high‑CPU process.

# Step 1: Rank all processes by CPU, keeping PID and command visible
ps aux --sort=-%cpu | head -10

# Step 2: Check kernel threads
ps aux | grep "\[.*\]"

# Step 3: Examine I/O wait (watch the %wa field in top's summary line)
top

# Step 4: Identify I/O bottleneck
iotop -oP
# Result: rsync backup caused high I/O wait, making CPU appear busy

Resolution:

Switch rsync to incremental backup: rsync -avz --delete

Limit bandwidth: --bwlimit=10240 (10 MB/s)

Schedule backups during low‑traffic periods.
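Put together, the resolution above becomes a small wrapper script (paths and host are hypothetical) that can be scheduled from cron in the low‑traffic window:

```shell
#!/bin/bash
# backup.sh - incremental, bandwidth-limited rsync backup
# cron entry (03:30, low traffic): 30 3 * * * /usr/local/bin/backup.sh
BWLIMIT_KB=10240                 # 10240 KB/s = 10 MB/s
SRC=/data/app/                   # hypothetical source directory
DEST=backup-host:/backups/app/   # hypothetical destination
rsync -avz --delete --bwlimit="$BWLIMIT_KB" "$SRC" "$DEST"
```

The trailing slash on SRC matters: it copies the directory's contents rather than creating a nested app/app/ on the destination.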

Case 2: Memory Leak Cascade

A reusable Bash script detects processes with significant memory growth over a minute.

#!/bin/bash
# check_memory_leak.sh – detect possible memory leaks

echo "=== Memory Leak Detection Script ==="
echo "Monitoring memory usage for 60 seconds..."

tmpfile=$(mktemp)
# First sample
ps aux --sort=-%mem | head -20 | awk '{print $2,$4,$11}' > ${tmpfile}.1
sleep 60
# Second sample
ps aux --sort=-%mem | head -20 | awk '{print $2,$4,$11}' > ${tmpfile}.2

echo -e "\n=== Processes with significant memory growth ==="
echo "PID    MEM_BEFORE  MEM_AFTER  GROWTH  COMMAND"
while read pid mem1 cmd1; do
  mem2=$(grep "^${pid} " ${tmpfile}.2 | awk '{print $2}')
  if [ -n "$mem2" ]; then
    growth=$(echo "$mem2 - $mem1" | bc)
    if (( $(echo "$growth > 0.5" | bc -l) )); then
      printf "%-6s %-10s %-10s %-7s %s\n" "$pid" "$mem1%" "$mem2%" "+$growth%" "$cmd1"
    fi
  fi
done < ${tmpfile}.1
rm -f ${tmpfile}*

echo -e "\n=== Recommendation ==="
echo "For suspicious processes, use: pmap -x PID"
echo "Or check memory maps: cat /proc/PID/smaps"

Usage:

chmod +x check_memory_leak.sh
./check_memory_leak.sh

4. System‑Optimization Checklist (Daily / Weekly / Monthly)

Daily Checks

CPU load: uptime – ensure 1/5/15‑minute load < 0.7 × CPU cores.

Memory: free -h – keep free memory > 20 %.

Disk space: df -h – all partitions < 80 % usage.

Critical services: systemctl status nginx mysql redis – verify they are running.
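The four daily checks can be scripted so they only make noise when a threshold from the list is crossed. A sketch (service names and thresholds follow the checklist above):

```shell
#!/bin/bash
# daily_check.sh - warn only when checklist thresholds are exceeded
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
# CPU: warn when 1-minute load exceeds 0.7 x core count
awk -v l="$load1" -v c="$cores" 'BEGIN { if (l > 0.7 * c) print "WARN: load " l " > 0.7 x " c " cores" }'
# Memory: warn when available memory drops below 20 %
free | awk '/^Mem:/ { if ($7 / $2 < 0.2) print "WARN: available memory below 20%" }'
# Disk: warn on partitions above 80 %
df -h --output=pcent,target 2>/dev/null | awk 'NR > 1 && int($1) > 80 { print "WARN: " $2 " at " $1 }'
# Services: warn if any critical unit is not active
for svc in nginx mysql redis; do
  systemctl is-active --quiet "$svc" || echo "WARN: $svc is not running"
done
```

Run it from cron each morning; an empty output means all daily checks passed.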

Weekly Optimizations

Clean old logs: find /var/log -name "*.log" -mtime +30 -exec rm {} \;

Detect zombie processes: ps aux | grep defunct

Analyze MySQL slow queries.

Apply security updates: yum update --security or apt-get upgrade.

Monthly Deep‑Dive

Defragment ext4 (if needed): e4defrag /dev/sda1

TCP connection analysis: ss -s

Review kernel parameters in /etc/sysctl.conf against best practices.
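The monthly sysctl review can be partly automated by comparing each value in /etc/sysctl.conf with what the kernel is actually running. A sketch, assuming the simple `key = value` format used throughout this article:

```shell
#!/bin/bash
# sysctl_drift.sh - flag parameters whose running value differs from sysctl.conf
grep -E '^[a-z]' /etc/sysctl.conf | while IFS='=' read -r key val; do
  key=$(echo "$key" | tr -d '[:space:]')
  val=$(echo "$val" | tr -d '[:space:]')
  cur=$(sysctl -n "$key" 2>/dev/null) || continue
  [ "$cur" != "$val" ] && echo "DRIFT: $key running=$cur configured=$val"
done
```

Drift usually means someone changed a value at runtime and forgot to persist it, or a file in /etc/sysctl.d/ overrides sysctl.conf.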

Conclusion

By mastering the combination of advanced commands, concrete optimization techniques, and systematic troubleshooting scripts, operators can shift from reactive “fire‑fighting” to proactive system architecture, detecting warning signs early and restoring services swiftly.

Technical References

GitHub repository: https://github.com/raymond999999

Gitee repository: https://gitee.com/raymond9

Tags: Performance Tuning, Linux, Shell Commands