Master Linux Production Performance: A Complete Hands‑On Optimization Guide
This comprehensive guide walks you through building a monitoring baseline, tuning CPU, memory, disk I/O, and network settings, applying kernel and application optimizations, and setting up automated monitoring and alerts to dramatically improve Linux production system performance.
Master Linux Production Performance: A Complete Hands‑On Optimization Guide
Guide: As a veteran operations engineer, I share a full set of Linux performance‑tuning experiences to fundamentally improve system performance and avoid incidents.
Why performance tuning matters
In high‑concurrency internet services, a 1 second delay can reduce conversion by 7 %. Tuning kernel parameters can cut response time from 800 ms to 200 ms, boosting business by 30 %.
Step 1: Build a performance‑monitoring baseline
Core metric dimensions
Before tuning, establish a complete monitoring system – the “five‑dimensional monitoring model”.
CPU dimension
# Real‑time CPU usage and load
top -p $(pgrep -d ',' your_process)
htop -p $(pgrep your_process)
# CPU details
lscpu
cat /proc/cpuinfo | grep "processor\|model name"
# Context‑switch monitoring
vmstat 1 5
sar -w 1 5Memory dimension
# Memory usage details
free -h
cat /proc/meminfo
ps aux --sort=-%mem | head -10
# Detect memory leaks
valgrind --tool=memcheck --leak-check=full ./your_programDisk I/O dimension
# Disk I/O performance
iostat -x 1 5
iotop -o
lsblk -f
# Disk space monitoring
df -h
du -sh /* | sort -hrNetwork dimension
# Network traffic monitoring
iftop -i eth0
nethogs
ss -tuln
# Connection status
netstat -ant | awk '{print $6}' | sort | uniq -cProcess dimension
# Process resource consumption
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head
pstree -p
lsof -p PIDBaseline collection script
#!/bin/bash
# performance_baseline.sh – collect system baseline
echo "=== System Performance Baseline $(date) ==="
# (script continues with CPU, memory, disk, network, and load info collection)Step 2: CPU performance tuning
CPU affinity
# View CPU topology
lscpu -e
numactl --hardware
# Bind process to specific cores
taskset -cp 0-3 PID
numactl --cpubind=0 --membind=0 your_program
# Interrupt balancing
echo 2 > /proc/irq/24/smp_affinity
systemctl enable irqbalanceProcess scheduling
# Adjust process priority
nice -n -10 your_critical_process
renice -10 PID
# Real‑time scheduling
chrt -f -p 99 PID
chrt -r -p 50 PIDCPU frequency management
# Set performance governor
echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
cpupower frequency-set -g performanceStep 3: Memory tuning
Memory parameter tuning
# Reduce swap usage
echo 1 > /proc/sys/vm/swappiness
# Adjust dirty page ratios
echo 1 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
echo 60 > /proc/sys/vm/dirty_expire_centisecs
# Drop caches
echo 1 > /proc/sys/vm/drop_caches
# Transparent hugepage
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defragMemory leak detection script
#!/bin/bash
# memory_leak_monitor.sh
# (script monitors a process and sends alerts when memory usage exceeds a threshold)Huge page configuration
# Allocate huge pages
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# Mount hugepage filesystem
mount -t hugetlbfs nodev /mnt/hugepages
echo 'nodev /mnt/hugepages hugetlbfs defaults 0 0' >> /etc/fstabStep 4: Disk I/O optimization
Filesystem tuning
# ext4 mount options
mount -o remount,noatime,nodiratime,data=writeback /dev/sda1 /
# XFS mount options
mount -o remount,noatime,nodiratime,logbufs=8,logbsize=256k /dev/sdb1 /dataI/O scheduler
# View current scheduler
cat /sys/block/sda/queue/scheduler
# Set scheduler
# SSD – deadline or noop
echo deadline > /sys/block/sda/queue/scheduler
# HDD – cfq
echo cfq > /sys/block/sdb/queue/schedulerDisk performance testing
# Sequential read/write test
dd if=/dev/zero of=/tmp/testfile bs=1G count=1 oflag=direct
dd if=/tmp/testfile of=/dev/null bs=1G count=1 iflag=direct
# Random I/O test (fio required)
fio -filename=/tmp/fio_test -direct=1 -iodepth=64 -thread -rw=randrw -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=60 -group_reporting -name=randrw_testStep 5: Network performance tuning
TCP/IP parameters
# Append to /etc/sysctl.conf
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.netfilter.nf_conntrack_max = 1048576
net.ipv4.tcp_fin_timeout = 30
net.core.netdev_max_backlog = 5000
net.core.somaxconn = 65535
# (additional parameters omitted for brevity)
sysctl -pNIC multi‑queue configuration
# View NIC queues
ethtool -l eth0
# Set combined queues
ethtool -L eth0 combined 4
# IRQ affinity
echo 1 > /proc/irq/24/smp_affinity
echo 2 > /proc/irq/25/smp_affinityNetwork monitoring script
#!/bin/bash
# network_monitor.sh
INTERFACE="eth0"
LOG_FILE="/var/log/network_performance.log"
while true; do
RX_BYTES=$(cat /sys/class/net/$INTERFACE/statistics/rx_bytes)
TX_BYTES=$(cat /sys/class/net/$INTERFACE/statistics/tx_bytes)
RX_PACKETS=$(cat /sys/class/net/$INTERFACE/statistics/rx_packets)
TX_PACKETS=$(cat /sys/class/net/$INTERFACE/statistics/tx_packets)
echo "$(date): RX: $RX_BYTES bytes, $RX_PACKETS packets | TX: $TX_BYTES bytes, $TX_PACKETS packets" >> $LOG_FILE
sleep 10
doneStep 6: Kernel parameter tuning
System‑wide sysctl settings
# /etc/sysctl.conf example
kernel.pid_max = 4194304
kernel.threads-max = 4194304
fs.file-max = 6553600
vm.swappiness = 1
vm.dirty_background_ratio = 1
vm.dirty_ratio = 10
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_max_syn_backlog = 65536
# (additional parameters omitted)
sysctl -pFile‑descriptor limits
# /etc/security/limits.conf
* soft nofile 1048576
* hard nofile 1048576
root soft nofile 1048576
root hard nofile 1048576
# systemd limits
[Manager]
DefaultLimitNOFILE=1048576
DefaultLimitNPROC=1048576Step 7: Application‑layer tuning
Database optimization (MySQL/MariaDB)
# /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
innodb_buffer_pool_size = 8G
innodb_log_file_size = 512M
query_cache_type = 1
query_cache_size = 256M
max_connections = 1000
slow_query_log = 1
# (additional settings omitted)Web server (Nginx) tuning
# /etc/nginx/nginx.conf
worker_processes auto;
worker_cpu_affinity auto;
worker_rlimit_nofile 1048576;
events { worker_connections 65535; use epoll; multi_accept on; }
http {
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
gzip on;
open_file_cache max=200000 inactive=20s;
client_body_buffer_size 128k;
client_max_body_size 50m;
include /etc/nginx/conf.d/*.conf;
}Step 8: Troubleshooting toolbox
System analysis tools
# Install common tools
yum install -y sysstat iotop htop glances perf strace tcpdump wireshark-cli
apt-get install -y sysstat iotop htop glances linux-tools-generic strace tcpdump tsharkOne‑click health‑check script
#!/bin/bash
# system_health_check.sh
# (script prints CPU, memory, disk, network, process, and service status and raises alerts when thresholds are exceeded)Performance bottleneck finder
#!/bin/bash
# performance_bottleneck_finder.sh
# (script checks CPU idle, memory usage, and I/O wait and suggests actions)Step 9: Automated monitoring and alerting
Prometheus + Grafana configuration
# prometheus.yml (excerpt)
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'node-exporter'
static_configs:
- targets: ['localhost:9100']
- job_name: 'mysql-exporter'
static_configs:
- targets: ['localhost:9104']Alert rules
# alert_rules.yml (excerpt)
groups:
- name: system_alerts
rules:
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage detected"
description: "CPU usage is above 80% for more than 5 minutes"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage detected"
description: "Memory usage is above 90% for more than 5 minutes"
- alert: DiskSpaceLow
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
for: 1m
labels:
severity: warning
annotations:
summary: "Disk space low"
description: "Disk usage is above 85%"Step 10: Best‑practice checklist
System level: keep load < CPU cores, memory usage < 80 %, swap < 5 %, I/O wait < 10 %.
Application level: proper DB connection pool, cache hit rate > 90 %, enable static‑resource compression.
Network level: TCP connections within limits, latency reasonable, bandwidth monitored.
Core principles
Monitoring first : No monitoring, no optimization.
Baseline establishment : Define before‑after metrics.
Iterative tuning : Avoid massive one‑shot changes.
Test verification : Validate each change thoroughly.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Ops Community
A leading IT operations community where professionals share and grow together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
