Master Linux Outages: Proven Troubleshooting Strategies & Solutions for Common Ops Failures
This comprehensive guide walks you through systematic troubleshooting methods for frequent Linux operational incidents—covering CPU spikes, memory leaks, disk I/O bottlenecks, network glitches, database issues, container problems, and proactive monitoring—so you can quickly pinpoint root causes and restore services.
Learn from Linux Failures: A Comprehensive Collection of Common Ops Fault Diagnosis and Solutions
It is 3 a.m., the phone rings, and the monitoring alert shows a service outage. Every operations engineer knows this scenario. This article shares the troubleshooting mindset and solutions distilled from thousands of production incidents, to help you locate problems quickly and restore service when it matters most.
Introduction: Why Troubleshooting Ability Determines Your Ops Ceiling
What separates an excellent operations engineer from an average one is not the number of tools they know, but how clearly and efficiently they think when a fault occurs.
This article uses real cases to systematically share the most common Linux operational fault scenarios and their troubleshooting methods. Whether you are a newcomer or a veteran, you will find valuable insights.
1. Build a Systematic Fault‑Diagnosis Thinking Model
1.1 Golden Diagnosis Rule: the STEP Model
Before handling any incident, remember this model:
Symptom: accurately describe the observed issue.
Time: determine when the fault occurred.
Environment: understand the system environment and recent changes.
Problem: locate the root cause and resolve it.
1.2 Fault‑Priority Determination Matrix
Urgency × Impact = Handling Priority
P0: Core business completely down (immediate handling)
P1: Core business partially down (handle within 15 minutes)
P2: Non‑core impact (handle within 1 hour)
P3: Isolated user impact (handle in scheduled window)
2. CPU‑Related Fault Diagnosis in Practice
2.1 Diagnosing 100% CPU Usage
Symptom: System response is slow, and top shows CPU usage constantly at 100%.
Investigation Steps:
# 1. Find the process with the highest CPU consumption
top -c
# Press 'P' to sort by CPU and note the PID
# 2. View thread‑level CPU usage for the process
top -Hp [PID]
# 3. Get stack trace of the problematic thread
# Convert thread ID to hexadecimal
printf "%x\n" [ThreadID]
# 4. If it is a Java process, print the thread stack
# (jstack lists each native thread ID as nid=0x<hex>)
jstack [PID] | grep -A 20 "nid=0x[hexThreadID]"
# 5. Use perf to analyze CPU hotspots
perf top -p [PID]
perf record -p [PID] -g -- sleep 30
perf report
Real‑World Case: During a promotion period, an e‑commerce platform experienced a CPU spike.
Top showed a Java process consuming 800% CPU (8 cores fully loaded). jstack revealed many threads blocked on a synchronized method.
Code review found that log writing used a synchronized lock.
Solution: switch to an asynchronous logging framework; CPU dropped to 30% immediately.
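The lookup used in this case, going from a hot thread in top to its stack in jstack, depends on converting the decimal thread ID into the hexadecimal nid that jstack prints. A minimal sketch of that conversion follows; the function name and the PID and thread ID in the usage comment are illustrative, not taken from the incident.

```shell
# Convert a decimal Linux thread ID (as shown by `top -Hp`) into the
# "nid=0x..." token that HotSpot prints for each thread in jstack output.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# Usage with a hypothetical PID 4242 and thread ID 12345:
#   jstack 4242 | grep -A 20 "$(tid_to_nid 12345)"
```

Searching for the full nid=0x token rather than the bare hex digits avoids accidental matches against addresses elsewhere in the dump.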
Takeaways:
Common causes of high CPU: infinite loops, regex backtracking, frequent GC, lock contention.
Adopt tools like perf and flame graphs.
Establish CPU usage baselines to detect anomalies early.
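A baseline needs a number that can be recorded regularly. One way to get it without depending on the output format of top is to difference two samples of the aggregate cpu line in /proc/stat. The sketch below is one possible approach; the function name is made up for illustration, and it treats both idle and iowait time as not busy.

```shell
# Print the busy CPU percentage between two samples of the aggregate "cpu"
# line from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total[NR] = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9
            idle[NR]  = $5 + $6      # idle + iowait count as not busy
        }
        END {
            dt = total[2] - total[1]
            di = idle[2] - idle[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - di) / dt
        }'
}

# Live usage: two samples five seconds apart.
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

Appending the result to a file from cron once a minute is enough to see what normal looks like for a given host.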
2.2 Persistent High Load Average
Symptom: Load average exceeds 30 while CPU usage is only around 50%.
# 1. Check load and process status
uptime
vmstat 1 5
# 2. Analyze process state distribution
ps aux | awk '{print $8}' | sort | uniq -c
# 3. Find processes in D (uninterruptible sleep) state
# (matching on the state column also catches variants such as D+ or Ds)
ps -eo state,pid,ppid,cmd | awk '$1 ~ /^D/'
# 4. Analyze I/O wait
iostat -x 1 5
iotop -o
# 5. Inspect system calls
strace -c -p [PID]
Root Cause Analysis: High load with low CPU usage usually indicates I/O wait or lock wait, because Linux counts tasks in uninterruptible sleep (D state) toward the load average.
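Two quick checks make this diagnosis concrete: how the load compares with the number of cores, and which processes are actually stuck in D state. The helpers below are a sketch with made-up function names; the second one expects `ps -eo state=,pid=,comm=` style input and only prints the first word of the command name.

```shell
# Load average divided by core count; values well above 1 mean tasks are queuing.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Read "state pid comm" lines on stdin and print the PID and command of every
# process whose state starts with D (uninterruptible sleep).
list_d_state() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Live usage:
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
#   ps -eo state=,pid=,comm= | list_d_state
```

If the same PIDs show up in D state across several runs a few seconds apart, their wait channel and the iostat output for the device they use are the next places to look.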
3. Deep Dive into Memory Faults
3.1 Full Process for Memory Leak Diagnosis
Symptom: Memory usage keeps growing until an OOM kill occurs.
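To tell steady growth from a one-off spike, it can help to sample a suspect process's resident set size over time and look at the slope. The sketch below is one possible way to do that; the function name, PID 4242, and the log path in the usage comment are placeholders.

```shell
# Read "epoch_seconds rss_kb" samples on stdin and print the average growth
# in kB per minute between the first and the last sample.
rss_growth_per_min() {
    awk 'NR == 1 { t0 = $1; r0 = $2 }
         { t1 = $1; r1 = $2 }
         END {
             if (NR < 2 || t1 == t0) { print "0.0"; exit }
             printf "%.1f\n", (r1 - r0) * 60 / (t1 - t0)
         }'
}

# Collecting samples for a hypothetical PID 4242, one per minute:
#   while sleep 60; do
#       echo "$(date +%s) $(awk '/^VmRSS:/ {print $2}' /proc/4242/status)"
#   done >> /tmp/rss_4242.log
#   rss_growth_per_min < /tmp/rss_4242.log
```

A growth rate that stays positive for hours under flat traffic is a much stronger leak signal than a single large reading.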
# 1. View memory usage trends
free -h
cat /proc/meminfo
# 2. Find the process consuming the most memory
ps aux --sort=-%mem | head
# 3. Inspect process memory mapping
pmap -x [PID]
cat /proc/[PID]/status | grep -i vm
# 4. Memory leak detection for C/C++ programs
valgrind --leak-check=full --show-leak-kinds=all ./program
# 5. Java memory analysis
jmap -heap [PID] # JDK 8 and earlier; on JDK 9+ use: jhsdb jmap --heap --pid [PID]
jmap -histo:live [PID]
jmap -dump:format=b,file=heap.bin [PID]
# Use MAT to analyze the dump (jhat is only available up to JDK 8)
Real‑World Case: abnormal Redis memory growth.
# 1. Check Redis memory info
redis-cli info memory
# 2. Analyze large keys
redis-cli --bigkeys
# 3. Sample key distribution
redis-cli --memkeys
# 4. Problem discovered: a hash key with 10 million fields
# 5. Solution: split the large key and set reasonable expiration times
3.2 Cache Hit‑Rate Optimization
Metrics:
# View system cache status
free -h
cat /proc/meminfo | grep -E "Cached|Buffer"
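# View paging activity (sar -B reports page-in/page-out and fault rates; it does not show a hit rate directly)
sar -B 1 10
# Clear cache (use with caution in production); run sync first so dirty pages are written back
sync
echo 1 > /proc/sys/vm/drop_caches # page cache
echo 2 > /proc/sys/vm/drop_caches # dentries and inodes
echo 3 > /proc/sys/vm/drop_caches # all caches
4. Disk I/O Fault Handling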
4.1 I/O Performance Bottleneck Identification
Symptom: Database response slows, I/O wait stays high.
# 1. View disk I/O statistics
iostat -x 1 10
# Focus on %util, await, r/s, w/s
# 2. Find processes performing I/O
iotop -o -P
# 3. Trace specific I/O operations
blktrace -d /dev/sda -o trace
blkparse trace.* | head -n 100
# 4. View filesystem cache
slabtop
# 5. Analyze process I/O patterns
pidstat -d 1 10
4.2 Quick Disk Space Cleanup
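Before hunting for large files, it helps to know which filesystems are actually close to full. A minimal sketch, assuming POSIX `df -P` output and mount points without spaces; the function name and the default threshold of 80 are illustrative.

```shell
# Read `df -P` output on stdin and print "mountpoint use%" for every
# filesystem at or above the given usage threshold (default 80).
fs_over_threshold() {
    awk -v limit="${1:-80}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Live usage:
#   df -P | fs_over_threshold 85
```

Running the same filter over `df -Pi` would flag inode exhaustion, which produces "No space left on device" errors even when blocks are free.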
# 1. Find large directories quickly (add -a to du to include individual files)
du -xh / 2>/dev/null | grep -E '^[0-9.]+G' | sort -rh | head -n 20
# 2. Find recently modified large files
find / -type f -mtime -1 -size +100M 2>/dev/null
# 3. Find deleted but still‑held files
lsof | grep deleted
# 4. Check inode usage
df -i
# 5. Find directories with most inode consumption
for i in /*; do echo "$i: $(find "$i" -xdev -type f 2>/dev/null | wc -l)"; done
Classic Case: A log file was deleted, but its space was not reclaimed because a process still holds the file open and keeps writing to it.
# Problem: rm removed a large log, but df shows no space freed
# Reason: a process is still writing to the deleted file
# Solution 1: locate the holding process and make it reopen its log (or restart it)
lsof | grep deleted
kill -USR1 [nginx_pid] # nginx reopens its log files on USR1
# Solution 2: truncate the file instead of deleting it
> /var/log/large.log # recommended
5. Network Fault Diagnosis Techniques
5.1 Network Connection Fault Localization
# 1. Check connectivity
ping -c 4 targetIP
traceroute targetIP
mtr targetIP
# 2. Check DNS resolution
nslookup domain.com
dig +trace domain.com
# 3. Check port connectivity
telnet IP PORT
nc -zv IP PORT
# 4. View connection state
ss -antp
netstat -antp | awk '{print $6}' | sort | uniq -c
# 5. Capture packets for analysis
tcpdump -i eth0 -nn -s0 -v port 80
tcpdump -i eth0 -w capture.pcap # analyze later with Wireshark
5.2 Excessive TIME_WAIT Connections
Symptom: Large number of TIME_WAIT sockets exhaust ports.
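To quantify this symptom, a tally of sockets per TCP state shows whether TIME-WAIT really dominates. The sketch below assumes the column layout of `ss -ant`, where the state is the first field and the first line is a header; the function name is made up.

```shell
# Read `ss -ant` output on stdin and print a count per TCP state,
# largest first.
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Live usage:
#   ss -ant | count_tcp_states
#   cat /proc/sys/net/ipv4/ip_local_port_range   # ephemeral ports available
```

Comparing the TIME-WAIT count with the size of the ephemeral port range indicates how close the host is to running out of source ports for outbound connections.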
# View TIME_WAIT count
ss -ant | grep TIME-WAIT | wc -l
# Tune kernel parameters
cat >> /etc/sysctl.conf <<EOF
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tw_reuse = 1
# net.ipv4.tcp_tw_recycle is intentionally not set: it breaks clients behind NAT and was removed in Linux 4.12
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_fin_timeout = 30
EOF
sysctl -p
5.3 Network Performance Optimization
# 1. View traffic
iftop -i eth0
nethogs
# 2. View errors and packet loss
ip -s link show eth0
ethtool -S eth0
# 3. Optimize socket buffers
echo 'net.core.rmem_max = 134217728' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 65536 134217728' >> /etc/sysctl.conf
6. Process and Service Fault Handling
6.1 Zombie Process Handling
# Find zombie processes
ps aux | grep defunct
ps aux | awk '$8 ~ /^Z/ { print }'
# Find parent PIDs of zombies
ps -ef | grep defunct | grep -v grep | awk '{print $3}' | sort | uniq
# Remedy
# 1. Send SIGCHLD to the parent
kill -SIGCHLD [parentPID]
# 2. If ineffective, restart the parent service
systemctl restart [serviceName]
6.2 Service Startup Failure Diagnosis
# 1. Check service status and logs
systemctl status service_name
journalctl -u service_name -n 50
# 2. Check port occupation
ss -tlnp | grep :port
lsof -i :port
# 3. Verify configuration syntax
nginx -t
httpd -t
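mysqld --validate-config # MySQL 8.0.16+; checks the option files without starting the server
mysql --help --verbose | grep -A 1 "Default options" # shows which option files are read, in order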
# 4. Check permissions
ls -la /path/to/service
namei -l /path/to/service/file
# 5. Check SELinux
getenforce
setenforce 0 # switch to permissive mode temporarily for testing; restore with setenforce 1
ausearch -m AVC -ts recent
7. System Log Analysis Techniques
7.1 Efficient Log‑Analysis Commands
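During an incident the first question about an access log is often simply how bad things are right now. One number that answers it is the share of 5xx responses. A minimal sketch, assuming the common combined log format where the status code is field 9; the function name is made up for illustration.

```shell
# Read an access log on stdin and print the percentage of requests that
# returned a 5xx status. Lines whose ninth field is not a three-digit
# status code are ignored.
http_5xx_rate() {
    awk '$9 ~ /^[0-9][0-9][0-9]$/ { total++; if ($9 >= 500) err++ }
         END {
             if (total == 0) { print "0.00"; exit }
             printf "%.2f\n", 100 * err / total
         }'
}

# Live usage over the most recent 10,000 requests:
#   tail -n 10000 access.log | http_5xx_rate
```

Running it over a sliding tail of the log every few seconds gives a rough trend without waiting for a dashboard to refresh.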
# 1. Quickly locate error logs
grep -E "ERROR|WARN|FATAL" /var/log/app.log | tail -100
# 2. Count error frequency (sort before uniq so identical timestamps are grouped)
awk '/ERROR/ {print $1,$2}' app.log | sort | uniq -c | sort -rn
# 3. Analyze access logs – status code distribution
awk '{print $9}' access.log | sort | uniq -c | sort -rn
# 4. List slow requests (>1 s), assuming the last field is the request time in milliseconds
awk '$NF > 1000 {print $7,$NF}' access.log | sort -k2 -rn | head
# 5. Real‑time monitoring
tail -f /var/log/messages | grep --line-buffered ERROR
# 6. Merge compressed logs for analysis
zcat /var/log/app.log*.gz | grep ERROR | less
7.2 Using journalctl Efficiently
# View logs for a specific time range
journalctl --since "2024-01-01 00:00:00" --until "2024-01-01 01:00:00"
# View logs of a specific service
journalctl -u nginx.service -f
# View kernel logs
journalctl -k
# View logs of a specific priority
journalctl -p err..emerg
# Export logs for analysis
journalctl -u service_name -o json > service.json
8. Database Fault Handling
8.1 MySQL Performance Issue Diagnosis
# 1. View slow‑query settings
show variables like 'slow_query%';
show variables like 'long_query_time';
# 2. View currently running SQL
show processlist;
show full processlist;
# 3. Check lock waits
show engine innodb status\G
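# MySQL 5.7 and earlier:
SELECT * FROM information_schema.INNODB_LOCKS;
SELECT * FROM information_schema.INNODB_LOCK_WAITS;
# MySQL 8.0+ (the two tables above were removed):
SELECT * FROM performance_schema.data_locks;
SELECT * FROM performance_schema.data_lock_waits;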
# 4. View table lock status
show open tables where in_use > 0;
# 5. Analyze execution plan
explain select * from table where condition;
Deadlock Handling Example:
# View recent deadlock information
SHOW ENGINE INNODB STATUS\G
# Find transactions holding locks
SELECT * FROM information_schema.INNODB_TRX\G
# Kill the session that owns the offending transaction (its ID is trx_mysql_thread_id in INNODB_TRX)
KILL [processID];
8.2 Redis Fault Handling
# 1. Check Redis status
redis-cli ping
redis-cli info
# 2. View slow queries
redis-cli slowlog get 10
# 3. List client connections
redis-cli client list
# 4. View memory usage
redis-cli info memory
# 5. Emergency cache clear (use with caution)
redis-cli FLUSHDB # clear current DB
redis-cli FLUSHALL # clear all DBs (dangerous)
9. Containerized Environment Fault Diagnosis
9.1 Docker Container Issues
# 1. View container status
docker ps -a
docker inspect [containerID]
# 2. View container logs
docker logs --tail 100 -f [containerID]
# 3. Enter container for debugging
docker exec -it [containerID] /bin/bash
# 4. View container resource usage
docker stats
docker top [containerID]
# 5. Check container networking
docker network ls
docker port [containerID]
9.2 Kubernetes Fault Diagnosis
# 1. View pod status
kubectl get pods -o wide
kubectl describe pod [pod-name]
# 2. View pod logs
kubectl logs -f [pod-name]
kubectl logs -f [pod-name] -c [container-name]
# 3. Enter pod for debugging
kubectl exec -it [pod-name] -- /bin/bash
# 4. View events
kubectl get events --sort-by=.metadata.creationTimestamp
# 5. View resource usage
kubectl top nodes
kubectl top pods
10. Automated Fault‑Handling Scripts
10.1 CPU Monitoring Alert Script
#!/bin/bash
# CPU monitoring alert script
CPU_THRESHOLD=80
LOAD_THRESHOLD=10
# Get CPU usage as 100 minus the idle column of the second vmstat sample
# (assumes the procps layout, where idle is field 15)
CPU_USAGE=$(vmstat 1 2 | awk 'END {print 100 - $15}')
# Read the 1-minute load average directly; parsing uptime leaves a trailing comma
LOAD_AVG=$(cut -d' ' -f1 /proc/loadavg)
# Alert if thresholds exceeded
if (( $(echo "$CPU_USAGE > $CPU_THRESHOLD" | bc -l) )); then
    echo "Warning: High CPU usage: ${CPU_USAGE}%"
    curl -X POST "https://your-webhook-url" \
        -H "Content-Type: application/json" \
        -d "{\"text\":\"CPU alert: usage ${CPU_USAGE}%\"}"
fi
if (( $(echo "$LOAD_AVG > $LOAD_THRESHOLD" | bc -l) )); then
    echo "Warning: High system load: ${LOAD_AVG}"
    ps aux --sort=-%cpu | head -10 > /tmp/high_cpu_processes.log
fi
10.2 Disk Space Auto‑Cleanup Script
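Because the script in this section deletes files unattended, it is worth having a dry run that only lists what would be removed. The sketch below is one way to do that; the function name is hypothetical, and it uses the same find predicates as the cleanup with -print in place of -delete.

```shell
# List (without deleting) regular files under a directory that match a name
# pattern and were last modified more than the given number of days ago.
#   $1 = directory, $2 = name pattern, $3 = age in days
list_cleanup_candidates() {
    find "$1" -type f -name "$2" -mtime +"$3" -print
}

# Usage:
#   list_cleanup_candidates /var/log '*.log' 7
#   list_cleanup_candidates /var/log '*.gz' 7
```

Reviewing this list once on each class of host before scheduling the cleanup catches surprises such as application data that happens to end in .log.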
#!/bin/bash
# Disk space auto‑cleanup script
THRESHOLD=80
LOG_DIR="/var/log"
DAYS_TO_KEEP=7
# Check disk usage (df -P keeps each filesystem on one line)
USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "Disk usage exceeds ${THRESHOLD}%, starting cleanup..."
    # Clean old logs
    find "$LOG_DIR" -type f -name "*.log" -mtime +"$DAYS_TO_KEEP" -delete
    find "$LOG_DIR" -type f -name "*.gz" -mtime +"$DAYS_TO_KEEP" -delete
    # Clean temporary files
    find /tmp -type f -mtime +7 -delete
    find /var/tmp -type f -mtime +7 -delete
    # Clean package manager cache
    yum clean all 2>/dev/null || apt-get clean 2>/dev/null
    # Vacuum journal logs older than 7 days
    journalctl --vacuum-time=7d
    echo "Cleanup complete, current disk usage: $(df -P / | awk 'NR==2 {print $5}')"
fi
11. Fault Prevention and Monitoring Construction
11.1 Building a Complete Monitoring System
Key Monitoring Indicators:
System Level
CPU usage, Load Average
Memory usage, Swap usage
Disk I/O, Disk utilization
Network traffic, Connection count
Application Level
Response time (P50/P90/P99)
Error rate, success rate
QPS/TPS
Business‑specific metrics
Alert Strategy
# Prometheus alert rule example
# (avg by (instance) evaluates each host separately instead of averaging the whole fleet)
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage too high"
          description: "CPU usage exceeds 80% for more than 5 minutes"
11.2 Fault Drills and Playbooks
Regular Fault Drills:
Simulate CPU‑max scenarios.
Simulate memory leaks.
Simulate disk failures.
Simulate network partitions.
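The first of these drills needs no extra tooling. A minimal sketch of a time-boxed CPU burn follows, assuming coreutils `timeout` is available; the function name and defaults are illustrative, and it should only be run on hosts set aside for the drill.

```shell
# Saturate CPU with busy loops for a fixed time, then clean up.
#   $1 = duration in seconds (default 10), $2 = number of workers (default 1)
burn_cpu() {
    seconds=${1:-10}
    workers=${2:-1}
    i=0
    while [ "$i" -lt "$workers" ]; do
        # Each worker spins until timeout kills it.
        timeout "$seconds" sh -c 'while :; do :; done' &
        i=$((i + 1))
    done
    wait    # returns once every worker has been stopped
}

# Usage: load every core for one minute and watch whether alerts fire.
#   burn_cpu 60 "$(nproc)"
```

Bounding the burn with timeout means a forgotten terminal cannot leave the load running after the drill ends.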
Playbook Template:
## Fault Playbook Template
### 1. Fault Type: [Specific fault]
### 2. Fault Level: [P0/P1/P2/P3]
### 3. Impact Scope: [Affected services and users]
### 4. Handling Process:
1. Step one: [Action]
2. Step two: [Action]
3. Step three: [Action]
### 5. Rollback Plan: [How to rollback]
### 6. Owner: [Contact information]
12. Performance Optimization Best Practices
12.1 Kernel Parameter Tuning
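Before changing any of the settings in this section, it is worth checking what the kernel is currently running with, so that a change can be reviewed and rolled back. The sketch below compares desired `key = value` lines with the live values under /proc/sys. The function name and the SYSCTL_ROOT override are hypothetical; the override exists only so the function can be exercised against a fake tree.

```shell
# Read desired "key = value" lines on stdin and print every key whose live
# value differs. Comment lines and blank lines are skipped.
sysctl_drift() {
    root=${SYSCTL_ROOT:-/proc/sys}
    while IFS='=' read -r key want; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        want=$(printf '%s' "$want" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        case $key in ''|'#'*) continue ;; esac
        path=$root/$(printf '%s' "$key" | tr . /)
        if [ ! -r "$path" ]; then
            echo "$key: not present on this kernel"
            continue
        fi
        have=$(tr '\t' ' ' < "$path")   # multi-value entries are tab-separated
        [ "$have" = "$want" ] || echo "$key: current=$have desired=$want"
    done
}

# Live usage:
#   sysctl_drift < /etc/sysctl.conf
```

Empty output means the running kernel already matches the file, so `sysctl -p` would change nothing.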
# /etc/sysctl.conf optimizations
# Network
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tw_reuse = 1
# Memory
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
# Filesystem
fs.file-max = 2097152
fs.nr_open = 2097152
12.2 Application‑Level Optimization Suggestions
Database Optimization
Use indexes wisely.
Avoid N+1 queries.
Employ connection pools.
Implement read/write splitting.
Cache Strategy
Multi‑level cache architecture.
Cache pre‑warming.
Cache update policies.
Prevent cache avalanche.
Code Optimization
Asynchronous processing.
Batch operations.
Connection reuse.
Resource pooling.
13. Operations Tools Recommendation
13.1 Essential Command‑Line Tools
htop – friendly top.
iotop – I/O monitor.
iftop – network traffic monitor.
sysstat – system statistics suite.
dstat – comprehensive monitoring.
strace – system‑call tracing.
ltrace – library‑call tracing.
perf – performance analysis.
mtr – network diagnosis.
nmap – port scanning.
tcpdump – packet capture.
ss – socket statistics.
13.2 Monitoring Platform Recommendations
Open‑Source Solutions
Prometheus + Grafana
Zabbix
ELK Stack
OpenFalcon
Commercial Solutions
Datadog
New Relic
Alibaba Cloud ARMS
Tencent Cloud Monitor
14. Real‑World Fault Cases
Case 1: Double‑11 E‑Commerce Platform Outage
Background: Traffic surged tenfold during Double‑11, causing system collapse.
Symptoms:
Users cannot place orders.
Page loads time out.
Database connection pool exhausted.
Investigation:
Detected database connections hitting the limit.
Slow‑query logs showed many full‑table scans.
Identified hotspot queries lacking indexes.
Added missing indexes; connection count returned to normal.
Lessons Learned:
Load testing must reflect real‑world scenarios.
Implement SQL review mechanisms.
Set reasonable monitoring thresholds.
Case 2: Redis Cache Avalanche Causing Service Collapse
Failure Process:
00:00 – A Redis node crashes
00:01 – Massive requests fall back to the database
00:03 – Database CPU spikes to 100%
00:05 – Entire service becomes unavailable
Resolution:
Urgently scale the database.
Apply traffic throttling and degradation.
Repair the failed Redis node.
Improve cache strategy with random expiration times.
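Random expiration times keep keys that were written together from all expiring in the same second. A minimal sketch of the idea follows, assuming /dev/urandom is available and the jitter window is at most 65536 seconds; the function name, key name, and TTL values are made up for illustration.

```shell
# Print a TTL in seconds: the base value plus a random offset in [0, jitter).
#   $1 = base TTL, $2 = jitter window (1..65536)
jittered_ttl() {
    base=$1
    jitter=$2
    # Two random bytes as an unsigned integer in 0..65535.
    r=$(od -An -N2 -t u2 /dev/urandom | tr -d ' \n')
    echo $((base + r % jitter))
}

# Usage with a hypothetical key: expire after 60 to 65 minutes.
#   redis-cli SET "product:42" "$payload" EX "$(jittered_ttl 3600 300)"
```

A jitter window of roughly 5 to 10 percent of the base TTL is usually enough to spread the reload traffic; the right value depends on how long a cold reload takes.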
15. Continuous Learning and Growth Advice
15.1 Technical Growth Roadmap
Foundation Stage (0‑2 years)
Master Linux basic commands.
Understand system fundamentals.
Learn basic fault handling.
Intermediate Stage (2‑5 years)
Dive deep into kernel internals.
Master performance tuning.
Build a monitoring system.
Advanced Stage (5+ years)
Architectural design capability.
Automation and IaC.
Design disaster‑recovery solutions.
15.2 Recommended Learning Resources
Must‑Read Books
"Understanding the Linux Kernel"
"The Performance Handbook"
"Site Reliability Engineering: How Google Runs Production Systems"
"Linux Performance Tuning"
Online Resources
Linux Performance (Brendan Gregg’s blog)
High Scalability
SREWeekly Newsletter
MaGe Linux Operations