
Master Linux Outages: Proven Troubleshooting Strategies & Solutions for Common Ops Failures

This comprehensive guide walks you through systematic troubleshooting methods for frequent Linux operational incidents—covering CPU spikes, memory leaks, disk I/O bottlenecks, network glitches, database issues, container problems, and proactive monitoring—so you can quickly pinpoint root causes and restore services.

MaGe Linux Operations

Learn from Linux Failures: A Comprehensive Collection of Common Ops Fault Diagnosis and Solutions

It is 3 a.m., the phone rings, and the monitoring alarm shows a service outage. Every operations engineer knows this scenario. This article shares the troubleshooting mindset and solutions distilled from thousands of production incidents, helping you quickly locate problems and restore services during critical moments.

Introduction: Why Troubleshooting Ability Determines Your Ops Ceiling

What separates excellent operations engineers from average ones is not the number of tools they know, but how clearly and efficiently they reason when a fault occurs.

This article uses real cases to systematically share the most common Linux operational fault scenarios and their troubleshooting methods. Whether you are a newcomer or a veteran, you will find valuable insights.

1. Build a Systematic Fault‑Diagnosis Thinking Model

1.1 Golden Diagnosis Rule: the STEP Model

Before handling any incident, remember this model:

S (Symptom): accurately describe the observed issue.

T (Time): determine when the fault occurred.

E (Environment): understand the system environment and recent changes.

P (Problem): locate the root cause and resolve it.

1.2 Fault‑Priority Determination Matrix

Urgency × Impact = Handling Priority

P0: Core business completely down (immediate handling)
P1: Core business partially down (handle within 15 minutes)
P2: Non‑core impact (handle within 1 hour)
P3: Isolated user impact (handle in scheduled window)
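
The matrix above can be encoded as a small lookup so that alert tooling and humans use the same levels. This is a minimal sketch: the function name and the scope/extent labels (`core`, `noncore`, `isolated`, `full`, `partial`) are assumptions made for illustration, not terms from the article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an incident's scope to the P0-P3 levels listed above.
# scope:  core | noncore | isolated   (assumed labels)
# extent: full | partial              (only meaningful for core services)
incident_priority() {
  local scope=$1 extent=${2:-partial}
  case "$scope:$extent" in
    core:full)    echo "P0 handle immediately" ;;
    core:partial) echo "P1 handle within 15 minutes" ;;
    noncore:*)    echo "P2 handle within 1 hour" ;;
    isolated:*)   echo "P3 handle in scheduled window" ;;
    *)            echo "unknown scope: $scope" >&2; return 1 ;;
  esac
}

# Usage:
#   incident_priority core full
```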

2. CPU‑Related Fault Diagnosis in Practice

2.1 Diagnosing 100% CPU Usage

Symptom: System response is slow, and top shows CPU usage constantly at 100%.

Investigation Steps:

# 1. Find the process with the highest CPU consumption
 top -c
# Press 'P' to sort by CPU and note the PID

# 2. View thread‑level CPU usage for the process
 top -Hp [PID]

# 3. Get stack trace of the problematic thread
 # Convert thread ID to hexadecimal
 printf "%x\n" [ThreadID]

# 4. If it is a Java process, print the thread stack
 jstack [PID] | grep -A 20 [hexThreadID]

# 5. Use perf to analyze CPU hotspots
 perf top -p [PID]
 perf record -p [PID] -g -- sleep 30
 perf report
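
Steps 3 and 4 can be combined into one helper. The sketch below assumes the JDK prints native thread IDs in hexadecimal as `nid=0x...`, which is what step 4 implies; check the format of your own thread dump before relying on it. The function name is made up for illustration.

```shell
# Convert a decimal Linux thread ID (the PID column of `top -Hp`) into the
# nid=0x... token that jstack is assumed to print for the same thread.
tid_to_nid() {
  printf 'nid=0x%x\n' "$1"
}

# Usage sketch (PID and TID are placeholders for values read from top):
#   jstack "$PID" | grep -A 20 "$(tid_to_nid "$TID")"
```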

Real‑World Case: During a promotion period, an e‑commerce platform experienced a CPU spike.

Top showed a Java process consuming 800% CPU (8 cores fully loaded). jstack revealed many threads blocked on a synchronized method.

Code review found that log writing used a synchronized lock.

Solution: switch to an asynchronous logging framework; CPU dropped to 30% immediately.

Takeaways:

Common causes of high CPU: infinite loops, regex backtracking, frequent GC, lock contention.

Adopt tools like perf and flame graphs.

Establish CPU usage baselines to detect anomalies early.

2.2 Persistent High Load Average

Symptom: Load Average stays above 30 while CPU usage is only 50%.

# 1. Check load and process status
 uptime
 vmstat 1 5

# 2. Analyze process state distribution
 ps aux | awk '{print $8}' | sort | uniq -c

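# 3. Find processes in D (uninterruptible sleep) state
 # Match on the STAT column; grepping for " D " misses states such as D+ or Ds
 ps -eo pid,stat,wchan:32,cmd | awk 'NR == 1 || $2 ~ /^D/'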

# 4. Analyze I/O wait
 iostat -x 1 5
 iotop -o

# 5. Inspect system calls
 strace -c -p [PID]

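Root Cause Analysis: On Linux, the load average counts tasks in uninterruptible sleep (D state) as well as runnable tasks, so high load with low CPU usually indicates I/O wait or lock wait.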
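
To see at a glance whether D-state tasks dominate, the process states from step 2 can be grouped by their first letter. This is a minimal sketch; the function name is an assumption, and it only reads text on stdin so it can be tried on saved `ps` output as well.

```shell
# Summarize process states by their first letter (R, S, D, Z, ...).
# Reads one STAT value per line on stdin, e.g. from `ps -eo stat=`.
state_histogram() {
  awk '{ count[substr($1, 1, 1)]++ }
       END { for (s in count) print s, count[s] }' | sort
}

# Usage on a live system:
#   ps -eo stat= | state_histogram
```

A large D count next to a small R count points toward storage or lock waits rather than CPU saturation.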

3. Deep Dive into Memory Faults

3.1 Full Process for Memory Leak Diagnosis

Symptom: Memory usage keeps growing until an OOM kill occurs.

# 1. View memory usage trends
 free -h
 cat /proc/meminfo

# 2. Find the process consuming the most memory
 ps aux --sort=-%mem | head

# 3. Inspect process memory mapping
 pmap -x [PID]
 cat /proc/[PID]/status | grep -i vm

# 4. Memory leak detection for C/C++ programs
 valgrind --leak-check=full --show-leak-kinds=all ./program

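# 5. Java memory analysis
 jmap -heap [PID]            # JDK 8 only; on JDK 9+ use: jhsdb jmap --heap --pid [PID]
 jmap -histo:live [PID]      # note: forces a full GC before counting
 jmap -dump:format=b,file=heap.bin [PID]
 # Analyze the dump with MAT (jhat was removed in JDK 9)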
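
A leak shows up as resident memory that keeps climbing between samples. The sketch below parses `VmRSS` from the text of `/proc/[PID]/status` and turns two samples into a growth rate; both function names are assumptions for illustration.

```shell
# Extract VmRSS (resident memory, in kB) from the text of /proc/<pid>/status.
rss_kb() {
  awk '$1 == "VmRSS:" { print $2 }'
}

# Growth between two samples in kB per minute (integer arithmetic).
# Arguments: first_kb second_kb seconds_between_samples
rss_growth_per_min() {
  echo $(( ($2 - $1) * 60 / $3 ))
}

# Usage sketch (PID is a placeholder):
#   a=$(rss_kb < /proc/$PID/status); sleep 300
#   b=$(rss_kb < /proc/$PID/status); rss_growth_per_min "$a" "$b" 300
```

A single positive reading proves little; it is the trend over many samples under steady load that suggests a leak.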

Real‑World Case: Abnormal Redis memory growth.

# 1. Check Redis memory info
 redis-cli info memory

# 2. Analyze large keys
 redis-cli --bigkeys

# 3. Sample keys by memory usage (needs a redis-cli version that supports --memkeys)
 redis-cli --memkeys

# 4. Problem discovered: a hash key with 10 million fields

# 5. Solution: split the large key and set reasonable expiration times
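
One way to split a large hash is to spread its fields over a fixed number of smaller hashes. The sketch below is only an illustration of that idea: the `<base>:<bucket>` key layout, the bucket count, and the function name are all assumptions, and the article does not say how the original key was split.

```shell
# Map a field of one oversized hash to one of N smaller hashes.
# The "<base>:<bucket>" key layout is an assumed naming scheme.
shard_key() {
  local base=$1 field=$2 buckets=$3 crc
  # cksum prints "<crc> <size>" for stdin; keep only the CRC
  crc=$(printf '%s' "$field" | cksum | cut -d' ' -f1)
  echo "${base}:$(( crc % buckets ))"
}

# Usage sketch (key, field and value are placeholders):
#   redis-cli HSET "$(shard_key user:profile 12345 64)" 12345 "$value"
#   redis-cli EXPIRE "$(shard_key user:profile 12345 64)" 86400
```

The same field always maps to the same bucket, so reads can locate it without a lookup table.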

3.2 Cache Hit‑Rate Optimization

Metrics:

# View system cache status
 free -h
 cat /proc/meminfo | grep -E "Cached|Buffer"

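# View paging activity (sar -B reports paging and fault rates, not a hit ratio)
 sar -B 1 10

# Clear cache (use with caution in production)
 sync   # write dirty pages back first so more of the cache can be dropped
 echo 1 > /proc/sys/vm/drop_caches   # page cache
 echo 2 > /proc/sys/vm/drop_caches   # dentries and inodes
 echo 3 > /proc/sys/vm/drop_caches   # all caches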

4. Disk I/O Fault Handling

4.1 I/O Performance Bottleneck Identification

Symptom: Database response slows, I/O wait stays high.

# 1. View disk I/O statistics
 iostat -x 1 10
 # Focus on %util, await, r/s, w/s

# 2. Find processes performing I/O
 iotop -o -P

# 3. Trace specific I/O operations
 blktrace -d /dev/sda -o trace
 blkparse trace.* | head -n 100

# 4. View filesystem cache
 slabtop

# 5. Analyze process I/O patterns
 pidstat -d 1 10
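
When many disks are attached, scanning `iostat -x` output by eye is slow. The sketch below filters for devices whose `%util` exceeds a threshold; the function name is an assumption, and the column is located by header name because its position differs between sysstat versions.

```shell
# Print devices whose %util exceeds a threshold, reading `iostat -x` output
# on stdin. Argument: threshold in percent.
busy_devices() {
  awk -v limit="$1" '
    /^avg-cpu/     { col = 0 }   # CPU section: stop matching until the next header
    $1 ~ /^Device/ { col = 0
                     for (i = 1; i <= NF; i++) if ($i == "%util") col = i
                     next }
    col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
  '
}

# Usage on a live system:
#   iostat -dx 1 3 | busy_devices 80
```

High `%util` alone is a weak signal on SSDs and RAID volumes that serve requests in parallel, so read it together with `await`.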

4.2 Quick Disk Space Cleanup

# 1. Find large directories quickly (sort -h orders human-readable sizes; -x stays on one filesystem)
 du -xh / 2>/dev/null | grep -E '^[0-9.]+G' | sort -rh | head -n 20

# 2. Find recently modified large files
 find / -type f -mtime -1 -size +100M 2>/dev/null

# 3. Find deleted but still‑held files
 lsof | grep deleted

# 4. Check inode usage
 df -i

# 5. Find directories with most inode consumption
 for i in /*; do printf '%s %s\n' "$(find "$i" -xdev -type f 2>/dev/null | wc -l)" "$i"; done | sort -rn | head

Classic Case: Log file deleted but space not reclaimed because a process still writes to the deleted file.
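
For a single suspect process, the same information is available from `/proc` without scanning every open file on the host. This is a minimal sketch assuming a Linux `/proc` filesystem and permission to read the target's `fd` directory; the function name is made up for illustration.

```shell
# List file descriptors of a process that still point at deleted files.
# Prints "<fd> <path> (deleted)" for each match.
deleted_fds() {
  local pid=$1 fd target
  for fd in /proc/"$pid"/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case $target in
      *' (deleted)') printf '%s %s\n' "${fd##*/}" "$target" ;;
    esac
  done
}

# Usage sketch (PID and FD are placeholders taken from the output):
#   deleted_fds "$PID"
#   : > /proc/$PID/fd/$FD    # truncate in place to release the blocks
```

Truncating through `/proc` frees the space without a restart, but a writer that did not open the file in append mode keeps its old offset, so the file may reappear as sparse.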

# Problem: rm removed a large log, but df shows no space freed
# Reason: a process is still writing to the deleted file

# Solution 1: locate the holding process, then restart it or make it reopen its logs
 lsof | grep deleted
 kill -USR1 [nginx_pid]   # nginx reopens its log files on USR1

# Solution 2: next time, truncate the file instead of deleting it
 > /var/log/large.log   # recommended

5. Network Fault Diagnosis Techniques

5.1 Network Connection Fault Localization

# 1. Check connectivity
 ping -c 4 targetIP
 traceroute targetIP
 mtr targetIP

# 2. Check DNS resolution
 nslookup domain.com
 dig +trace domain.com

# 3. Check port connectivity
 telnet IP PORT
 nc -zv IP PORT

# 4. View connection state
 ss -antp
 netstat -antp | awk '{print $6}' | sort | uniq -c

# 5. Capture packets for analysis
 tcpdump -i eth0 -nn -s0 -v port 80
 tcpdump -i eth0 -w capture.pcap   # analyze later with Wireshark

5.2 Excessive TIME_WAIT Connections

Symptom: A large number of TIME_WAIT sockets exhausts the local port range.

# View TIME_WAIT count
 ss -ant | grep TIME-WAIT | wc -l

# Tune kernel parameters
# tcp_tw_recycle is deliberately left out: it breaks clients behind NAT and
# was removed in Linux 4.12, so setting it fails on current kernels.
 cat >> /etc/sysctl.conf <<EOF
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_fin_timeout = 30
EOF

sysctl -p
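
Before tuning, it helps to know how close the host actually is to running out of ports. The sketch below expresses the TIME_WAIT count as a share of the ephemeral port range. The function name is an assumption, and the result is only a rough upper bound: ports are consumed per destination address and port, so the figure is meaningful mainly when most connections go to one backend.

```shell
# Rough share (in percent) of the ephemeral port range held by TIME_WAIT sockets.
# Arguments: low_port high_port time_wait_count
tw_port_share() {
  local low=$1 high=$2 count=$3
  echo $(( count * 100 / (high - low + 1) ))
}

# Usage on a live system:
#   read -r low high < /proc/sys/net/ipv4/ip_local_port_range
#   tw_port_share "$low" "$high" "$(ss -Htan state time-wait | wc -l)"
```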

5.3 Network Performance Optimization

# 1. View traffic
 iftop -i eth0
 nethogs

# 2. View errors and packet loss
 ip -s link show eth0
 ethtool -S eth0

# 3. Optimize socket buffers
 echo 'net.core.rmem_max = 134217728' >> /etc/sysctl.conf
 echo 'net.core.wmem_max = 134217728' >> /etc/sysctl.conf
 echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' >> /etc/sysctl.conf
 echo 'net.ipv4.tcp_wmem = 4096 65536 134217728' >> /etc/sysctl.conf

6. Process and Service Fault Handling

6.1 Zombie Process Handling

# Find zombie processes
 ps aux | grep defunct
 ps aux | awk '$8 ~ /^Z/ { print }'

# Find parent PIDs of zombies
 ps -ef | grep defunct | grep -v grep | awk '{print $3}' | sort | uniq

# Remedy
 # 1. Send SIGCHLD to the parent
 kill -SIGCHLD [parentPID]
 # 2. If ineffective, restart the parent service
 systemctl restart [serviceName]

6.2 Service Startup Failure Diagnosis

# 1. Check service status and logs
 systemctl status service_name
 journalctl -u service_name -n 50

# 2. Check port occupation
 ss -tlnp | grep :port
 lsof -i :port

# 3. Verify configuration syntax
 nginx -t
 httpd -t
 mysqld --validate-config   # MySQL 8.0.16+
 mysql --help --verbose | grep -A 1 "Default options"   # lists the option files read; does not validate them

# 4. Check permissions
 ls -la /path/to/service
 namei -l /path/to/service/file

# 5. Check SELinux
 getenforce
 ausearch -m AVC -ts recent   # look for recent denials first
 setenforce 0   # switch to permissive for testing only; restore with setenforce 1

7. System Log Analysis Techniques

7.1 Efficient Log‑Analysis Commands

# 1. Quickly locate error logs
 grep -E "ERROR|WARN|FATAL" /var/log/app.log | tail -100

# 2. Count error frequency (sort first: uniq -c only merges adjacent lines)
 awk '/ERROR/ {print $1,$2}' app.log | sort | uniq -c | sort -rn

# 3. Analyze access logs – status code distribution
 awk '{print $9}' access.log | sort | uniq -c | sort -rn

# 4. List slow requests (>1 s; assumes the last field is the request time in milliseconds)
 awk '$NF > 1000 {print $7,$NF}' access.log | sort -k2 -rn | head

# 5. Real‑time monitoring
 tail -f /var/log/messages | grep --line-buffered ERROR

# 6. Merge compressed logs for analysis
 zcat /var/log/app.log*.gz | grep ERROR | less
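
Counting errors per minute rather than per second makes bursts easier to spot. The sketch below assumes each log line starts with `YYYY-MM-DD HH:MM:SS`, which is an assumed layout and not something the article specifies; the function name is also made up for illustration.

```shell
# Count ERROR lines per minute, busiest minute first.
# Assumes lines start with "YYYY-MM-DD HH:MM:SS" (assumed log layout).
errors_per_minute() {
  awk '/ERROR/ { print $1, substr($2, 1, 5) }' |
    sort | uniq -c | sort -rn |
    awk '{ print $1, $2, $3 }'   # strip the padding uniq -c adds
}

# Usage sketch:
#   errors_per_minute < /var/log/app.log | head
#   zcat /var/log/app.log*.gz | errors_per_minute | head
```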

7.2 Using journalctl Efficiently

# View logs for a specific time range
 journalctl --since "2024-01-01 00:00:00" --until "2024-01-01 01:00:00"

# View logs of a specific service
 journalctl -u nginx.service -f

# View kernel logs
 journalctl -k

# View logs of a specific priority
 journalctl -p err..emerg

# Export logs for analysis
 journalctl -u service_name -o json > service.json

8. Database Fault Handling

8.1 MySQL Performance Issue Diagnosis

# 1. View slow‑query settings
 show variables like 'slow_query%';
 show variables like 'long_query_time';

# 2. View currently running SQL
 show processlist;
 show full processlist;

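# 3. Check lock waits
 show engine innodb status\G
 # MySQL 5.7 and earlier:
 SELECT * FROM information_schema.INNODB_LOCKS;
 SELECT * FROM information_schema.INNODB_LOCK_WAITS;
 # MySQL 8.0+ (the two tables above were removed):
 SELECT * FROM performance_schema.data_locks;
 SELECT * FROM performance_schema.data_lock_waits;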

# 4. View table lock status
 show open tables where in_use > 0;

# 5. Analyze execution plan
 explain select * from table where condition;

Deadlock Handling Example:

# View recent deadlock information
 SHOW ENGINE INNODB STATUS\G

# Find transactions holding locks
 SELECT * FROM information_schema.INNODB_TRX\G

# Kill the offending transaction (use the trx_mysql_thread_id value from INNODB_TRX)
 KILL [processID];

8.2 Redis Fault Handling

# 1. Check Redis status
 redis-cli ping
 redis-cli info

# 2. View slow queries
 redis-cli slowlog get 10

# 3. List client connections
 redis-cli client list

# 4. View memory usage
 redis-cli info memory

# 5. Emergency cache clear (use with caution)
 redis-cli FLUSHDB   # clear current DB
 redis-cli FLUSHALL  # clear all DBs (dangerous)

9. Containerized Environment Fault Diagnosis

9.1 Docker Container Issues

# 1. View container status
 docker ps -a
 docker inspect [containerID]

# 2. View container logs
 docker logs --tail 100 -f [containerID]

# 3. Enter container for debugging (use /bin/sh if the image ships without bash)
 docker exec -it [containerID] /bin/bash

# 4. View container resource usage
 docker stats
 docker top [containerID]

# 5. Check container networking
 docker network ls
 docker port [containerID]

9.2 Kubernetes Fault Diagnosis

# 1. View pod status
 kubectl get pods -o wide
 kubectl describe pod [pod-name]

# 2. View pod logs
 kubectl logs -f [pod-name]
 kubectl logs -f [pod-name] -c [container-name]

# 3. Enter pod for debugging (use /bin/sh if the image ships without bash)
 kubectl exec -it [pod-name] -- /bin/bash

# 4. View events
 kubectl get events --sort-by=.metadata.creationTimestamp

# 5. View resource usage
 kubectl top nodes
 kubectl top pods

10. Automated Fault‑Handling Scripts

10.1 CPU Monitoring Alert Script

#!/bin/bash
# CPU monitoring alert script

CPU_THRESHOLD=80
LOAD_THRESHOLD=10

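# Get CPU usage as 100 - idle (the second field of top's Cpu(s) line is user time only)
CPU_USAGE=$(LC_ALL=C top -bn1 | awk -F',' '/Cpu\(s\)/ { for (i = 1; i <= NF; i++) if ($i ~ /id/) { gsub(/[^0-9.]/, "", $i); print 100 - $i } }')
# Read the 1-minute load from /proc/loadavg (uptime's value carries a trailing comma that breaks bc)
LOAD_AVG=$(cut -d' ' -f1 /proc/loadavg)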

# Alert if thresholds exceeded
if (( $(echo "$CPU_USAGE > $CPU_THRESHOLD" | bc -l) )); then
  echo "Warning: High CPU usage: ${CPU_USAGE}%"
  curl -X POST "https://your-webhook-url" \
       -H "Content-Type: application/json" \
       -d "{\"text\":\"CPU alert: usage ${CPU_USAGE}%\"}"
fi

if (( $(echo "$LOAD_AVG > $LOAD_THRESHOLD" | bc -l) )); then
  echo "Warning: High system load: ${LOAD_AVG}"
  ps aux --sort=-%cpu | head -10 > /tmp/high_cpu_processes.log
fi
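
Parsing `top` output is fragile because its layout varies between versions. An alternative is to sample the aggregate `cpu` line of `/proc/stat` twice and compute the busy share from the difference. This is a sketch of that approach, not the article's script; the function name is an assumption, and idle time is taken as idle plus iowait.

```shell
# Busy CPU percentage between two "cpu ..." lines taken from /proc/stat.
# Fields 2-9 are user, nice, system, idle, iowait, irq, softirq, steal.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i
      idle = $5 + $6
      if (NR == 1) { t0 = total; i0 = idle } else { t1 = total; i1 = idle }
    }
    END {
      dt = t1 - t0
      if (dt <= 0) { print 0; exit }
      printf "%d\n", 100 * (dt - (i1 - i0)) / dt
    }'
}

# Usage on a live system:
#   a=$(head -n1 /proc/stat); sleep 1; b=$(head -n1 /proc/stat)
#   cpu_busy_pct "$a" "$b"
```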

10.2 Disk Space Auto‑Cleanup Script

#!/bin/bash
# Disk space auto‑cleanup script

THRESHOLD=80
LOG_DIR="/var/log"
DAYS_TO_KEEP=7

# Check disk usage (-P keeps each filesystem on one line so field 5 is stable)
USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')

if [ "$USAGE" -gt "$THRESHOLD" ]; then
  echo "Disk usage exceeds ${THRESHOLD}%, starting cleanup..."

  # Clean old logs
  find "$LOG_DIR" -type f -name "*.log" -mtime +"$DAYS_TO_KEEP" -delete
  find "$LOG_DIR" -type f -name "*.gz" -mtime +"$DAYS_TO_KEEP" -delete

  # Clean temporary files
  find /tmp -type f -mtime +7 -delete
  find /var/tmp -type f -mtime +7 -delete

  # Clean package manager cache
  yum clean all 2>/dev/null || apt-get clean 2>/dev/null

  # Vacuum journal logs older than 7 days
  journalctl --vacuum-time=7d

  echo "Cleanup complete, current disk usage: $(df -h / | awk 'NR==2 {print $5}')"
fi
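
Because the cleanup script deletes without asking, it is worth previewing what each `find` would match before enabling it. The sketch below lists candidates only; the function name is an assumption for illustration.

```shell
# List (not delete) regular files under a directory that match a name
# pattern and were last modified more than N days ago.
# Arguments: directory name_pattern days
old_files() {
  find "$1" -type f -name "$2" -mtime +"$3" -print
}

# Usage sketch: review the list, then rerun the same find with -delete.
#   old_files /var/log '*.log' 7
```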

11. Fault Prevention and Monitoring Construction

11.1 Building a Complete Monitoring System

Key Monitoring Indicators:

System Level

CPU usage, Load Average

Memory usage, Swap usage

Disk I/O, Disk utilization

Network traffic, Connection count

Application Level

Response time (P50/P90/P99)

Error rate, success rate

QPS/TPS

Business‑specific metrics

Alert Strategy

# Prometheus alert rule example
 groups:
 - name: system_alerts
   rules:
   - alert: HighCPUUsage
     expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
     for: 5m
     labels:
       severity: warning
     annotations:
       summary: "CPU usage too high"
       description: "CPU usage exceeds 80% for more than 5 minutes"

11.2 Fault Drills and Playbooks

Regular Fault Drills:

Simulate CPU‑max scenarios.

Simulate memory leaks.

Simulate disk failures.

Simulate network partitions.

Playbook Template:

## Fault Playbook Template
### 1. Fault Type: [Specific fault]
### 2. Fault Level: [P0/P1/P2/P3]
### 3. Impact Scope: [Affected services and users]
### 4. Handling Process:
  1. Step one: [Action]
  2. Step two: [Action]
  3. Step three: [Action]
### 5. Rollback Plan: [How to rollback]
### 6. Owner: [Contact information]

12. Performance Optimization Best Practices

12.1 Kernel Parameter Tuning

# /etc/sysctl.conf optimizations
# Network
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tw_reuse = 1

# Memory
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

# Filesystem
fs.file-max = 2097152
fs.nr_open = 2097152
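
Settings written to `/etc/sysctl.conf` only take effect after they are loaded, and keys that the running kernel does not know are easy to miss. The sketch below compares a list of `key = value` lines with the live values under `/proc/sys`. Both function names are assumptions, and the simple dot-to-slash translation does not handle keys whose interface names contain dots.

```shell
# Translate a sysctl key into its /proc/sys path.
sysctl_path() {
  echo "/proc/sys/$(printf '%s' "$1" | tr . /)"
}

# Compare "key = value" lines on stdin with the running kernel and report
# keys that are missing or differ. Comment and blank lines are skipped.
sysctl_drift() {
  local key value path current
  while IFS='=' read -r key value; do
    key=$(printf '%s' "$key" | tr -d '[:space:]')
    case $key in ''|'#'*|';'*) continue ;; esac
    value=$(printf '%s' "$value" | awk '{ $1 = $1; print }')   # normalize spacing
    path=$(sysctl_path "$key")
    if [ ! -r "$path" ]; then echo "missing $key"; continue; fi
    current=$(awk '{ $1 = $1; print }' "$path")
    [ "$current" = "$value" ] || echo "differs $key: want $value, have $current"
  done
}

# Usage sketch:
#   sysctl_drift < /etc/sysctl.conf
```

No output means every listed key exists and already holds the wanted value.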

12.2 Application‑Level Optimization Suggestions

Database Optimization

Use indexes wisely.

Avoid N+1 queries.

Employ connection pools.

Implement read/write splitting.

Cache Strategy

Multi‑level cache architecture.

Cache pre‑warming.

Cache update policies.

Prevent cache avalanche.

Code Optimization

Asynchronous processing.

Batch operations.

Connection reuse.

Resource pooling.

13. Operations Tools Recommendation

13.1 Essential Command‑Line Tools

htop – friendly top.

iotop – I/O monitor.

iftop – network traffic monitor.

sysstat – system statistics suite.

dstat – comprehensive monitoring.

strace – system‑call tracing.

ltrace – library‑call tracing.

perf – performance analysis.

mtr – network diagnosis.

nmap – port scanning.

tcpdump – packet capture.

ss – socket statistics.

13.2 Monitoring Platform Recommendations

Open‑Source Solutions

Prometheus + Grafana

Zabbix

ELK Stack

OpenFalcon

Commercial Solutions

Datadog

New Relic

Alibaba Cloud ARMS

Tencent Cloud Monitor

14. Real‑World Fault Cases

Case 1: Double‑11 E‑Commerce Platform Outage

Background: Traffic surged tenfold during Double‑11, causing system collapse.

Symptoms:

Users cannot place orders.

Page loads time out.

Database connection pool exhausted.

Investigation:

Detected database connections hitting the limit.

Slow‑query logs showed many full‑table scans.

Identified hotspot queries lacking indexes.

Added missing indexes; connection count returned to normal.

Lessons Learned:

Load testing must reflect real‑world scenarios.

Implement SQL review mechanisms.

Set reasonable monitoring thresholds.

Case 2: Redis Cache Avalanche Causing Service Collapse

Failure Process:

00:00 – A Redis node crashes
00:01 – Massive requests fall back to the database
00:03 – Database CPU spikes to 100%
00:05 – Entire service becomes unavailable

Resolution:

Urgently scale the database.

Apply traffic throttling and degradation.

Repair the failed Redis node.

Improve cache strategy with random expiration times.
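
Random expiration means adding a per-key offset to the base TTL so that keys written together do not all expire in the same second. A minimal sketch, assuming bash (for `$RANDOM`) and made-up values for the base TTL and spread:

```shell
#!/usr/bin/env bash
# Return a TTL of base seconds plus a random offset in [0, spread].
# Uses bash's $RANDOM (0..32767), so spread should stay below 32768.
jittered_ttl() {
  local base=$1 spread=$2
  echo $(( base + RANDOM % (spread + 1) ))
}

# Usage sketch (key and value are placeholders):
#   redis-cli SET "$key" "$value" EX "$(jittered_ttl 3600 600)"
```

Jitter spreads out expirations but does not help when a whole node is lost, which is why the resolution above also lists throttling and degradation.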

15. Continuous Learning and Growth Advice

15.1 Technical Growth Roadmap

Foundation Stage (0‑2 years)

Master Linux basic commands.

Understand system fundamentals.

Learn basic fault handling.

Intermediate Stage (2‑5 years)

Dive deep into kernel internals.

Master performance tuning.

Build a monitoring system.

Advanced Stage (5+ years)

Architectural design capability.

Automation and IaC.

Design disaster‑recovery solutions.

15.2 Recommended Learning Resources

Must‑Read Books

"Understanding the Linux Kernel"

"The Performance Handbook"

"Site Reliability Engineering: How Google Runs Production Systems"

"Linux Performance Tuning"

Online Resources

Linux Performance (Brendan Gregg’s blog)

High Scalability

SREWeekly Newsletter
