How to Rapidly Diagnose and Resolve Common Online Service Failures
This guide walks through practical troubleshooting steps for typical production incidents: disk exhaustion, high CPU, Java OOM, MySQL deadlocks and slow queries, Redis memory alerts, network TCP issues, and business-log analysis. It gives concrete commands and mitigation strategies for each layer.
Server layer
Disk full
When java.io.IOException: Disk space full appears, check usage per mount:
df -h
Identify the mount with the highest usage, change into it, then drill down by directory:
du -sh *
Optionally list large files:
ls -lh
CPU high
Use top to find the PID with the highest CPU usage, then inspect its threads:
top -H -p <PID>
Convert the busiest thread's ID to hexadecimal and search for it in the jstack output (where it typically appears as nid=0x...) to locate the offending Java class.
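The conversion step can be sketched in shell as follows. This is a minimal illustration, not a definitive tool: tid_to_hex and thread_stack are hypothetical helper names, the 20 lines of context are an arbitrary choice, and the sketch assumes your JDK prints the native thread ID in hexadecimal as nid=0x... in jstack output.

```shell
# Convert a decimal thread ID (as shown by `top -H`) to the lowercase hex
# form that jstack prints in its "nid=0x..." field.
tid_to_hex() {
    printf '%x\n' "$1"
}

# Hypothetical helper: print the stack of one thread from a live jstack dump.
# $1 = Java process PID, $2 = decimal thread ID taken from `top -H -p <PID>`.
# The trailing space in the pattern avoids matching longer IDs that merely
# start with the same hex digits.
thread_stack() {
    jstack "$1" | grep -A 20 "nid=0x$(tid_to_hex "$2") "
}

# Usage (both numbers are placeholders):
#   thread_stack <PID> <TID>
```

Only tid_to_hex is pure; thread_stack needs jstack on the PATH and permission to attach to the target JVM.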
Application layer
Tomcat hang / OOM
Detect OOM via logs, then capture a heap dump:
/data/program/jdk/bin/jmap -dump:live,format=b,file=/home/www/jmaplogs/jmap-8001-2.bin 18760
Compress the .bin file, transfer it to a workstation, and open it with Eclipse Memory Analyzer (MAT). Look for object types with unusually high instance counts (e.g., java.lang.Object[810325]), which often indicate unbounded collection growth.
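Before pulling a full dump into MAT, a class histogram can give a quick first look at instance counts. The sketch below is not part of the original procedure: big_classes is a hypothetical function name, the default threshold of 100000 is arbitrary, and it assumes the usual jmap -histo column layout (rank, instances, bytes, class name). Note that -histo:live, like -dump:live, forces a full GC on the target JVM.

```shell
# Read `jmap -histo` output on stdin and print "instances class" for every
# class with at least $1 instances (default 100000, an arbitrary threshold).
# Header and separator lines are skipped because their first field is not
# a rank such as "1:".
big_classes() {
    awk -v min="${1:-100000}" \
        '$1 ~ /^[0-9]+:$/ && $2 + 0 >= min { print $2, $4 }'
}

# Usage (18760 is the PID used in the dump command):
#   /data/program/jdk/bin/jmap -histo:live 18760 | big_classes 500000
```

A class that dominates this list is a reasonable first candidate to inspect in MAT's dominator tree.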
MySQL layer
Deadlock
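show variables like 'tx_isolation';
show engine innodb status;
Examine the LATEST DETECTED DEADLOCK section in the output of show engine innodb status to identify the conflicting transactions and the locked rows. On MySQL 8.0 the variable is named transaction_isolation, since tx_isolation was removed.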
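The status output is long, so it can help to cut it down to the deadlock report. A minimal sketch, assuming the report starts at a line reading LATEST DETECTED DEADLOCK and is followed by the TRANSACTIONS section; deadlock_section is a hypothetical helper name.

```shell
# Read `show engine innodb status` output on stdin and print only the
# deadlock report: from the "LATEST DETECTED DEADLOCK" header line up to,
# but not including, the "TRANSACTIONS" header line.
deadlock_section() {
    awk '/^LATEST DETECTED DEADLOCK$/ { on = 1 }
         /^TRANSACTIONS$/             { on = 0 }
         on'
}

# Usage (add credentials and host options as needed):
#   mysql -e 'show engine innodb status\G' | deadlock_section
```

If no deadlock has been recorded since the server started, the section is absent and the function prints nothing.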
Slow query
Enable the slow-query log if it is not already on, then run the suspect statement through explain:
explain <your_sql>;
Typical remedies:
Add missing indexes.
Rewrite the query to reduce full‑table scans.
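To spot full-table scans quickly, the explain output can be filtered for rows whose access type is ALL. The following is a sketch under assumptions: full_scan_tables is a hypothetical helper, and the input is the tab-separated output, header row included, that the mysql client prints in batch mode.

```shell
# Read tab-separated EXPLAIN output (as printed by `mysql -e`) on stdin and
# print the name of every table accessed with type=ALL, i.e. a full scan.
# Columns are located by header name because their positions differ
# between MySQL versions.
full_scan_tables() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["type"]) == "ALL" { print $(col["table"]) }
    '
}

# Usage (database name and query are placeholders):
#   mysql <db> -e 'explain <your_sql>' | full_scan_tables
```

Each table it prints is a candidate for a new index or a rewritten predicate.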
If the slowness is caused by lock contention, inspect InnoDB lock tables:
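select * from information_schema.innodb_trx;
select * from information_schema.innodb_locks;
select * from information_schema.innodb_lock_waits;
On MySQL 8.0, innodb_locks and innodb_lock_waits were replaced by performance_schema.data_locks and performance_schema.data_lock_waits.
Too many connections
As a stopgap, raise the connection limit:
set global max_connections = 500;
Identify and terminate idle or long-running sessions: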
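mysql -e 'show processlist' | awk '{print $4}' | sort | uniq -c | sort -rn | head -10
show processlist is an SQL statement, so it has to go through the mysql client before it can be piped (add credentials and host options as needed). The pipeline counts connections by the fourth column of the output, which is the database name.
Redis layer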
Memory alerts
Set a memory limit (typically 70‑75% of physical RAM) and an eviction policy:
config set maxmemory 4gb
config set maxmemory-policy allkeys-lru
Find large keys with the built-in --bigkeys option or debug object:
redis-cli --bigkeys
Slow commands
Configure the slow‑log threshold and length:
config set slowlog-log-lower-than 1000
config set slowlog-max-len 200
The threshold is in microseconds, so 1000 logs every command that takes longer than 1 ms.
Retrieve entries:
slowlog get
Node failure recovery
Identify the dead node:
cluster nodes
Remove it, addressing the command to any live node in the cluster:
redis-trib.rb del-node <existing_node_ip:port> <dead_node_id>
Clean its data files, restart the instance, then add it back as a replica:
redis-trib.rb add-node --slave --master-id <master_id> <new_node_ip:port> <existing_node_ip:port>
Network layer
Connection spikes / SYN flood
netstat -nap | grep SYN_RECV
Check overall TCP state distribution:
netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
Identify sources of excessive TIME_WAIT sockets:
netstat -n | grep TIME_WAIT | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -nr | head -10
Business log analysis
A typical log line includes a timestamp, a trace ID, and an error code. To locate an exception, grep the error log for the stack trace, extract the trace ID, then search the business log for it:
cat error.log | grep -n "java.lang.reflect.InvocationTargetException"
cat biz.log | grep -n '489d71fe-67db-4f59-a916-33f25d35cab8'
After pinpointing the relevant lines, trace back to the source code for deeper analysis.
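The two lookups can be chained so the trace ID does not have to be copied by hand. This is a sketch under stated assumptions: trace_ids_for and trace_exception are hypothetical helpers, the trace ID is assumed to be a lowercase UUID, and it is assumed to sit on the same log line as the exception name. If your logs put the ID on a different line, the first function needs adjusting.

```shell
# Print every UUID-shaped trace ID found on lines of $1 (the error log)
# that mention the exception name $2, one per line, de-duplicated.
trace_ids_for() {
    grep -F -- "$2" "$1" |
        grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' |
        sort -u
}

# For each trace ID found, print the matching lines (with line numbers)
# from $3, the business log.
trace_exception() {
    trace_ids_for "$1" "$2" | while IFS= read -r id; do
        grep -n -F -- "$id" "$3"
    done
}

# Usage:
#   trace_exception error.log java.lang.reflect.InvocationTargetException biz.log
```

Fixed-string matching (grep -F) is used so that dots in the exception name are not treated as regex wildcards.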
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
