
How to Rapidly Diagnose and Resolve Common Online Service Failures

This guide walks through practical troubleshooting steps for typical production incidents—including disk exhaustion, high CPU, Java OOM, MySQL deadlocks and slow queries, Redis memory alerts, network TCP issues, and business‑log analysis—providing concrete commands, diagrams and mitigation strategies for each layer.

dbaplus Community

Server layer

Disk full

When java.io.IOException: Disk space full appears, run:

df -h

Identify the mount with the highest usage, then drill down:

du -sh *

Optionally list large files:

ls -lh

CPU high

Use top to find the PID with the highest CPU usage, then inspect its threads:

top -H -p <PID>

Convert the ID of the busiest thread to hexadecimal and search for it in the jstack output (the nid field of each thread entry) to locate the offending Java class.
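The hexadecimal conversion can be done with printf. The following is a minimal sketch; the function name tid_to_hex, the PID 18760, and the thread ID 18765 are hypothetical, and the jstack lookup is only correct if your JDK prints nid values in hexadecimal.

```shell
# Convert a decimal thread ID (as shown by `top -H`) to the 0x-prefixed
# hexadecimal form used for the nid field in jstack thread dumps.
tid_to_hex() {
  printf '0x%x\n' "$1"
}

# Hypothetical usage: process 18760, busy thread 18765.
# jstack 18760 | grep -A 20 "nid=$(tid_to_hex 18765)"
```

The grep -A 20 in the usage comment prints the matching thread header plus the first lines of its stack trace, which is usually enough to see the class and method burning CPU.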

Application layer

Tomcat hang / OOM

Detect OOM via logs, then capture a heap dump:

/data/program/jdk/bin/jmap -dump:live,format=b,file=/home/www/jmaplogs/jmap-8001-2.bin 18760

Compress the .bin file, transfer it to a workstation and open it with Eclipse Memory Analyzer (MAT). Look for object types with unusually high instance counts (e.g., java.lang.Object[810325]) which often indicate unbounded collection growth.
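When transferring a full dump is impractical, a class histogram gives a quick first look at instance counts. The sketch below assumes the text layout of jmap -histo output (rank, instance count, bytes, class name); the function name top_by_instances is hypothetical.

```shell
# Read `jmap -histo` output on stdin and print the classes with the most
# instances as "<instances> <class>". jmap itself orders rows by bytes,
# so this re-sorts by the instance-count column.
# $1 = number of rows to print (default 10).
top_by_instances() {
  awk '$1 ~ /^[0-9]+:$/ { print $2, $4 }' |
    sort -k1,1nr |
    head -n "${1:-10}"
}

# Hypothetical usage against the PID from the jmap example:
# /data/program/jdk/bin/jmap -histo:live 18760 | top_by_instances 20
```

Header, separator, and total lines are skipped because their first field is not a rank such as "1:".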

MySQL layer

Deadlock

show variables like 'tx_isolation';
show engine innodb status;

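Examine the LATEST DETECTED DEADLOCK section in the output of show engine innodb status to identify the conflicting transactions and the locked rows. On MySQL 8.0 the isolation-level variable is named transaction_isolation, because tx_isolation was removed.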
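The status output is long, so it helps to cut it down to the deadlock report. This is a minimal sketch that assumes the usual section layout, in which the LATEST DETECTED DEADLOCK section is followed by the TRANSACTIONS section; the function name extract_deadlock is hypothetical.

```shell
# Read `show engine innodb status` text on stdin and print only the body
# of the LATEST DETECTED DEADLOCK section, without the dashed rulers.
extract_deadlock() {
  awk '
    /^LATEST DETECTED DEADLOCK$/ { in_section = 1; next }
    in_section && /^TRANSACTIONS$/ { exit }   # next section starts here
    in_section && !/^-+$/ { print }
  '
}

# Hypothetical usage (connection options omitted):
# mysql -e 'show engine innodb status\G' | extract_deadlock
```

If no deadlock has occurred since the server started, the section is absent and the function prints nothing.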

Slow query

Enable the slow-query log if it is not already on, then run:

explain <your_sql>;

Typical remedies:

Add missing indexes.

Rewrite the query to reduce full‑table scans.

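If the slowness is caused by lock contention, inspect the InnoDB transaction and lock tables. The queries below apply to MySQL 5.x; on MySQL 8.0, innodb_locks and innodb_lock_waits are replaced by performance_schema.data_locks and performance_schema.data_lock_waits.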

select * from information_schema.innodb_trx;
select * from information_schema.innodb_locks;
select * from information_schema.innodb_lock_waits;

Too many connections

Raise the connection limit at runtime:

set global max_connections = 500;

Identify and terminate idle or long-running sessions:

mysql -e 'show processlist' | awk -F'\t' 'NR > 1 {print $4}' | sort | uniq -c | sort -rn | head -10
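The command above counts sessions by the fourth processlist column, which is the database name. Grouping by client host is often more useful for finding a misbehaving application server. The sketch below assumes tab-separated mysql batch output with a header row; the function name sessions_by_host is hypothetical.

```shell
# Read tab-separated `show processlist` output (with header) on stdin and
# print "<count> <client host>" ordered by count, highest first.
# $1 = number of rows to print (default 10).
sessions_by_host() {
  awk -F'\t' '
    NR > 1 {
      host = $3
      sub(/:[0-9]+$/, "", host)   # strip the client port
      n[host]++
    }
    END { for (h in n) print n[h], h }
  ' | sort -k1,1nr -k2,2 | head -n "${1:-10}"
}

# Hypothetical usage (connection options omitted):
# mysql -e 'show processlist' | sessions_by_host 10
```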

Redis layer

Memory alerts

Set a memory limit (typically 70‑75% of physical RAM) and an eviction policy:

config set maxmemory 4gb
config set maxmemory-policy allkeys-lru
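The 70-75% guideline can be turned into a concrete byte value. This is a sketch under the assumption that Redis is the main consumer of memory on a Linux host; the function name maxmemory_bytes is hypothetical.

```shell
# Print a maxmemory value in bytes.
# $1 = total physical RAM in kB (the unit used by /proc/meminfo)
# $2 = percentage of RAM to give Redis (default 70)
maxmemory_bytes() {
  total_kb=$1
  percent=${2:-70}
  echo $(( total_kb * 1024 * percent / 100 ))
}

# Hypothetical usage on Linux:
# total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# redis-cli config set maxmemory "$(maxmemory_bytes "$total_kb" 70)"
```

A value set with config set lasts only until the next restart unless it is also written to redis.conf.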

Find large keys with the built-in --bigkeys option or debug object:

redis-cli --bigkeys

Slow commands

Configure the slow‑log threshold and length:

config set slowlog-log-lower-than 1000   # microseconds
config set slowlog-max-len 200

Retrieve entries:

slowlog get

Node failure recovery

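Identify the dead node:

cluster nodes

Remove it, addressing the cluster through any node that is still reachable:

redis-trib.rb del-node <existing_node_ip:port> <dead_node_id>

Clean its data files, restart the instance, then add it back as a replica:

redis-trib.rb add-node --slave --master-id <master_id> <new_node_ip:port> <existing_node_ip:port>

On Redis 5 and later, redis-trib.rb is replaced by redis-cli --cluster, which offers the same del-node and add-node subcommands; the replica options there are --cluster-slave and --cluster-master-id.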
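Picking the dead node out of a large cluster nodes listing is easier with a filter. The sketch below assumes the standard output layout, where the third field holds comma-separated flags; the function name failed_nodes is hypothetical.

```shell
# Read `cluster nodes` output on stdin and print "<node id> <address>"
# for every node flagged as failed. Nodes that are only suspected
# (flag "fail?") are deliberately left out.
failed_nodes() {
  awk '$3 ~ /(^|,)fail(,|$)/ { print $1, $2 }'
}

# Hypothetical usage (host and port are placeholders):
# redis-cli -h <existing_node_ip> -p <port> cluster nodes | failed_nodes
```

The first column of the result is the node ID expected by the del-node step.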

Network layer

Connection spikes / SYN flood

List half-open connections:

netstat -nap | grep SYN_RECV

Check overall TCP state distribution:

netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

Identify sources of excessive TIME_WAIT sockets:

netstat -n | grep TIME_WAIT | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -nr | head -10
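The same per-peer count works for any TCP state, including SYN_RECV during a suspected flood. The sketch below assumes Linux netstat -n output, where the fifth field is the foreign address and the sixth is the state; the function name top_peers is hypothetical. It strips only the trailing port, so IPv6 peers are kept whole.

```shell
# Read `netstat -n` output on stdin and print "<count> <peer address>"
# for sockets in the given TCP state, ordered by count, highest first.
# $1 = state (for example TIME_WAIT or SYN_RECV)
# $2 = number of rows to print (default 10)
top_peers() {
  state=$1
  awk -v state="$state" '
    /^tcp/ && $6 == state {
      peer = $5
      sub(/:[0-9]+$/, "", peer)   # drop only the trailing :port
      count[peer]++
    }
    END { for (p in count) print count[p], p }
  ' | sort -k1,1nr -k2,2 | head -n "${2:-10}"
}

# Hypothetical usage:
# netstat -n | top_peers SYN_RECV 10
```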

Business log analysis

A typical log line includes a timestamp, a trace ID, and an error code. To locate an exception, grep the error log for the stack trace, extract the traceId, then search the business log for that ID:

grep -n "java.lang.reflect.InvocationTargetException" error.log
grep -n '489d71fe-67db-4f59-a916-33f25d35cab8' biz.log
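The two lookups can be chained so the trace ID does not have to be copied by hand. This is a sketch under two assumptions that may not match your log format: the trace ID is a lowercase UUID, and it appears on the same line as the exception name. The function names trace_ids_for and biz_lines_for are hypothetical, and grep -o requires GNU or BSD grep.

```shell
# Print the distinct UUID-shaped trace IDs found on lines of a log file
# that contain the given fixed string.
# $1 = text to search for (for example an exception class name)
# $2 = error log path
trace_ids_for() {
  grep -F -- "$1" "$2" |
    grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' |
    sort -u
}

# Print, with line numbers, every business-log line for those trace IDs.
# $1 = text to search for, $2 = error log path, $3 = business log path
biz_lines_for() {
  trace_ids_for "$1" "$2" | while IFS= read -r id; do
    grep -n -F -- "$id" "$3"
  done
}

# Hypothetical usage:
# biz_lines_for 'java.lang.reflect.InvocationTargetException' error.log biz.log
```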

After pinpointing the relevant lines, trace back to the source code for deeper analysis.
