Master Online Fault Diagnosis: Server, Java, MySQL, Redis & Network Tips
This guide walks you through common online failures—from disk full and high CPU on servers to Java Tomcat hangs, MySQL deadlocks, slow queries, Redis memory alerts, and network connection issues—providing step‑by‑step troubleshooting methods and practical commands to quickly locate and resolve problems.
Common online faults are summarized and a systematic troubleshooting process is presented for server, Java application, database, Redis, network, and business‑log layers.
1. Server Layer
1.1 Disk
Symptoms: java.io.IOException: Disk space insufficient or similar alerts. Use df -h to view filesystem usage, identify the path with the largest consumption, then du -sh * to find the biggest directories and ls -lh to locate large log files. Delete or compress them to free space.
Related commands:
df : shows disk usage per filesystem.
du : displays directory size.
ls : lists file details.
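The first step above (finding the filesystem that is nearly full) can be sketched as a small filter. This is a minimal sketch, assuming the six-column layout printed by `df -P`; the function name `full_filesystems` and the 90% default are illustrative choices, not part of the original procedure.

```shell
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` style output on stdin; the function name is hypothetical.
full_filesystems() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)              # "93%" -> "93"
        if (use + 0 >= t + 0) print $6, use "%"
    }'
}

# Typical use on a live host:
#   df -P | full_filesystems 90
```

Mount points containing spaces are not handled by this sketch. From the reported mount point, du -sh * and ls -lh narrow the search as described above.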
1.2 CPU High
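Symptoms: API latency spikes and monitoring alarms. Run top to find the process with the highest CPU usage (e.g., PID 14201). Use top -H -p <pid> to locate the hot thread, convert its thread ID to hexadecimal (printf '%x\n' <tid>), and run jstack <pid> | grep -A 20 'nid=0x<hex>' to print that thread's stack trace and pinpoint the offending Java class. If the JVM does not respond, older JDKs also accept jstack -F <pid> to force a dump.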
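The thread-ID conversion is the step most often done by hand, so it is worth scripting. The following is a rough sketch, assuming a jstack dump that prints native thread IDs as hexadecimal `nid=0x...` values (the JDK 8 era format); the function names and the 20 lines of context are illustrative.

```shell
# Convert a decimal thread ID from `top -H` into the hex "nid" form used by jstack.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# Print the stack of one thread from a jstack dump read on stdin.
# $1 is the decimal thread ID; the trailing space in the pattern avoids
# matching a longer nid that merely starts with the same digits.
thread_stack() {
    grep -A 20 "$(tid_to_nid "$1") "
}

# Typical use (PID 14201 comes from the text; TID 14210 is a made-up example):
#   jstack 14201 | thread_stack 14210
```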
2. Application Layer
2.1 Tomcat Hang
Problem: A Tomcat node stops reporting metrics; logs show intermittent output and eventually an OutOfMemoryError: Java heap space. Capture a live heap dump with jmap -dump:live,format=b,file=/path/jmap.bin <pid>, compress it, and analyze with MAT. The analysis reveals a memory leak caused by excessive softItem objects (over 3 million) retained in a loop, leading to OOM.
3. MySQL
3.1 Deadlock
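Problem: Deadlock found when trying to get lock; try restarting transaction. Check the transaction isolation level with SELECT @@transaction_isolation (SELECT @@tx_isolation on MySQL 5.7 and earlier; that variable was removed in 8.0). Run SHOW ENGINE INNODB STATUS and read the LATEST DETECTED DEADLOCK section, which lists the transactions involved, the locks each holds, and the lock each is waiting for. Here the deadlock is caused by two transactions holding shared (S) locks on the same row while each tries to acquire an exclusive (X) lock.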
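The InnoDB status report is long, and only one section matters for this problem. Below is a minimal sketch of a filter that keeps just that section, assuming the usual report layout in which the LATEST DETECTED DEADLOCK title is followed later by a TRANSACTIONS title; the function name is hypothetical.

```shell
# Print only the "LATEST DETECTED DEADLOCK" section of InnoDB status text
# read on stdin. Printing starts at the section title and stops just before
# the next section title (TRANSACTIONS).
deadlock_section() {
    awk '
        /^LATEST DETECTED DEADLOCK$/ { show = 1 }
        /^TRANSACTIONS$/             { show = 0 }
        show
    '
}

# Typical use (connection options omitted; --raw keeps newlines unescaped):
#   mysql --raw -e 'SHOW ENGINE INNODB STATUS\G' | deadlock_section
```

If no deadlock has occurred since the server started, the section is absent and the filter prints nothing.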
3.2 Slow SQL
Problem: TPS drops and queries time out. Use EXPLAIN to view the execution plan. If the plan shows type=ALL, the query is scanning the whole table; add appropriate indexes. If the query is blocked by locks, inspect information_schema.innodb_trx, innodb_locks, and innodb_lock_waits to identify blocking transactions (MySQL 8.0 replaces the last two with performance_schema.data_locks and data_lock_waits).
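Checking a plan for full scans can be automated when EXPLAIN is run through the mysql client. This is a sketch only, assuming tab-separated batch-mode output with a header row that contains the table, type, and rows columns; the function name is illustrative.

```shell
# Report tables that EXPLAIN says are read with a full scan (type=ALL).
# Reads tab-separated EXPLAIN output on stdin and locates the columns
# by name from the header row.
full_scans() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["type"]) == "ALL" {
            print $(col["table"]), "rows=" $(col["rows"])
        }
    '
}

# Typical use (database name and query are placeholders):
#   mysql --batch -e 'EXPLAIN SELECT ...' mydb | full_scans
```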
3.3 Too Many Connections
Problem: Too many connections error. Increase max_connections (e.g., SET GLOBAL max_connections=1000) or kill idle sessions using SHOW PROCESSLIST and KILL <id>.
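Killing idle sessions by hand is error-prone, so one option is to generate the KILL statements and review them first. A minimal sketch, assuming tab-separated SHOW PROCESSLIST output with its standard header (Id, Command, Time, and so on); the function name and the 600-second default are assumptions.

```shell
# Emit KILL statements for sessions that have been idle (Command = Sleep)
# longer than a threshold in seconds. Reads tab-separated SHOW PROCESSLIST
# output on stdin. Nothing is executed: the statements are only printed.
idle_session_kills() {
    max_idle="${1:-600}"
    awk -F '\t' -v max="$max_idle" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["Command"]) == "Sleep" && $(col["Time"]) + 0 > max + 0 {
            print "KILL " $(col["Id"]) ";"
        }
    '
}

# Typical use (connection options omitted):
#   mysql --batch -e 'SHOW PROCESSLIST' | idle_session_kills 600
```

Raising max_connections only buys time; a steadily growing number of sleeping sessions usually points to a connection leak or an oversized pool in the application.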
3.4 Related Knowledge
Brief overview of InnoDB storage engine, B+Tree index structure, MVCC, transaction isolation levels (READ COMMITTED, REPEATABLE READ), lock types (table lock, row lock, intention lock, gap lock) and their impact on concurrency.
4. Redis
4.1 Memory Alerts
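When the server returns OOM command not allowed when used memory > 'maxmemory', Redis has reached its maxmemory limit and the eviction policy cannot free space (the default policy, noeviction, never evicts). Set maxmemory to a value that fits the host (recommended ~75% of physical RAM) and choose an eviction policy such as allkeys-lru in redis.conf or via CONFIG SET maxmemory ….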
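The 75% guideline translates into a byte value as follows. This is a small sketch, assuming a Linux host where /proc/meminfo reports MemTotal in kB; the function name is hypothetical, and the commented redis-cli lines target whichever instance redis-cli connects to by default.

```shell
# Compute 75% of total RAM in bytes from a MemTotal value given in kB
# (the unit used by /proc/meminfo on Linux).
redis_maxmemory_bytes() {
    echo $(( $1 * 1024 * 3 / 4 ))
}

# Typical use on Linux:
#   total_kb="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
#   redis-cli CONFIG SET maxmemory "$(redis_maxmemory_bytes "$total_kb")"
#   redis-cli CONFIG SET maxmemory-policy allkeys-lru
```

Settings applied with CONFIG SET are lost on restart unless they are also written to redis.conf.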
4.2 Large Keys
Use tools like redis-cli --bigkeys or redis-memory-for-key to list the biggest keys and their memory consumption, then delete or redesign them.
4.3 Slow Commands
Set the slow log threshold in microseconds with slowlog-log-slower-than (for example, slowlog-log-slower-than 1000 logs commands that take longer than 1 ms) and view recent slow commands via SLOWLOG GET.
4.4 Connection Limits
Adjust maxclients in redis.conf or with CONFIG SET maxclients … to allow more concurrent connections.
4.5 Node Failure Recovery
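When a cluster node shows disconnected, run CLUSTER FORGET <node_id> on each remaining node, then add a new node with redis-trib.rb add-node --slave --master-id … and restart the Redis process. On Redis 5 and later, redis-trib.rb is replaced by redis-cli --cluster add-node with the --cluster-slave and --cluster-master-id options.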
5. Network
5.1 Diagnosis Process
503 errors may be caused by SYN flood attacks (many SYN_RECV entries) or abnormal TCP states. Use netstat -nap | grep SYN_RECV to detect attacks, and run
netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
to inspect connection states (TIME_WAIT, CLOSE_WAIT, etc.).
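When many SYN_RECV entries show up, grouping them by remote address helps tell a single noisy client from a distributed flood. A rough sketch, assuming the Linux netstat column layout (foreign address in column 5, state in column 6) and IPv4 addresses; the function name is illustrative.

```shell
# Count half-open (SYN_RECV) connections per remote IPv4 address.
# Reads `netstat -n` style lines on stdin; prints "count ip", highest first.
syn_recv_by_source() {
    awk '$1 ~ /^tcp/ && $6 == "SYN_RECV" {
        split($5, peer, ":")        # "203.0.113.5:51234" -> peer[1] is the IP
        count[peer[1]]++
    }
    END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Typical use:
#   netstat -n | syn_recv_by_source | head
```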
5.2 Knowledge
Explanation of the TCP three‑way handshake, four‑way termination, the purpose of TIME_WAIT (2 MSL) to handle a lost final ACK and let stale segments expire, and how SYN floods fill the queue of half‑open connections, which is why monitoring SYN_RECV counts helps detect them.
6. Business Exception Logs
6.1 Problem
Business logs trigger alerts (e.g., java.lang.reflect.InvocationTargetException). Locate the traceId in the error log, then grep the same ID in the business log to obtain the full request flow.
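The traceId lookup can be wrapped in a helper that searches several log files at once. A minimal sketch, assuming the ID appears verbatim in each relevant line and a grep that supports -H (GNU and BSD grep do); the function name, the ID, and the file names in the usage comment are placeholders.

```shell
# Print every line that mentions a trace ID across one or more log files,
# prefixed with the file name, so the request can be followed end to end.
trace_lines() {
    trace_id="$1"
    shift
    grep -H -F -- "$trace_id" "$@"
}

# Typical use (ID and paths are placeholders):
#   trace_lines 7f3a9c error.log business.log
```

The -F flag treats the ID as a fixed string, so characters such as dots in the ID are not interpreted as regular-expression syntax.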
6.2 Analysis
Correlate the extracted log lines with source code (e.g., class and method names) to identify the root cause and apply the appropriate fix.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
