
Master Online Fault Diagnosis: Server, Java, MySQL, Redis & Network Tips

This guide walks you through common online failures—from disk full and high CPU on servers to Java Tomcat hangs, MySQL deadlocks, slow queries, Redis memory alerts, and network connection issues—providing step‑by‑step troubleshooting methods and practical commands to quickly locate and resolve problems.

Alibaba Cloud Developer

Common online faults are summarized and a systematic troubleshooting process is presented for server, Java application, database, Redis, network, and business‑log layers.

1. Server Layer

1.1 Disk

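Symptoms: java.io.IOException: Disk space insufficient or similar alerts. Run df -h to see usage per filesystem and identify the mount point that is nearly full. Change into that path and run du -sh * to find the largest directories, then ls -lh to locate large files such as logs. Delete or compress them to free space. Note that deleting a log file which a running process still holds open does not release the space until the process closes it; truncating the file frees it immediately.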

Related commands:

df : shows disk usage per filesystem.

du : displays directory size.

ls : lists file details.
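The du and ls steps above can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name largest_files is made up for illustration, and it reports apparent file size (st_size), which can differ from the block usage that du prints.

```python
import heapq
import os


def largest_files(root, limit=10):
    """Return up to `limit` (size_bytes, path) pairs under `root`, biggest first.

    Hypothetical helper: sizes are apparent sizes, not disk blocks as in `du`.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so a link is not counted
                # as if it were the file it points to.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return heapq.nlargest(limit, sizes)
```

Calling largest_files on the nearly full mount point, with a path taken from the df -h output, gives a short list of candidates to compress or truncate.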

1.2 CPU High

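Symptoms: API latency spikes and monitoring alarms. Run top to find the process with the highest CPU usage (e.g., PID 14201). Run top -H -p <pid> to find the busiest thread in that process, convert its thread ID to hexadecimal (for example with printf '%x\n' <tid>), and run jstack <pid> | grep -A 20 'nid=0x<hex>' to print that thread's stack trace and see which Java code is consuming the CPU. If the JVM is hung and does not respond, JDK 8 also accepts jstack -F <pid> to force a dump.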
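The thread-ID conversion and the search through the thread dump can be sketched as follows. This is an illustration only: the function names are made up, and it assumes the dump has been saved as text and that thread header lines carry a nid=0x… field, as in the jstack output this section describes.

```python
def tid_to_nid(tid):
    """Convert a decimal thread ID from `top -H` to jstack's nid=0x... form."""
    return "nid=" + hex(tid)


def stack_for_thread(jstack_output, tid):
    """Return the lines of the thread block whose header contains the nid.

    Assumes thread blocks are separated by blank lines. Matching whole
    whitespace-separated tokens avoids confusing nid=0x3782 with nid=0x37820.
    """
    marker = tid_to_nid(tid)
    block = []
    capturing = False
    for line in jstack_output.splitlines():
        if capturing:
            if not line.strip():
                break  # a blank line ends the current thread block
            block.append(line)
        elif marker in line.split():
            capturing = True
            block.append(line)
    return block
```

For a hypothetical hot thread with ID 14210, tid_to_nid(14210) returns nid=0x3782, and stack_for_thread returns that thread's header and stack frames.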

2. Application Layer

2.1 Tomcat Hang

Problem: A Tomcat node stops reporting metrics; logs show intermittent output and eventually an OutOfMemoryError: Java heap space. Capture a live heap dump with jmap -dump:live,format=b,file=/path/jmap.bin <pid>, compress it, and analyze with MAT. The analysis reveals a memory leak caused by excessive softItem objects (over 3 million) retained in a loop, leading to OOM.

3. MySQL

3.1 Deadlock

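Problem: Deadlock found when trying to get lock; try restarting transaction. Check the transaction isolation level with SELECT @@tx_isolation (SELECT @@transaction_isolation on MySQL 8.0, where the old variable name was removed). Run SHOW ENGINE INNODB STATUS and read the LATEST DETECTED DEADLOCK section to see which locks each transaction held and which it was waiting for. In this case, the deadlock is caused by two transactions that each hold a shared (S) lock on the same row and then each try to acquire an exclusive (X) lock on it.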
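The error message itself suggests restarting the transaction. A minimal, driver-agnostic sketch of that idea is shown below. It is not from the original article: txn and errno_of are hypothetical callables you would supply (one runs the whole transaction and rolls back on failure, the other extracts the MySQL error number from your driver's exception). The only MySQL fact it relies on is that a deadlock is reported as error 1213 (ER_LOCK_DEADLOCK).

```python
import time

MYSQL_DEADLOCK_ERRNO = 1213  # ER_LOCK_DEADLOCK


def run_with_deadlock_retry(txn, errno_of, attempts=3, backoff_s=0.05):
    """Run `txn()` and retry it only when it fails with a deadlock error.

    `txn` and `errno_of` are assumptions for illustration; adapt them to
    the database driver in use.
    """
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except Exception as exc:
            is_deadlock = errno_of(exc) == MYSQL_DEADLOCK_ERRNO
            if not is_deadlock or attempt == attempts:
                raise  # not a deadlock, or out of retries
            # Back off a little longer after each failed attempt.
            time.sleep(backoff_s * attempt)
```

Retrying only hides the symptom; the lock ordering shown in the InnoDB status output still needs to be fixed.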

3.2 Slow SQL

Problem: TPS drops and queries time out. Use EXPLAIN to view the execution plan. If the plan shows type=ALL, the query is scanning the whole table; add appropriate indexes. If the query is blocked by locks, inspect information_schema.innodb_trx, innodb_locks, and innodb_lock_waits to identify the blocking transactions (on MySQL 8.0 the latter two are replaced by performance_schema.data_locks and performance_schema.data_lock_waits).
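The type=ALL check can be automated once the EXPLAIN output is available as rows. The sketch below assumes each row is a dictionary keyed by EXPLAIN's column names (table, type), as a dictionary-style cursor would return; the function name is made up for illustration.

```python
def full_scan_tables(explain_rows):
    """Return the tables that EXPLAIN reports as full table scans (type=ALL).

    `explain_rows` is assumed to be an iterable of dicts keyed by the
    EXPLAIN column names.
    """
    return [
        row.get("table")
        for row in explain_rows
        if str(row.get("type", "")).upper() == "ALL"
    ]
```

Any table this returns is a candidate for a new index on the columns used in the query's WHERE or JOIN clauses.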

3.3 Too Many Connections

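Problem: Too many connections error. Use SHOW PROCESSLIST to find idle sessions and KILL <id> to end them, or raise the limit with, for example, SET GLOBAL max_connections=1000. A SET GLOBAL change is lost on restart unless the value is also set in the server configuration file.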

3.4 Related Knowledge

Brief overview of InnoDB storage engine, B+Tree index structure, MVCC, transaction isolation levels (READ COMMITTED, REPEATABLE READ), lock types (table lock, row lock, intention lock, gap lock) and their impact on concurrency.

4. Redis

4.1 Memory Alerts

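When the server returns OOM command not allowed when used memory > 'maxmemory', Redis has reached its maxmemory limit and cannot free memory under the current eviction policy. Set maxmemory appropriately (the recommendation here is about 75% of physical RAM) and choose an eviction policy such as allkeys-lru, either in redis.conf or at runtime via CONFIG SET maxmemory … and CONFIG SET maxmemory-policy allkeys-lru.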
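As a worked example of the 75% guideline, the sketch below computes a maxmemory value in bytes from the machine's physical RAM. The function name and the idea of passing the RAM size in by hand are assumptions for illustration; the fraction is the recommendation quoted above, not a Redis default.

```python
def suggested_maxmemory_bytes(physical_ram_bytes, fraction=0.75):
    """Return `fraction` of physical RAM, in bytes, as a maxmemory candidate."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(physical_ram_bytes * fraction)


GIB = 1024 ** 3
# A host with 16 GiB of RAM gets 12 GiB under the 75% guideline.
print(suggested_maxmemory_bytes(16 * GIB))  # 12884901888
```

The resulting number can be used directly as the byte value for maxmemory in redis.conf or in CONFIG SET maxmemory.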

4.2 Large Keys

Use tools like redis-cli --bigkeys or redis-memory-for-key to list the biggest keys and their memory consumption, then delete or redesign them.

4.3 Slow Commands

Set the slow log threshold in microseconds with slowlog-log-lower-than (for example, slowlog-log-lower-than 1000 logs commands that take longer than 1 ms), and view recent slow commands with SLOWLOG GET.

4.4 Connection Limits

Adjust maxclients in redis.conf or with CONFIG SET maxclients … to allow more concurrent connections.

4.5 Node Failure Recovery

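When a cluster node shows as disconnected, run CLUSTER FORGET <node_id> on each remaining node, then add a new node with redis-trib.rb add-node --slave --master-id … and restart the Redis process. On Redis 5 and later, redis-trib.rb is replaced by redis-cli --cluster add-node with the --cluster-slave and --cluster-master-id options.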

5. Network

5.1 Diagnosis Process

503 errors may be caused by SYN flood attacks (many connections in the SYN_RECV state) or by abnormal TCP states. Use netstat -nap | grep SYN_RECV to detect an attack, and use the following command to count connections per state (TIME_WAIT, CLOSE_WAIT, etc.):

netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
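For readers less familiar with awk, the same per-state tally can be sketched in Python. This is an equivalent illustration, not a replacement for the command: it assumes the netstat -n output has been captured as text, and the function name is made up.

```python
from collections import Counter


def tcp_state_counts(netstat_output):
    """Count TCP connections per state, mirroring the awk one-liner.

    Like `/^tcp/ {++S[$NF]}`, it looks only at lines that start with
    "tcp" (which includes tcp6) and uses the last field as the state.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        if line.startswith("tcp"):
            fields = line.split()
            counts[fields[-1]] += 1
    return counts
```

A large SYN_RECV count points toward a SYN flood, while many CLOSE_WAIT entries usually mean the application is not closing its sockets.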

5.2 Knowledge

Explanation of the TCP three‑way handshake and four‑way termination, the purpose of TIME_WAIT (it lasts 2 MSL so that the closing side can resend the final ACK if it was lost), and how monitoring half‑open connections helps detect and mitigate SYN attacks.

6. Business Exception Logs

6.1 Problem

Business logs trigger alerts (e.g., java.lang.reflect.InvocationTargetException). Locate the traceId in the error log, then grep the same ID in the business log to obtain the full request flow.
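The grep step can be sketched as a small filter. It assumes, as the text implies, that every log line belonging to a request contains the same traceId string; the function name and the sample ID in the usage note are hypothetical.

```python
def lines_for_trace(log_lines, trace_id):
    """Return, in order, the log lines that contain `trace_id`.

    `log_lines` can be any iterable of strings, including an open file.
    This is a plain substring match, like `grep -F`.
    """
    return [line.rstrip("\n") for line in log_lines if trace_id in line]
```

For example, passing an open business log and a traceId such as abc123 copied from the error log returns the full request flow for that one request.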

6.2 Analysis

Correlate the extracted log lines with source code (e.g., class and method names) to identify the root cause and apply the appropriate fix.
