8 Real-World Production Failures and Fast Diagnosis Techniques
The article shares eight authentic production incident cases, from JVM Full GC spikes, a memory leak, and a message-queue idempotency bug to a cache avalanche, disk I/O blocking, a database deadlock, DNS hijacking, and bandwidth exhaustion, detailing root causes, step-by-step troubleshooting methods, code snippets, and practical mitigation strategies for engineers.
Failure 1: JVM Frequent Full GC
Common triggers for frequent Full GC include memory leaks, infinite loops, and large objects, with large objects accounting for over 80% of cases. Large objects typically come from oversized result sets in databases (MySQL, MongoDB), third-party API responses, or oversized messages in queues. In the described incident, a POP service began triggering frequent Full GCs without any new deployment. A traditional heap dump via jmap -dump:format=b,file=<filename> [pid] was too slow to analyze, so the team correlated a spike in database network I/O with the GC events, pinpointing a massive SQL query: missing mandatory parameters had produced a query that returned tens of thousands of rows. The problematic MyBatis SQL used conditional if tests without validating that at least one filter was present, so the query could run with neither an orderID nor a userID filter.
<select id="selectOrders" resultType="com.***.Order">
    select * from user where 1=1
    <if test="orderID != null">
        and order_id = #{orderID}
    </if>
    <if test="userID != null">
        and user_id = #{userID}
    </if>
    <if test="startTime != null">
        and create_time &gt;= #{startTime}
    </if>
    <if test="endTime != null">
        and create_time &lt;= #{endTime}
    </if>
</select>

After splitting the SQL into two separate statements, one filtering by orderID and the other by userID, the issue was resolved within five minutes.
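The split described above can be sketched as two separate mapper statements. Table and column names follow the original snippet; the statement ids are illustrative, not the team's actual code:

```xml
<select id="selectOrderByOrderId" resultType="com.***.Order">
    select * from user
    where order_id = #{orderID}
</select>

<select id="selectOrdersByUserId" resultType="com.***.Order">
    select * from user
    where user_id = #{userID}
    <if test="startTime != null">
        and create_time &gt;= #{startTime}
    </if>
    <if test="endTime != null">
        and create_time &lt;= #{endTime}
    </if>
</select>
```

With this split, each statement carries a mandatory filter in its where clause, so a request can no longer reach the database without at least one selective condition.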
Failure 2: Memory Leak
A memory leak occurs when objects are never released, so the heap grows gradually; unlike an immediate out-of-memory error, the application keeps running until the leak exhausts the heap. In this case, a custom local cache stored all product data without any expiration, eventually filling the JVM heap. A heap dump taken with jmap and analyzed in Eclipse MAT identified the cache as the culprit. Adding a 7-day TTL and restarting the services eliminated the leak.
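The fix can be sketched as a small TTL-bearing cache wrapper. This is a minimal illustration, not the incident's actual code: class and method names are invented, and it assumes lazy eviction on read is acceptable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal local cache with a per-entry TTL, so stale entries cannot
// accumulate on the heap forever (names are illustrative).
public class TtlCache<K, V> {

    private static class Entry<V> {
        final V value;
        final long expiresAtMillis;
        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    /** Returns null for missing or expired entries; expired entries are
     *  removed lazily on read so the heap cannot grow without bound. */
    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() >= e.expiresAtMillis) {
            map.remove(key, e);
            return null;
        }
        return e.value;
    }
}
```

A production version would also want a background sweep (or an off-the-shelf cache such as Caffeine or Guava Cache) so that entries that are never read again still get evicted.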
Failure 3: Idempotency Issue
During order completion, a message queue could deliver duplicate messages, causing the same order to award points multiple times. The fix was to introduce an idempotent check: before adding points, query a points‑record table for the order; only add points if no record exists. This pattern is essential for any retry‑prone operation, such as payment processing.
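The check-before-award pattern might look like the following sketch. The in-memory set stands in for the points-record table; in a real system the same effect comes from a unique constraint on the order id, and all names here are illustrative.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "check the points record before awarding" pattern
// described above (class and field names are invented).
public class PointsAwarder {

    // Stand-in for the points-record table keyed by order id.
    private final Set<String> awardedOrders = ConcurrentHashMap.newKeySet();

    /** Returns true if points were awarded, false if this order was
     *  already processed (i.e., a duplicate message). */
    public boolean awardPoints(String orderId, int points) {
        // add() is atomic: only the first delivery of a given order wins,
        // which is what makes redelivery by the message queue safe.
        if (!awardedOrders.add(orderId)) {
            return false; // record already exists, skip the award
        }
        // ... credit `points` to the user's account here ...
        return true;
    }
}
```

Against a real database, the insert into the points-record table and the points credit should share one transaction, so a crash between the two cannot leave a record without an award.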
Failure 4: Cache Avalanche
When the cache's expiration policy was removed during a refactor, all product data accumulated in the cache, and a later burst of simultaneous cache misses sent massive traffic straight to the MySQL database, overwhelming CPU, I/O, and Redis. The fix is to stagger cache TTLs (e.g., a 24 h base plus a random 0-3600 s offset) so that entries do not all expire at the same moment.
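The staggered TTL fits in a few lines. The 24 h base and 3600 s jitter follow the numbers above; the class name is illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;

// Staggered-TTL helper: a fixed base plus uniform random jitter, so keys
// written at the same time do not all expire in the same instant.
public class CacheTtl {

    static final long BASE_TTL_SECONDS = 24 * 3600;   // 24 h base
    static final long MAX_JITTER_SECONDS = 3600;      // up to 1 h of jitter

    /** TTL in seconds: base plus uniform random jitter in [0, 3600). */
    public static long randomizedTtlSeconds() {
        return BASE_TTL_SECONDS
                + ThreadLocalRandom.current().nextLong(MAX_JITTER_SECONDS);
    }
}
```

The returned value would then be passed as the expiry argument when writing each key (for Redis, the seconds argument of SETEX or EXPIRE).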
Failure 5: Disk I/O Causing Thread Blocking
Intermittent latency spikes were traced to threads blocked in Logback logging calls. A shell script captured jstack output every five seconds, rotating the log file every 100 snapshots. Analyzing the captured stacks showed heavy synchronous logging as the bottleneck; switching to asynchronous logging resolved the issue.
#!/bin/bash
# Capture a jstack snapshot of the target JVM every 5 seconds,
# rotating the output file every 100 snapshots.
num=0
log="/tmp/jstack_thread_log/thread_info"
mkdir -p /tmp/jstack_thread_log
while ((num <= 10000)); do
    # PID of the java process whose command line contains "gaea"
    ID=$(ps -ef | grep java | grep gaea | grep -v "grep" | awk '{print $2}')
    if [ -n "$ID" ]; then
        jstack "$ID" >> "${log}"
    fi
    num=$((num + 1))
    # Every 100 snapshots, rotate the log to a numbered file
    if ((num % 100 == 0)); then
        mv "${log}" "${log}${num}"
    fi
    sleep 5
done
Failure 6: Database Deadlock
Two concurrent SQL operations—an hourly job canceling unpaid orders (locking created_time index then primary key) and a manual batch cancel (locking primary key directly)—acquired locks in opposite orders, causing a classic deadlock. Aligning lock acquisition order or processing cancellations one by one eliminates the deadlock.
update t_order set status='CANCELLED' where created_time > '2020-01-01 08:00:00' and created_time < '2020-01-01 10:00:00' and status='UNPAID';
update t_order set status='CANCELLED' where id in (2,3,5) and status='UNPAID';
Failure 7: DNS Hijacking
Serving CDN assets over plain HTTP left them open to DNS hijacking, and product images were replaced with ads in transit. Switching CDN resources to HTTPS mitigated the attack, though additional safeguards such as backup domains are recommended.
Failure 8: Bandwidth Exhaustion
A promotional campaign caused a sudden surge in QR‑code generation, overwhelming network bandwidth because each QR code is an image. Moving QR‑code generation to the client side (Android, iOS, React SDKs) offloaded bandwidth and CPU from the server.
All cases illustrate practical troubleshooting techniques—monitoring, heap dumps, log analysis, SQL lock ordering, cache TTL design, and client‑side offloading—that can help engineers quickly resolve production incidents.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.