8 Real-World Production Failures and Fast Diagnosis Techniques
The article shares eight authentic production incident cases, from JVM Full GC spikes, a memory leak, and a message-queue idempotency bug to a cache avalanche, disk I/O blocking, a database deadlock, DNS hijacking, and bandwidth exhaustion, detailing root causes, step-by-step troubleshooting methods, code snippets, and practical mitigation strategies for engineers.
Failure 1: JVM Frequent Full GC
Common triggers for frequent Full GC include memory leaks, infinite loops, and large objects, with large objects accounting for over 80% of cases. Large objects typically come from oversized result sets in databases (MySQL, MongoDB), third-party API responses, or oversized messages in queues. In the described incident, a POP service began triggering frequent Full GCs without any new deployment. A traditional heap dump via jmap -dump:format=b,file=<filename> [pid] was too slow to analyze, so the team correlated a spike in database network I/O with the GC events, pinpointing a massive SQL query: missing mandatory parameters had produced a query that returned tens of thousands of rows. The problematic MyBatis SQL used conditional if tests without validating that at least one filter was present, so the query could run with neither an orderID nor a userID filter.
<select id="selectOrders" resultType="com.***.Order">
    select * from user where 1=1
    <if test="orderID != null">
        and order_id = #{orderID}
    </if>
    <if test="userID != null">
        and user_id = #{userID}
    </if>
    <if test="startTime != null">
        and create_time &gt;= #{startTime}
    </if>
    <if test="endTime != null">
        and create_time &lt;= #{endTime}
    </if>
</select>

After splitting the SQL into two separate statements, one filtering by orderID and the other by userID, the issue was resolved within five minutes.
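The split described above can be sketched as two separate mapper statements. Table and column names follow the original snippet; the statement ids are illustrative, not the team's actual code:

```xml
<select id="selectOrderByOrderId" resultType="com.***.Order">
    select * from user
    where order_id = #{orderID}
</select>

<select id="selectOrdersByUserId" resultType="com.***.Order">
    select * from user
    where user_id = #{userID}
    <if test="startTime != null">
        and create_time &gt;= #{startTime}
    </if>
    <if test="endTime != null">
        and create_time &lt;= #{endTime}
    </if>
</select>
```

With this split, each statement carries a mandatory filter in its where clause, so a request can no longer reach the database without at least one selective condition.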
Failure 2: Memory Leak
A memory leak occurs when objects are never released, so the heap grows gradually; unlike an immediate out-of-memory error, the application keeps running until the leak exhausts the heap. In this case, a custom local cache stored all product data without any expiration, eventually filling the JVM heap. A heap dump taken with jmap and analyzed in Eclipse MAT identified the cache as the culprit. Adding a 7-day TTL and restarting the services eliminated the leak.
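The fix can be sketched as a small TTL-bearing cache wrapper. This is a minimal illustration, not the incident's actual code: class and method names are invented, and it assumes lazy eviction on read is acceptable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal local cache with a per-entry TTL, so stale entries cannot
// accumulate on the heap forever (names are illustrative).
public class TtlCache<K, V> {

    private static class Entry<V> {
        final V value;
        final long expiresAtMillis;
        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    /** Returns null for missing or expired entries; expired entries are
     *  removed lazily on read so the heap cannot grow without bound. */
    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() >= e.expiresAtMillis) {
            map.remove(key, e);
            return null;
        }
        return e.value;
    }
}
```

A production version would also want a background sweep (or an off-the-shelf cache such as Caffeine or Guava Cache) so that entries that are never read again still get evicted.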
Failure 3: Idempotency Issue
During order completion, a message queue could deliver duplicate messages, causing the same order to award points multiple times. The fix was to introduce an idempotent check: before adding points, query a points‑record table for the order; only add points if no record exists. This pattern is essential for any retry‑prone operation, such as payment processing.
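The check-before-award pattern might look like the following sketch. The in-memory set stands in for the points-record table; in a real system the same effect comes from a unique constraint on the order id, and all names here are illustrative.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "check the points record before awarding" pattern
// described above (class and field names are invented).
public class PointsAwarder {

    // Stand-in for the points-record table keyed by order id.
    private final Set<String> awardedOrders = ConcurrentHashMap.newKeySet();

    /** Returns true if points were awarded, false if this order was
     *  already processed (i.e., a duplicate message). */
    public boolean awardPoints(String orderId, int points) {
        // add() is atomic: only the first delivery of a given order wins,
        // which is what makes redelivery by the message queue safe.
        if (!awardedOrders.add(orderId)) {
            return false; // record already exists, skip the award
        }
        // ... credit `points` to the user's account here ...
        return true;
    }
}
```

Against a real database, the insert into the points-record table and the points credit should share one transaction, so a crash between the two cannot leave a record without an award.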
Failure 4: Cache Avalanche
When the cache's expiration policy was removed during a refactor, all product data accumulated in the cache, and a later burst of simultaneous cache misses sent massive traffic straight to the MySQL database, overwhelming CPU, I/O, and Redis. The fix is to stagger cache TTLs (e.g., a 24 h base plus a random 0-3600 s offset) so that entries do not all expire at the same moment.
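The staggered TTL fits in a few lines. The 24 h base and 3600 s jitter follow the numbers above; the class name is illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;

// Staggered-TTL helper: a fixed base plus uniform random jitter, so keys
// written at the same time do not all expire in the same instant.
public class CacheTtl {

    static final long BASE_TTL_SECONDS = 24 * 3600;   // 24 h base
    static final long MAX_JITTER_SECONDS = 3600;      // up to 1 h of jitter

    /** TTL in seconds: base plus uniform random jitter in [0, 3600). */
    public static long randomizedTtlSeconds() {
        return BASE_TTL_SECONDS
                + ThreadLocalRandom.current().nextLong(MAX_JITTER_SECONDS);
    }
}
```

The returned value would then be passed as the expiry argument when writing each key (for Redis, the seconds argument of SETEX or EXPIRE).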
Failure 5: Disk I/O Causing Thread Blocking
Intermittent latency spikes were traced to threads blocked in Logback logging calls. A shell script captured jstack output every five seconds, rotating the log file every 100 snapshots. Analyzing the captured stacks showed heavy synchronous logging as the bottleneck; switching to asynchronous logging resolved the issue.
#!/bin/bash
# Capture a jstack snapshot of the target JVM every 5 seconds,
# rotating the output file every 100 snapshots.
num=0
log="/tmp/jstack_thread_log/thread_info"
mkdir -p /tmp/jstack_thread_log
while ((num <= 10000)); do
    # PID of the java process whose command line contains "gaea"
    ID=$(ps -ef | grep java | grep gaea | grep -v "grep" | awk '{print $2}')
    if [ -n "$ID" ]; then
        jstack "$ID" >> "${log}"
    fi
    num=$((num + 1))
    # Every 100 snapshots, rotate the log to a numbered file
    if ((num % 100 == 0)); then
        mv "${log}" "${log}${num}"
    fi
    sleep 5
done
Failure 6: Database Deadlock
Two concurrent SQL operations—an hourly job canceling unpaid orders (locking created_time index then primary key) and a manual batch cancel (locking primary key directly)—acquired locks in opposite orders, causing a classic deadlock. Aligning lock acquisition order or processing cancellations one by one eliminates the deadlock.
update t_order set status='CANCELLED' where created_time > '2020-01-01 08:00:00' and created_time < '2020-01-01 10:00:00' and status='UNPAID';
update t_order set status='CANCELLED' where id in (2,3,5) and status='UNPAID';
Failure 7: DNS Hijacking
Serving CDN assets over plain HTTP left them open to DNS hijacking, and product images were replaced with ads in transit. Switching CDN resources to HTTPS mitigated the attack, though additional safeguards such as backup domains are recommended.
Failure 8: Bandwidth Exhaustion
A promotional campaign caused a sudden surge in QR‑code generation, overwhelming network bandwidth because each QR code is an image. Moving QR‑code generation to the client side (Android, iOS, React SDKs) offloaded bandwidth and CPU from the server.
All cases illustrate practical troubleshooting techniques—monitoring, heap dumps, log analysis, SQL lock ordering, cache TTL design, and client‑side offloading—that can help engineers quickly resolve production incidents.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.