
8 Real‑World Production Failures and How to Diagnose Them Quickly

The article shares eight concrete production incidents—from JVM Full GC spikes and memory leaks to cache avalanches, deadlocks, DNS hijacking and bandwidth exhaustion—detailing their root causes, step‑by‑step diagnostics, code snippets, monitoring tricks and practical remediation measures for engineers.

dbaplus Community

Jan 21, 2021


Fault 1: JVM Frequent Full GC

Full GC spikes often originate from a sudden influx of large objects, most commonly massive database result sets. In one production incident, a POP service began running Full GC repeatedly without any new deployment. The usual approach, taking a heap dump with jmap -dump:format=b,file=<file> <pid> and analyzing it in MAT, was too slow.

Monitoring the database server's network I/O at the same time showed a clear I/O spike aligned with the GC timeline, which pointed to a large query. The DBA identified the offending SQL: the caller had passed none of the optional filter parameters, so the statement that actually ran was:

SELECT * FROM user WHERE 1=1

The MyBatis mapper fragment that produced the query was:

<select id="selectOrders" resultType="com.xxx.Order">
  select * from user where 1=1
  <if test="orderID != null"> and order_id = #{orderID}</if>
  <if test="userID != null"> and user_id = #{userID}</if>
  <if test="startTime != null"> and create_time >= #{startTime}</if>
  <if test="endTime != null"> and create_time &lt;= #{endTime}</if>
</select>

Adding the missing parameters reduced the result set to a few rows and eliminated the Full GC within minutes.
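One way to stop an unfiltered query before it reaches the database is to reject an empty filter in the calling code. The following is a minimal sketch, not the fix described in the incident: the OrderQuery class and its methods are hypothetical, and only the field names are taken from the mapper parameters above.

```java
import java.time.LocalDateTime;

// Hypothetical query object; field names mirror the mapper parameters.
final class OrderQuery {
    final Long orderID;
    final Long userID;
    final LocalDateTime startTime;
    final LocalDateTime endTime;

    OrderQuery(Long orderID, Long userID, LocalDateTime startTime, LocalDateTime endTime) {
        this.orderID = orderID;
        this.userID = userID;
        this.startTime = startTime;
        this.endTime = endTime;
    }

    // True when at least one condition would be appended after "where 1=1".
    boolean hasAnyFilter() {
        return orderID != null || userID != null || startTime != null || endTime != null;
    }

    // Call before invoking the mapper: fail fast instead of scanning the whole table.
    void requireAnyFilter() {
        if (!hasAnyFilter()) {
            throw new IllegalArgumentException(
                "selectOrders requires at least one filter parameter");
        }
    }
}
```

A hard row limit in the SQL itself would be a complementary safeguard, since it also bounds the result set when a filter is present but too broad.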

Fault 2: Memory Leak in a Custom Local Cache

A locally‑implemented cache for product data originally set a 7‑day TTL on each entry. After a refactor the TTL was removed, causing the cache to retain all product records and eventually exhaust heap memory.

JVM monitoring showed a steady increase in heap usage after each GC, confirming a leak. A heap dump obtained with jmap -dump:format=b,file=heap.hprof <pid> was analyzed with Eclipse MAT, revealing millions of cached product objects.

Resolution: re‑introduce a TTL (e.g., 7 days) for cache entries and restart the affected nodes.
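As an illustration of a per-entry TTL, here is a minimal sketch of a local cache that expires entries on read and through a periodic sweep. It is not the service's actual cache: the TtlCache class, its method names, and the injectable clock are all assumptions made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Illustrative local cache with a fixed TTL per entry (hypothetical class).
final class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be tested

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        map.put(key, new Entry<>(value, clock.getAsLong() + ttlMillis));
    }

    // Returns the cached value, or null if the key is absent or expired.
    V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() >= e.expiresAtMillis) {
            map.remove(key, e); // remove only if it is still this expired entry
            return null;
        }
        return e.value;
    }

    // Drops expired entries that are never read again; run it from a scheduled task.
    void evictExpired() {
        long now = clock.getAsLong();
        map.values().removeIf(e -> now >= e.expiresAtMillis);
    }

    int size() {
        return map.size();
    }
}
```

In production the clock would be System::currentTimeMillis and the TTL seven days. A TTL alone does not cap memory if the key space keeps growing, so a size bound, or a library cache that offers both, is worth considering.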

Fault 3: Idempotency Issue in a Points Service

Duplicate order‑completion messages caused the same order to be credited multiple times. The fix was to make the points addition idempotent:

Before adding points, query a points_record table for the order ID.

If no record exists, insert a new record and add the points; otherwise skip the addition.

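With this pattern, repeated processing of the same message does not change the final state. Because two consumers can run the check at the same moment, the points_record table should also carry a unique index on the order ID so that the second insert fails instead of crediting the points twice.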
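The check-then-insert steps can be sketched in memory as follows. This is a simplified model under stated assumptions, not the service's code: PointsService and its methods are hypothetical, a ConcurrentHashMap stands in for the points_record table, and putIfAbsent plays the role that a unique key on the order ID would play in a real database.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory model of idempotent points crediting.
final class PointsService {
    // Stands in for points_record; the map key acts like a unique index on order ID.
    private final ConcurrentHashMap<String, Integer> pointsRecord = new ConcurrentHashMap<>();
    // Stands in for per-user point balances.
    private final ConcurrentHashMap<String, AtomicLong> balances = new ConcurrentHashMap<>();

    // Returns true if points were credited, false if the order was already processed.
    boolean onOrderCompleted(String orderId, String userId, int points) {
        // putIfAbsent is atomic: only the first message for an order sees null.
        if (pointsRecord.putIfAbsent(orderId, points) != null) {
            return false;
        }
        balances.computeIfAbsent(userId, k -> new AtomicLong()).addAndGet(points);
        return true;
    }

    long balanceOf(String userId) {
        AtomicLong balance = balances.get(userId);
        return balance == null ? 0L : balance.get();
    }
}
```

Against a real database, the record insert and the balance update would need to run in one transaction so that a failure between the two does not leave a record without the matching points.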

Fault 4: Cache Avalanche (Mass Expiration)

A cache‑initialization job refreshed all user data at once, giving every entry the same expiration timestamp. When all of those entries expired at the same moment, a flood of cache misses hit MySQL, causing CPU spikes, high I/O wait, and a collapse of the Redis hit rate.

Mitigation strategy:

Assign each entry a base TTL (e.g., 24 h) plus a random offset (0–3600 s).

This staggers expirations, preventing a simultaneous surge of database queries.
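The TTL calculation can be sketched as below, using the base and offset values from the text. The JitteredTtl class and its method names are hypothetical, chosen only for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper that computes a staggered cache TTL.
final class JitteredTtl {
    static final long BASE_TTL_SECONDS = 24 * 60 * 60; // 24 h base TTL
    static final long MAX_JITTER_SECONDS = 3600;       // random offset of 0 to 3600 s

    // Returns a TTL in the range [baseSeconds, baseSeconds + maxJitterSeconds].
    static long ttlSeconds(long baseSeconds, long maxJitterSeconds) {
        // nextLong(bound) excludes the bound, so add 1 to make the maximum reachable.
        return baseSeconds + ThreadLocalRandom.current().nextLong(maxJitterSeconds + 1);
    }

    static long ttlSeconds() {
        return ttlSeconds(BASE_TTL_SECONDS, MAX_JITTER_SECONDS);
    }
}
```

The returned value would be passed to whatever expiry call the Redis client exposes (the SETEX or EXPIRE command) each time an entry is written, including during the bulk initialization job.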

Fault 5: Disk I/O Causing Thread Blocking in Logback

Intermittent slow responses were traced to threads blocked while writing Logback logs. Manual jstack collection was impractical, so an automated shell script was created to capture a thread dump every 5 seconds and rotate logs.

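#!/bin/bash
# Capture a jstack thread dump every 5 seconds; start a new file every 100 dumps.
log_dir="/tmp/jstack_thread_log"
log="${log_dir}/thread_info"
mkdir -p "$log_dir"
num=0
while ((num <= 10000)); do
  # PID of the target JVM ("gaea" is the service name); take the first match only.
  ID=$(ps -ef | grep java | grep gaea | grep -v "grep" | awk '{print $2}' | head -n 1)
  if [ -n "$ID" ]; then
    jstack "$ID" >> "$log"
  fi
  num=$((num + 1))
  # Rotate after every 100 iterations, but only if a dump file was actually written.
  if ((num % 100 == 0)) && [ -f "$log" ]; then
    mv "$log" "${log}${num}"
  fi
  sleep 5
done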

After reducing Logback verbosity and switching to asynchronous logging, the thread‑blocking disappeared.

Fault 6: Database Deadlock Due to Lock‑Order Inversion

Two concurrent operations on t_order caused a deadlock:

Hourly scheduled task: cancels unpaid orders within a time window, locking the non‑clustered index on created_time first, then the primary key.

Admin batch‑cancel: directly updates rows by primary key, locking the primary key first.

SQL for the scheduled task:

UPDATE t_order SET status='CANCELLED'
WHERE created_time > '2020-01-01 08:00:00'
  AND created_time < '2020-01-01 10:00:00'
  AND status='UNPAID';

SQL for the admin batch cancel:

UPDATE t_order SET status='CANCELLED'
WHERE id IN (2,3,5) AND status='UNPAID';

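Because the two statements acquire locks in opposite orders, each transaction can end up holding a lock the other one needs. InnoDB detects the deadlock and rolls back one of the transactions. Resolution approaches: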

Standardize lock acquisition order (e.g., always lock the primary key first).

Rewrite batch updates as a series of single‑row statements to avoid overlapping lock ranges.
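A sketch of how both approaches might combine: each caller reduces its work to a list of primary keys, sorts them, and issues one single-row update per ID. This assumes the scheduled task first reads its candidate IDs with a plain SELECT on created_time and then updates by primary key, which is an assumption rather than something the incident report states. The CancelPlanner class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: gives every caller the same row order for cancellations.
final class CancelPlanner {
    // Single-row statement, executed once per ID with the ID bound to the placeholder.
    static final String CANCEL_ONE_SQL =
        "UPDATE t_order SET status='CANCELLED' WHERE id = ? AND status='UNPAID'";

    // Deduplicates and sorts IDs so all callers touch rows in ascending primary-key order.
    static List<Long> orderedIds(List<Long> ids) {
        return new ArrayList<>(new TreeSet<>(ids));
    }
}
```

Running each single-row statement in its own short transaction keeps the lock footprint small, at the cost of more round trips and of the batch no longer being atomic.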

Fault 7: DNS Hijacking

An attacker compromised DNS resolution, causing requests to www.baidu.com to be redirected to malicious IPs. In the incident, product images served over HTTP from a CDN were replaced with advertisement images.

Mitigations:

Serve all CDN resources over HTTPS to prevent man‑in‑the‑middle tampering.

Deploy DNSSEC where supported.

Maintain backup domains that can be switched to instantly.

Fault 8: Bandwidth Exhaustion from Server‑Side QR‑Code Generation

A promotional campaign caused a sudden surge in QR‑code generation. Each QR code was generated server‑side as an image, saturating outbound bandwidth and slowing the entire site.

Solution: move QR‑code generation to the client (Android, iOS, or React SDKs). This offloads both bandwidth and CPU consumption to the user’s device.

Result: bandwidth usage returned to normal and server CPU load dropped significantly.

These eight cases illustrate practical troubleshooting techniques: monitoring external dependencies, rapid heap‑dump analysis, lock‑order awareness, idempotent design, staggered cache expiration, automated thread‑dump collection, and security hardening.
