8 Real-World Production Failures and How to Diagnose Them Quickly

The article shares eight authentic production incident cases—from frequent JVM Full GC and memory leaks to cache avalanches, DNS hijacking, and database deadlocks—detailing their root causes, diagnostic steps, code snippets, and practical remediation strategies for engineers facing similar challenges.

ITPUB

Apr 7, 2021

Fault 1: Frequent Full GC in JVM

A Full GC is usually triggered by a memory leak, an infinite loop, or large objects, with large objects accounting for over 80% of cases. Large objects often originate from databases (e.g., massive result sets), third‑party APIs, or oversized messages in queues. In this incident, a POP service began running Full GC frequently even though no new release had been deployed. The standard approach, dumping the heap with jmap -dump:format=b,file=heap.bin [pid] and analyzing it in MAT, proved too slow. Monitoring database network I/O at the same time showed that the GC surge coincided with a spike in DB traffic, which was traced to a query that returned millions of rows: a required parameter was missing, so the dynamic WHERE clause in the mapper below degraded to where 1=1 and selected the entire table. Adding the missing condition fixed the issue within five minutes.

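<!-- Every filter is optional: if the caller passes none, the statement
     degrades to "select * from user where 1=1" and returns the whole table. -->
<select id="selectOrders" resultType="com.***.Order">
    select * from user where 1=1
    <if test="orderID != null">
        and order_id = #{orderID}
    </if>
    <if test="userID != null">
        and user_id = #{userID}
    </if>
    <if test="startTime != null">
        <!-- comparison operators are escaped so the mapper stays valid XML -->
        and create_time &gt;= #{startTime}
    </if>
    <if test="endTime != null">
        and create_time &lt;= #{endTime}
    </if>
</select>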
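
One way to keep such a statement from ever running without a filter is to validate the parameters before the mapper is called. The following is a minimal sketch, not the fix the team actually applied: the `OrderQuery` class, its field types, and the `requireFiltered` helper are all hypothetical names introduced for illustration.

```java
import java.util.Objects;
import java.util.stream.Stream;

/** Hypothetical parameter object for the selectOrders statement. */
final class OrderQuery {
    final Long orderId;
    final Long userId;
    final Long startTime; // epoch millis; the representation is an assumption
    final Long endTime;

    OrderQuery(Long orderId, Long userId, Long startTime, Long endTime) {
        this.orderId = orderId;
        this.userId = userId;
        this.startTime = startTime;
        this.endTime = endTime;
    }

    /** True when at least one filter is set, so the WHERE clause narrows the result. */
    boolean hasAnyFilter() {
        return Stream.of(orderId, userId, startTime, endTime).anyMatch(Objects::nonNull);
    }

    /** Rejects a query that would otherwise degrade to "where 1=1". */
    static OrderQuery requireFiltered(OrderQuery query) {
        if (query == null || !query.hasAnyFilter()) {
            throw new IllegalArgumentException("selectOrders requires at least one filter");
        }
        return query;
    }
}
```

Calling `OrderQuery.requireFiltered(query)` at the service boundary turns a silent full-table scan into an immediate, visible error.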

Fault 2: Memory Leak

A memory leak differs from an out‑of‑memory (OOM) error: a leak gradually increases heap usage without failing immediately, and only ends in an OOM error once the heap limit is reached. On monitoring graphs, the tell‑tale sign is that the heap usage remaining after each GC keeps climbing. In this incident, a custom local cache stored product data; a refactor removed its 7‑day expiration, so the cache grew until it held every product and exhausted the heap. A heap dump taken with jmap -dump:format=b,file=heap.bin [pid] and analyzed in MAT identified the cache as the culprit. Restoring the expiration and restarting the nodes resolved the leak.
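
To illustrate why expiration matters for a local cache, here is a minimal sketch of a cache that both expires entries and caps its size. It is not the team's actual cache: the `ExpiringCache` class, the size cap, and the injectable clock are assumptions added for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Sketch of a local cache bounded by both entry age and entry count. */
final class ExpiringCache<K, V> {
    /** A cached value together with the time at which it stops being valid. */
    private static final class Timed<T> {
        final T value;
        final long expiresAtMillis;

        Timed(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final LongSupplier clock;
    private final LinkedHashMap<K, Timed<V>> map;

    ExpiringCache(long ttlMillis, final int maxEntries, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // Access-ordered map: once maxEntries is exceeded, the least recently
        // used entry is evicted, so the cache can never hold "all products".
        this.map = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(K key, V value) {
        map.put(key, new Timed<>(value, clock.getAsLong() + ttlMillis));
    }

    /** Returns the cached value, or null if it is absent or has expired. */
    synchronized V get(K key) {
        Timed<V> entry = map.get(key);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            map.remove(key); // drop the stale entry so it can be collected
            return null;
        }
        return entry.value;
    }

    synchronized int size() {
        return map.size();
    }
}
```

For the 7‑day expiration described above, the cache could be created as `new ExpiringCache<>(Duration.ofDays(7).toMillis(), maxEntries, System::currentTimeMillis)`, where `maxEntries` is a limit chosen to fit the heap.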

Fault 3: Idempotency Issue

In an e‑commerce points service, duplicate messages from the order system caused the same order to be credited multiple times. The fix introduced an idempotent record table: before adding points, the service checks whether a record for the order already exists, and adds points only when no record is found. This pattern ensures that repeating an operation leaves the same final state as performing it once, a principle also required for payment APIs.
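
The record-table check can be sketched in a few lines. This is an illustration under assumptions, not the service's real code: the `PointsService` class is hypothetical, and an in-memory set stands in for the idempotent record table. Note that the sketch uses a single atomic insert rather than a separate check followed by an insert, so that two concurrent duplicates cannot both pass the check; in a database, a unique key on the order ID would typically play that role.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of an idempotent points credit keyed by order ID. */
final class PointsService {
    // Stands in for the idempotent record table (assumption: one record per order ID).
    private final Set<String> processedOrders = ConcurrentHashMap.newKeySet();
    private final Map<String, Long> pointsByUser = new ConcurrentHashMap<>();

    /** Returns true if points were credited, false if the order was already processed. */
    boolean creditPoints(String orderId, String userId, long points) {
        // add() is atomic: only the first caller for a given orderId sees true.
        if (!processedOrders.add(orderId)) {
            return false; // duplicate message, nothing to do
        }
        pointsByUser.merge(userId, points, Long::sum);
        return true;
    }

    long balance(String userId) {
        return pointsByUser.getOrDefault(userId, 0L);
    }
}
```

Delivering the same order message twice then credits the user only once, which is the behavior the record table is meant to guarantee.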

Fault 4: Cache Avalanche

During a user‑system redesign, cached user profiles were bulk‑loaded into Redis with a uniform 24‑hour TTL. Because every entry had been written at the same time, all of them expired at the same moment 24 hours later, and the resulting wave of cache misses hit the database with a massive surge of queries and CPU/IO spikes: a cache avalanche. The mitigation is to randomize each entry’s expiration (e.g., 24 hours + 0‑3600 seconds) so that cache misses are staggered and the database never sees a synchronized load.
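
The randomized expiration is simple to compute. The following sketch assumes the caller passes the resulting TTL to whatever Redis client it uses; the `CacheTtl` class and `withJitter` method are hypothetical names.

```java
import java.time.Duration;
import java.util.Random;

/** Sketch of TTL randomization to stagger cache expirations. */
final class CacheTtl {
    /**
     * Returns base plus a random number of whole seconds in [0, maxJitter],
     * so entries written together do not expire together.
     */
    static Duration withJitter(Duration base, Duration maxJitter, Random random) {
        int maxSeconds = Math.toIntExact(maxJitter.getSeconds());
        // nextInt(bound) excludes the bound, so add 1 to allow the full jitter.
        long jitterSeconds = random.nextInt(maxSeconds + 1);
        return base.plusSeconds(jitterSeconds);
    }
}
```

For the values in the text, `CacheTtl.withJitter(Duration.ofHours(24), Duration.ofSeconds(3600), new Random())` yields a TTL between 24 and 25 hours for each entry.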

Fault 5: Disk I/O Causing Thread Blocking

Intermittent response slowdowns lasting several seconds were traced to threads blocked on synchronous logback I/O. Because the slowdowns were too short to catch by hand, the team wrote the script below to capture a jstack thread dump every five seconds, storing up to 20,000 logs. Analysis of the captured dumps revealed threads stuck in log output. Switching to asynchronous logging eliminated the blockage.

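#!/bin/bash
# Capture a jstack thread dump every 5 seconds and rotate the output file
# after every 100 dumps so that no single file grows too large.
num=0
log_dir="/tmp/jstack_thread_log"
log="${log_dir}/thread_info"
mkdir -p "$log_dir"
while (( num <= 10000 )); do
    # PID of the target Java process ("gaea" is the service name in this
    # incident); if several processes match, only the first is dumped.
    pid=$(ps -ef | grep java | grep gaea | grep -v "grep" | awk '{print $2}' | head -n 1)
    if [ -n "$pid" ]; then
        jstack "$pid" >> "$log"
    fi
    num=$(( num + 1 ))
    # Every 100 iterations, move the current file aside
    # (thread_info100, thread_info200, ...), if one was written.
    if (( num % 100 == 0 )) && [ -f "$log" ]; then
        mv "$log" "${log}${num}"
    fi
    sleep 5
done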
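
Once the dumps are collected, they still have to be searched. The sketch below shows one way to count, per dump file, how many threads have a logback frame on their stack. It is an assumption about how such an analysis could be done, not the team's tooling; the `ThreadDumpScan` class is hypothetical. It relies only on the standard jstack layout, in which each thread section starts with a line beginning with a double quote and each stack frame line starts with `at`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Sketch: count threads in a jstack dump whose stack contains a given frame prefix. */
final class ThreadDumpScan {
    static int countThreadsWithFrame(List<String> dumpLines, String framePrefix) {
        int count = 0;
        boolean inThread = false;
        boolean matched = false;
        for (String line : dumpLines) {
            if (line.startsWith("\"")) {
                // A new thread header closes the previous thread section.
                if (inThread && matched) {
                    count++;
                }
                inThread = true;
                matched = false;
            } else if (inThread && line.trim().startsWith("at " + framePrefix)) {
                matched = true;
            }
        }
        if (inThread && matched) {
            count++; // the last thread section has no following header
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ThreadDumpScan <dump-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        // "ch.qos.logback" is the package prefix of logback's classes.
        System.out.println(countThreadsWithFrame(lines, "ch.qos.logback"));
    }
}
```

A count that jumps during a slowdown window and stays near zero otherwise points at logging as the place where request threads are waiting.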

Fault 6: Database Deadlock

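In MySQL InnoDB, the primary key is a clustered index, and an update that finds rows through a secondary index locks the secondary index entries first and then the matching primary key records. Two jobs cancelled orders in this system: a scheduled task that cancels unpaid orders older than one hour, and a manual admin tool that cancels specific orders. The scheduled task locked the created_time secondary index first and then the primary key records in the order 5→4→3→2, while the manual tool locked primary keys directly in the order 2→3→5. Because the two transactions acquired locks on the same rows in opposite orders, each ended up waiting for a lock the other held: a classic deadlock. Resolving it requires consistent lock ordering or breaking batch cancellations into single‑row operations.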
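
Both remedies can be combined: resolve the order IDs first, then cancel them one row at a time in ascending primary key order, so every caller touches rows in the same sequence. The following is a minimal sketch under that assumption; the `OrderCanceller` class and the `OrderDao` interface are hypothetical, and each `cancelById` call is assumed to update a single row by primary key in its own short transaction.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Sketch: cancel orders one row at a time, always in ascending primary key order. */
final class OrderCanceller {
    /** Hypothetical data-access interface; one call updates one row by primary key. */
    interface OrderDao {
        void cancelById(long orderId);
    }

    private final OrderDao dao;

    OrderCanceller(OrderDao dao) {
        this.dao = dao;
    }

    void cancelAll(Collection<Long> orderIds) {
        List<Long> sorted = new ArrayList<>(orderIds);
        // Ascending primary key is the single lock order every caller agrees on.
        Collections.sort(sorted);
        for (long id : sorted) {
            dao.cancelById(id);
        }
    }
}
```

With this approach the scheduled task would process 2, 3, 4, 5 and the manual tool 2, 3, 5, so neither can hold a later row while waiting for an earlier one.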

Fault 7: DNS Hijacking

DNS hijacking redirects domain resolution to malicious IPs, causing users to see unrelated ads or fail to reach the intended site. An example showed product images replaced by advertisement images after the CDN’s HTTP links were hijacked. Switching to HTTPS and maintaining backup domains mitigated the impact.
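
On the application side, the switch can be as small as rewriting image URLs before they are rendered. The sketch below is an illustration only: the `ImageUrls` class, the `toHttps` method, and the idea of passing a backup host as a parameter are assumptions, not details from the incident.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Sketch: force HTTPS on an image URL and optionally swap in a backup domain. */
final class ImageUrls {
    /**
     * Rewrites the URL to use https. If backupHost is non-null it replaces the
     * original host. Any explicit port is dropped, because a port given for
     * plain HTTP is unlikely to be the right one for HTTPS.
     */
    static String toHttps(String url, String backupHost) {
        URI original = URI.create(url); // throws IllegalArgumentException if malformed
        String host = backupHost != null ? backupHost : original.getHost();
        try {
            return new URI("https", original.getUserInfo(), host, -1,
                    original.getPath(), original.getQuery(), original.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot rewrite URL: " + url, e);
        }
    }
}
```

With HTTPS, a client that is directed to the wrong server will fail certificate validation instead of silently displaying substituted content, and the backup host gives the page somewhere else to load images from while the primary domain is affected.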

Fault 8: Bandwidth Exhaustion

During a massive promotion, server‑side QR‑code generation flooded the network, exhausting bandwidth and slowing page responses. The solution moved QR‑code generation to client‑side SDKs (Android, iOS, React), reducing server CPU and bandwidth consumption.

The shared cases stem from the author’s 15 years of e‑commerce backend experience and aim to help engineers handle similar production incidents efficiently.

