Eight Real-World Online Failure Cases and Their Resolution Strategies

The article presents eight authentic production incidents—including JVM Full GC, memory leaks, idempotency flaws, cache avalanches, disk‑I/O thread blocking, MySQL deadlocks, DNS hijacking, and bandwidth exhaustion—detailing their causes, diagnostics, and practical remediation steps for engineers.

DevOps

During technical interviews, candidates are often asked about system failures. This article shares eight genuine production incidents drawn from fifteen years of internet R&D experience, to help readers answer such questions and sharpen their real-world troubleshooting skills.

Fault 1: Frequent JVM Full GC – Frequent Full GC can be triggered by memory leaks, infinite loops, or very large objects, and the large objects often come from oversized database result sets. In one case, a caller omitted a required parameter, so the corresponding filter was dropped from the query and it returned tens of thousands of rows. The problematic MyBatis mapper was:

<select id="selectOrders" resultType="com.***.Order">
  <!-- Every filter is optional: if no parameter is supplied, the whole table is selected -->
  select * from user where 1=1
  <if test="orderID != null">
    and order_id = #{orderID}
  </if>
  <if test="userID != null">
    and user_id = #{userID}
  </if>
  <if test="startTime != null">
    and create_time >= #{startTime}
  </if>
  <if test="endTime != null">
    <!-- "<" must be escaped inside an XML mapper -->
    and create_time &lt;= #{endTime}
  </if>
</select>

After analyzing heap dumps and correlating database I/O spikes, the team identified the large query as the root cause and fixed it by adding proper parameter validation, reducing GC time to about five minutes.
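The article does not say which parameter was required, so the following is only a minimal sketch of what such validation might look like. It assumes a hypothetical `OrderQuery` object whose fields mirror the mapper parameters, and it assumes the rule "at least one filter must be present"; both are illustrative choices, not the team's actual fix.

```java
import java.time.LocalDateTime;

// Hypothetical query object; field names mirror the mapper parameters.
final class OrderQuery {
    final Long orderID;
    final Long userID;
    final LocalDateTime startTime;
    final LocalDateTime endTime;

    OrderQuery(Long orderID, Long userID, LocalDateTime startTime, LocalDateTime endTime) {
        this.orderID = orderID;
        this.userID = userID;
        this.startTime = startTime;
        this.endTime = endTime;
    }

    // Assumed rule: reject a query that would reach the database with no filter at all.
    void validate() {
        if (orderID == null && userID == null && startTime == null && endTime == null) {
            throw new IllegalArgumentException("at least one filter is required");
        }
        if (startTime != null && endTime != null && startTime.isAfter(endTime)) {
            throw new IllegalArgumentException("startTime must not be after endTime");
        }
    }
}
```

Calling `validate()` before the mapper is invoked keeps an unfiltered query from ever being sent. A row limit on the statement itself would be a complementary safeguard.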

Fault 2: Memory leak – Unlike an out‑of‑memory error, a memory leak raises usage gradually and does not crash the process immediately. A local cache that stored all product data without expiration caused the JVM heap to grow until an alert fired. Using jmap and Eclipse MAT, the team traced the leak to the cache, and adding a 7‑day TTL resolved the issue.
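To make the TTL idea concrete, here is a small sketch of a local cache whose entries expire. The `TtlCache` class and its injectable clock are hypothetical and exist only for illustration; the article does not say which cache implementation was used.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical local cache in which every entry carries an expiry timestamp.
final class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be tested without sleeping

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        map.put(key, new Entry<>(value, clock.getAsLong() + ttlMillis));
    }

    // Returns null for a missing or expired key; an expired entry is removed on access.
    V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() >= e.expiresAtMillis) {
            map.remove(key, e);
            return null;
        }
        return e.value;
    }

    // Removes expired entries that are never read again; meant to be called periodically.
    void evictExpired() {
        long now = clock.getAsLong();
        map.values().removeIf(e -> now >= e.expiresAtMillis);
    }

    int size() {
        return map.size();
    }
}
```

A 7-day cache would be created as `new TtlCache<>(Duration.ofDays(7).toMillis(), System::currentTimeMillis)`. Expiry on read alone does not free entries that are never requested again, which is why the sketch also has `evictExpired()`; a maximum size bound is another common safeguard.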

Fault 3: Idempotency problem – Duplicate message consumption in a points service led to users receiving multiple credits. The fix was to introduce an idempotent check: before adding points, query a points‑record table for the order ID and only proceed if no record exists.
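The check can be sketched as follows. The `PointsService` class is hypothetical, and in-memory maps stand in for the points-record table and the user balances. One deliberate difference from the description above: instead of a separate query followed by an insert, the sketch uses an atomic `putIfAbsent`, because a check and an insert done as two steps can race when the same message is consumed concurrently. In a real database, a unique index on the order ID would be one way to get the same guarantee; that is an assumption, not something the article states.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical points service; the maps stand in for database tables.
final class PointsService {
    // Stand-in for the points-record table, keyed by order ID.
    private final Map<String, Integer> pointsRecords = new ConcurrentHashMap<>();
    private final Map<String, Integer> balances = new ConcurrentHashMap<>();

    // Returns true if points were credited, false if this order was already processed.
    boolean addPoints(String orderId, String userId, int points) {
        // putIfAbsent is atomic: only the first caller for a given order ID gets null back.
        if (pointsRecords.putIfAbsent(orderId, points) != null) {
            return false; // duplicate message, nothing to do
        }
        balances.merge(userId, points, Integer::sum);
        return true;
    }

    int balanceOf(String userId) {
        return balances.getOrDefault(userId, 0);
    }
}
```

Consuming the same message twice then leaves the balance unchanged after the first call.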

Fault 4: Cache avalanche – After a user‑system redesign, many cache entries expired simultaneously, causing a sudden surge of database queries and CPU spikes. The solution is to stagger cache TTLs by adding a random offset (e.g., 24 hours + 0‑3600 seconds) to each entry.
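A minimal sketch of the staggered TTL, using the figures from the example above (a 24-hour base plus 0 to 3600 seconds of jitter). The `JitteredTtl` helper is a hypothetical name, and how the resulting duration is passed to the cache depends on the cache client in use.

```java
import java.time.Duration;
import java.util.Random;

// Hypothetical helper that spreads expirations over a one-hour window.
final class JitteredTtl {
    static final Duration BASE = Duration.ofHours(24);
    static final int MAX_JITTER_SECONDS = 3600;

    // Returns 24 hours plus a random offset of 0..3600 seconds (inclusive).
    static Duration next(Random random) {
        // nextInt(bound) excludes the bound, hence the + 1.
        int jitterSeconds = random.nextInt(MAX_JITTER_SECONDS + 1);
        return BASE.plusSeconds(jitterSeconds);
    }
}
```

Entries written at the same moment then expire at different times within the hour, so the database sees a trickle of reloads rather than one burst.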

Fault 5: Disk I/O causing thread blocking – Intermittent slow responses were traced to threads blocked on synchronous logback writes. A shell script automatically captured jstack snapshots every five seconds, revealing the bottleneck; switching to asynchronous logging eliminated the blockage.
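Logback provides an `AsyncAppender` for this purpose, and that is what a real service would normally configure. The sketch below is not logback code; it only illustrates the principle with a hypothetical `AsyncLineWriter`: request threads hand lines to a bounded queue and return immediately, while a single background thread performs the slow write.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical illustration of asynchronous logging; not the logback implementation.
final class AsyncLineWriter implements AutoCloseable {
    // Distinct String instance used as a stop marker and compared by identity.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue;
    private final Thread worker;

    AsyncLineWriter(int capacity, Consumer<String> sink) {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(capacity);
        this.queue = q;
        this.worker = new Thread(() -> {
            try {
                for (String line = q.take(); line != STOP; line = q.take()) {
                    sink.accept(line); // the slow disk write happens on this thread only
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-line-writer");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    // Never blocks the caller; returns false if the queue is full and the line is dropped.
    boolean log(String line) {
        return queue.offer(line);
    }

    // Drains the lines already queued, then stops the worker.
    @Override
    public void close() {
        try {
            queue.put(STOP);
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The trade-off is visible in `log`: when the disk cannot keep up, lines are dropped instead of stalling request threads. Whether dropping or blocking is preferable depends on how important the log lines are.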

Fault 6: Database deadlock – A scheduled task updating orders by created_time (using a secondary index) and a manual batch cancel operation (using the primary key) acquired locks in opposite orders, leading to deadlock. Aligning lock acquisition order or processing cancellations one row at a time resolves the conflict.
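The "one row at a time" option might look roughly like the following JDBC sketch. The `CancelPlanner` class is hypothetical, and the sketch assumes that `id` is the numeric primary key of `t_order`. Each update runs in its own auto-committed transaction and touches a single row by primary key, and the IDs are processed in ascending order so that concurrent cancel batches visit rows in the same sequence.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for cancelling orders row by row.
final class CancelPlanner {
    static final String SQL =
            "update t_order set status = 'CANCELLED' where id = ? and status = 'UNPAID'";

    // Deduplicates the IDs and returns them in ascending order.
    static List<Long> executionOrder(List<Long> ids) {
        return new ArrayList<>(new TreeSet<>(ids));
    }

    // Cancels each order in its own transaction; returns the number of rows updated.
    static int cancelOneByOne(Connection conn, List<Long> ids) throws SQLException {
        boolean previous = conn.getAutoCommit();
        conn.setAutoCommit(true); // every update commits, and releases its locks, on its own
        int cancelled = 0;
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (Long id : executionOrder(ids)) {
                ps.setLong(1, id);
                cancelled += ps.executeUpdate();
            }
        } finally {
            conn.setAutoCommit(previous);
        }
        return cancelled;
    }
}
```

Because no transaction holds more than one row's locks at a time, the batch gives up atomicity: if it fails halfway, the rows already cancelled stay cancelled.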

The two statements involved were:

update t_order set status = 'CANCELLED' where created_time > '2020-01-01 08:00:00' and created_time < '2020-01-01 10:00:00' and status = 'UNPAID';

update t_order set status = 'CANCELLED' where id in (2, 3, 5) and status = 'UNPAID';

Fault 7: DNS hijacking – An attack redirected domain resolution, causing product images to load from an advertisement server. Switching to HTTPS for CDN resources mitigated the hijack, and maintaining backup domains provides additional resilience.

Fault 8: Bandwidth exhaustion – A promotional event caused a massive spike in QR‑code generation, saturating outbound bandwidth. Moving QR‑code generation to client‑side SDKs (Android, iOS, React) offloaded both bandwidth and CPU from the server.

Written by

DevOps
