
Why “High Availability” Often Fails: Lessons from a Messaging System Disaster

A real‑world incident with ActiveMQ’s high‑availability setup shows that focusing on component reliability without business‑level capacity planning, monitoring, and graceful degradation can cripple services, highlighting that true high availability must prioritize overall system and user experience.

Efficient Ops

Aug 9, 2021


Background

Recently a friend’s company suffered a major outage caused by running ActiveMQ in high‑availability mode (a master‑slave architecture in which a message is acknowledged only after both nodes have written it). During peak traffic the producer side became congested, many requests were never persisted, and the data became inconsistent.

Their application demanded zero message loss, a bar higher than telecom standards. They met that bar, but the service became unusable.

The problem started with a senior manager who, after test data disappeared in a non‑production environment, blamed the messaging system and escalated the concern to a “what if the power goes out” scenario.

The architecture team stayed up late debating alternatives, from Kafka to RocketMQ and from persistent databases to StoreHA, and finally settled on an extremely high‑availability solution. It satisfied the manager’s requirement but reduced overall business throughput to below that of a single database.

This is a typical failure of optimization driven by a “gun‑barrel” demand.

Thoughts

High availability is often a false premise. Despite familiar theories such as CAP, many decision‑makers fall into this trap, and architects who follow them into it are making a basic mistake.

High availability means business availability, not just component availability.

For message queues, guaranteeing that the queue survives and that messages are reliable is not enough. You must also consider the producer and consumer topology, what happens when a producer crashes, how buffers are processed, message overload when throughput is low, consumer backlog, and dead‑letter handling.
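Two of the concerns above, consumer backlog and dead‑letter handling, can be sketched in a few lines. This is only an in‑memory illustration, not ActiveMQ’s API: the names BoundedQueue, offer, and drain are hypothetical, and a real broker would persist both the queue and the dead letters.

```python
from collections import deque


class BoundedQueue:
    """In-memory stand-in for a broker queue with a hard capacity limit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def offer(self, message):
        # Reject instead of blocking the producer when the backlog is full,
        # so the caller can degrade or throttle explicitly.
        if len(self.items) >= self.capacity:
            return False
        self.items.append(message)
        return True


def drain(queue, handler, max_attempts=3):
    """Process every queued message; park repeated failures as dead letters."""
    dead_letters = []
    while queue.items:
        message = queue.items.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                break
            except Exception:
                # After the last attempt, set the message aside so it
                # cannot block the messages queued behind it.
                if attempt == max_attempts:
                    dead_letters.append(message)
    return dead_letters
```

The point of the sketch is that a full backlog and a poison message both get an explicit outcome (a rejected offer, a dead‑letter entry) rather than silently stalling producers or consumers.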

If you have not performed capacity analysis, lack scaling mechanisms, and do not monitor critical points, the blame falls on you.

First ensure the business works, then worry about reliability.

When the business cannot run, even the most reliable components are useless. In a flash‑sale system you would use partially reliable caches for buffering rather than insisting on full reliability.

The main flow must not be blocked: degrade reliability instead (circuit breaking).

For example, a checkout system should not wait for backend accounting logic before returning success; it should complete payment first and handle failures asynchronously, setting reasonable timeouts.
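A minimal sketch of that checkout flow, assuming hypothetical callables charge and post_to_accounting and an in‑memory pending_accounting list standing in for a durable retry queue:

```python
import concurrent.futures
import time

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
pending_accounting = []  # hypothetical stand-in for a durable retry queue


def checkout(order_id, charge, post_to_accounting, timeout_s=0.05):
    """Complete the payment first; never let accounting block the response."""
    charge(order_id)  # main flow: an error here fails the checkout
    future = _executor.submit(post_to_accounting, order_id)
    try:
        future.result(timeout=timeout_s)
    except Exception:
        # Timeout or accounting error: degrade and record the order for an
        # asynchronous retry. The timed-out call may still finish in the
        # background, so the accounting side must tolerate duplicates.
        pending_accounting.append(order_id)
    return "success"
```

The timeout value is arbitrary here; in practice it would come from the latency the checkout page can tolerate, not from how long accounting usually takes.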

Another example: long‑running operations inside a transaction are dangerous.

If you can’t handle it, throttle instead of forcing through.

Rate limiting rejects requests at the outer layer. Users may see failures, but the system avoids the far more harmful outcome of corrupted data.
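One common way to throttle at the edge is a token bucket. The article does not name a specific algorithm, so the following is just one possible sketch; the class name and the rate_per_s and burst parameters are illustrative.

```python
import time


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock  # injectable so the limiter can be tested
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Out of tokens: reject now rather than queue the request.
        return False
```

A request handler would call allow() before doing any work and return a “busy, try again” response when it is False, so excess traffic never reaches the message queue or the database.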

Data loss is unacceptable, but you must be able to recover.

Distributed systems rely on eventual consistency, often involving manual steps or customer‑service intervention. Common recovery methods include detailed logs, idempotent business logic, and retry or scan mechanisms.

Comprehensive logs for replay.

Idempotent operations, so that at‑least‑once delivery (retries and duplicates) is safe.

Scanning and retry for abnormal data.
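The three methods above fit together: the log records what should have happened, idempotency makes replaying it safe, and a scan applies whatever is missing. A toy sketch, assuming each message carries a unique ID (Ledger and scan_and_retry are hypothetical names, and a real system would keep the applied IDs in durable storage):

```python
class Ledger:
    """Toy account store that applies each message at most once."""

    def __init__(self):
        self.balance = 0
        self.applied_ids = set()

    def apply(self, message_id, amount):
        # Idempotency check: a redelivered message is a no-op.
        if message_id in self.applied_ids:
            return False
        self.balance += amount
        self.applied_ids.add(message_id)
        return True


def scan_and_retry(log, ledger):
    """Replay the log; only entries not yet applied change any state."""
    replayed = 0
    for message_id, amount in log:
        if ledger.apply(message_id, amount):
            replayed += 1
    return replayed
```

Because apply ignores IDs it has already seen, the scan can run as often as needed, for example after a crash or on a schedule, without double‑counting.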

Don’t just talk about high availability.

Talking about HA is not impressive in itself. Analyze the mindset of whoever proposed the requirement and the consequences of each technical choice; leadership may be overlooking two‑thirds of the real issues.

Distributed systems are complex; fixing one component does not solve the whole system.

End

Counter‑examples provoke useful thinking. Identify “gun‑barrel” versus “megaphone” demands, stick to professional judgment, and share this article with skeptics.
