How to Build High‑Availability Systems: Lessons from a Transaction Platform Evolution

This article shares practical insights on achieving high availability by understanding goals, decomposing requirements, designing resilient architectures, ensuring operability, testing rigorously, and reducing release risk, illustrated through the multi‑stage evolution of a transaction system.

21CTO

Understanding Availability

Industry targets for high availability are often expressed in "nines" (e.g., 99.999%). Choosing a realistic target for each system starts with understanding its user scale and usage scenarios, because different systems warrant different availability goals.
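To make a "nines" target concrete, it helps to convert it into a yearly downtime budget. The following is a minimal sketch; the function name `downtime_budget_minutes` and the 365-day year are assumptions made for illustration, not details from the platform described here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes, for a target such as 99.99."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, so a five-nines target leaves only a few minutes per year for all incidents combined.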

Decomposing the Goal

Availability goals should be broken down into two concrete sub‑goals:

Low failure frequency – minimize the number of incidents.

Fast recovery time – restore service within minutes when failures occur.
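The two sub-goals combine arithmetically: total downtime is roughly the number of incidents multiplied by the average time to recover. The sketch below uses hypothetical incident counts and recovery times to show how either lever moves the result; none of the numbers come from the platform itself.

```python
def achieved_availability(incidents_per_year: int, minutes_to_recover: float) -> float:
    """Availability (as a percentage) from incident count and mean recovery time."""
    minutes_per_year = 365 * 24 * 60  # assumes a 365-day year
    downtime = incidents_per_year * minutes_to_recover
    return 100 * (1 - downtime / minutes_per_year)


# Hypothetical scenarios: fewer incidents or faster recovery both raise availability.
print(achieved_availability(incidents_per_year=12, minutes_to_recover=30))
print(achieved_availability(incidents_per_year=12, minutes_to_recover=5))
print(achieved_availability(incidents_per_year=2, minutes_to_recover=5))
```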

Designing High Availability

High‑availability design is an iterative process driven by business changes. The stages below trace the evolution of one transaction platform as a concrete example.

Infancy (pre‑2012)

Goal: satisfy business requirements and launch quickly. The team, mainly .NET developers, built a simple system with low traffic; issues were handled by restarts, scaling, or rollbacks.

Youth (2012‑2013) – Vertical Splitting

Goal: improve development efficiency and isolate failures. As traffic grew from thousands to tens of thousands per day, services were split vertically (e.g., product page, order, payment) and isolated via caching and static rendering. Disaster‑recovery sites were deployed, though coordination mechanisms were initially lacking.

Adolescence (2014‑2015) – Service Miniaturization

Goal: support rapid business growth with efficient, highly available technical capabilities. The monolithic product service was refactored into many small services (inventory, pricing, base data, etc.). This solved product‑page issues but shifted pressure to the order system, which later underwent a full micro‑service transformation.

Adulthood (2015‑present) – Horizontal Splitting

Goal: support massive promotional events with tens of thousands of QPS and millions of daily orders. In 2015, the order system was horizontally sharded across 32 databases with 32 tables each (1,024 tables in total), leaving headroom for future scaling.
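One common way to address a layout of 32 databases with 32 tables each is to derive both indexes from a single shard key. The sketch below assumes a numeric user ID as the key and uses hypothetical names (`order_db_NN`, `order_NN`); the source does not say which key or naming scheme the platform actually used.

```python
from typing import Tuple

DB_COUNT = 32
TABLES_PER_DB = 32
TOTAL_TABLES = DB_COUNT * TABLES_PER_DB  # 1,024 logical tables


def locate_shard(user_id: int) -> Tuple[str, str]:
    """Map a shard key to a (database, table) pair.

    The shard key and the naming scheme are assumptions for illustration.
    """
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    slot = user_id % TOTAL_TABLES        # 0..1023
    db_index = slot // TABLES_PER_DB     # 0..31
    table_index = slot % TABLES_PER_DB   # 0..31
    return f"order_db_{db_index:02d}", f"order_{table_index:02d}"


print(locate_shard(123456789))
```

Because every request with the same key lands in the same table, a single user's orders can be read without fanning out across databases.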

Operational Practices

High‑availability systems must be operable. Key operational requirements include:

Rate limiting – fast‑fail when traffic exceeds capacity.

Statelessness – enables easy scaling.

Graceful degradation – provide fallback UI/UX when downstream services fail.
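Two of these requirements, rate limiting and graceful degradation, can be sketched together. The example below is an illustration under stated assumptions: a token bucket is one common rate-limiting technique (the source does not name the one used), and `handle_request` and `fetch_recommendations` are hypothetical names.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # fast-fail: the caller rejects instead of queueing


def handle_request(bucket: TokenBucket, fetch_recommendations):
    """Reject over-capacity traffic; degrade when a downstream call fails."""
    if not bucket.try_acquire():
        return {"status": 429, "body": "busy, please retry"}
    try:
        items = fetch_recommendations()
    except Exception:
        items = []  # degraded response: render the page without this module
    return {"status": 200, "body": items}
```

The bucket here keeps its counter in process memory, which is fine for a per-instance limit; a limit shared across stateless instances would need the counter in a shared store.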

Testing

Testing validates that the architecture can handle predicted traffic peaks. It involves estimating traffic, modeling load, and measuring end‑to‑end request paths (e.g., number of database accesses per order) to ensure capacity.
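The capacity arithmetic behind such a test can be sketched as follows. All inputs are hypothetical placeholders rather than measurements from the platform, and the `safety_factor` parameter is an assumption added for illustration.

```python
import math


def required_db_qps(peak_orders_per_second: float, db_accesses_per_order: int,
                    safety_factor: float = 1.5) -> float:
    """Database QPS needed at peak, padded by a safety margin."""
    return peak_orders_per_second * db_accesses_per_order * safety_factor


def instances_needed(total_qps: float, qps_per_instance: float) -> int:
    """Round up: a fractional instance still means one more machine."""
    return math.ceil(total_qps / qps_per_instance)


# Hypothetical inputs, not measurements from the platform.
load = required_db_qps(peak_orders_per_second=2000, db_accesses_per_order=8)
print(load, instances_needed(load, qps_per_instance=3000))
```

Counting database accesses per order matters because the multiplier applies to every order at peak, so removing one access from the request path lowers the required capacity proportionally.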

Reducing Release Risk

Strict release processes and gray‑deployment mechanisms are essential. Releases are rolled out in stages (10%, 30%, 50%, 100%) with monitoring to verify correctness before proceeding. Traffic‑based gray releases and easy rollback plans are also required.
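A staged rollout needs a stable way to decide which requests see the new version. The sketch below assumes bucketing by a hash of the user ID, which is one common approach and not necessarily the mechanism the team used; `bucket_for` and `uses_new_version` are hypothetical names.

```python
import hashlib

STAGES = (10, 30, 50, 100)  # rollout percentages from the article


def bucket_for(user_id: str) -> int:
    """Map a user to a stable bucket in 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def uses_new_version(user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout when their bucket falls below the current percentage."""
    return bucket_for(user_id) < rollout_percent


for percent in STAGES:
    sample = [f"user-{i}" for i in range(10_000)]
    share = sum(uses_new_version(u, percent) for u in sample) / len(sample)
    print(f"stage {percent}%: {share:.1%} of sample users on the new version")
```

Because the bucket depends only on the user ID, a user admitted at 10% stays on the new version at 30% and 50%, and setting the percentage back to 0 acts as a rollback for everyone.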

Rapid Incident Response

Fast detection and visualized monitoring make it possible to locate failures within minutes. The team relies on mobile alerts, alarms that fire in under a minute, real‑time dashboards, and per‑service instrumentation to detect problems quickly and diagnose them rapidly.
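Sub-minute detection usually comes down to evaluating a metric over a short sliding window. The following sketch is an assumption about how such an alarm could work, not the team's monitoring stack; the class name, the 5% threshold, and the minimum request count are all illustrative.

```python
from collections import deque


class ErrorRateAlarm:
    """Fire when the error rate over the last `window_seconds` exceeds `threshold`."""

    def __init__(self, window_seconds: float = 60.0, threshold: float = 0.05,
                 min_requests: int = 20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests  # avoids firing on a handful of requests
        self.events = deque()             # (timestamp, is_error) pairs, oldest first
        self.errors = 0

    def record(self, timestamp: float, is_error: bool) -> bool:
        """Record one request; return True if the alarm should fire."""
        self.events.append((timestamp, is_error))
        self.errors += int(is_error)
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            _, was_error = self.events.popleft()
            self.errors -= int(was_error)
        total = len(self.events)
        return total >= self.min_requests and self.errors / total > self.threshold
```

Keeping one alarm per service mirrors the per-service instrumentation described above: when an alarm fires, its name already narrows down where to look.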

Key Lessons

Value every real traffic peak and build accurate traffic models.

Conduct thorough post‑mortems to elevate problem understanding and solutions.

Availability is not only a technical issue; it requires cross‑functional collaboration.

Single points of failure and uncontrolled releases are the biggest enemies of availability.
