How to Build High‑Availability Systems: Lessons from a Transaction Platform Evolution
This article shares practical insights on achieving high availability by understanding goals, decomposing requirements, designing resilient architectures, ensuring operability, testing rigorously, and reducing release risk, illustrated through the multi‑stage evolution of a transaction system.
Understanding Availability
Industry targets for high availability are often expressed in "nines" (e.g., 99.999%). Before designing for such a target, establish each system's user scale, usage scenarios, and specific availability goal.
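To make the "nines" concrete, the following minimal sketch converts an availability target into a yearly downtime budget. The function name and the 365-day year are illustrative assumptions, not something the original article specifies.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the downtime per year that an availability target allows."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, which is why the target has to be chosen per system rather than applied uniformly.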
Decomposing the Goal
Availability goals should be broken down into two concrete sub‑goals:
Low failure frequency – minimize the number of incidents.
Fast recovery time – restore service within minutes when failures occur.
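The two sub-goals multiply: yearly downtime is roughly the number of incidents times the mean time to recover. The sketch below illustrates this with made-up numbers; the function name and the sample figures are assumptions for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def yearly_availability(incidents_per_year: float,
                        mean_recovery_minutes: float) -> float:
    """Availability in percent, given failure frequency and recovery time."""
    downtime = incidents_per_year * mean_recovery_minutes
    return 100 * (1 - downtime / MINUTES_PER_YEAR)


# Hypothetical figures: halving either factor halves the total downtime.
baseline = yearly_availability(incidents_per_year=12, mean_recovery_minutes=30)
fewer_incidents = yearly_availability(incidents_per_year=6, mean_recovery_minutes=30)
faster_recovery = yearly_availability(incidents_per_year=12, mean_recovery_minutes=15)

if __name__ == "__main__":
    print(f"baseline:        {baseline:.4f}%")
    print(f"fewer incidents: {fewer_incidents:.4f}%")
    print(f"faster recovery: {faster_recovery:.4f}%")
```

Because both levers have the same effect on the final number, a team can pick whichever is cheaper to improve at a given stage.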
Designing High Availability
High‑availability design is an iterative process driven by business changes. The evolution of a transaction platform is used as a concrete example.
Infancy (pre‑2012)
Goal: satisfy business requirements and launch quickly. The team, mainly .NET developers, built a simple system with low traffic; issues were handled by restarts, scaling, or rollbacks.
Youth (2012‑2013) – Vertical Splitting
Goal: improve development efficiency and isolate failures. As traffic grew from thousands to tens of thousands per day, services were split vertically (e.g., product page, order, payment) and isolated via caching and static rendering. Disaster‑recovery sites were deployed, though coordination mechanisms were initially lacking.
Adolescence (2014‑2015) – Service Miniaturization
Goal: support rapid business growth with efficient, highly available technical capabilities. The monolithic product service was refactored into many small services (inventory, pricing, base data, etc.). This solved product‑page issues but shifted pressure to the order system, which later underwent a full micro‑service transformation.
Adulthood (2015‑present) – Horizontal Splitting
Goal: support massive promotional events with tens of thousands of QPS and millions of daily orders. In 2015, the order system was horizontally sharded across 32 databases with 32 tables each (1,024 tables in total), leaving room for future scaling.
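The article does not say how orders were routed to the 1,024 tables. As one possible sketch, assuming a numeric user ID as the sharding key and simple modulo routing (both assumptions, as are the `order_db_` and `orders_` naming patterns), the mapping could look like this:

```python
from typing import Tuple

DATABASES = 32
TABLES_PER_DATABASE = 32
TOTAL_SHARDS = DATABASES * TABLES_PER_DATABASE  # 1,024


def locate_shard(user_id: int) -> Tuple[int, int]:
    """Map a user ID to (database index, table index within that database)."""
    slot = user_id % TOTAL_SHARDS
    return slot // TABLES_PER_DATABASE, slot % TABLES_PER_DATABASE


def qualified_table_name(user_id: int) -> str:
    """Build a hypothetical 'database.table' name for the user's orders."""
    database, table = locate_shard(user_id)
    return f"order_db_{database:02d}.orders_{table:02d}"


if __name__ == "__main__":
    print(qualified_table_name(1025))
```

Routing by a single key keeps all of one user's orders in the same table, at the cost of needing a separate path for queries that cut across users.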
Operational Practices
High‑availability systems must be operable. Key operational requirements include:
Rate limiting – fast‑fail when traffic exceeds capacity.
Statelessness – enables easy scaling.
Graceful degradation – provide fallback UI/UX when downstream services fail.
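The rate-limiting and degradation requirements above can be sketched together. This is a minimal single-threaded illustration using a token bucket, which is one common limiter and not necessarily the one the team used; `handle_request`, `load_recommendations`, and the response shape are hypothetical names.

```python
import time


class TokenBucket:
    """Allow up to `capacity` burst requests, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last_refill = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_request(limiter, load_recommendations):
    """Fast-fail when over capacity; degrade when a downstream call fails."""
    if not limiter.try_acquire():
        return {"status": 429, "recommendations": []}
    try:
        recommendations = load_recommendations()
    except Exception:
        # Fallback: serve the page without the optional section.
        recommendations = []
    return {"status": 200, "recommendations": recommendations}
```

The handler keeps no per-user state of its own, so adding instances only requires deciding whether the limit applies per instance or globally. A production limiter would also need locking or a shared store.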
Testing
Testing validates that the architecture can handle predicted traffic peaks. It involves estimating traffic, modeling load, and measuring end‑to‑end request paths (e.g., number of database accesses per order) to ensure capacity.
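As a rough illustration of the capacity arithmetic, the sketch below turns a predicted order peak and a measured number of database accesses per order into a required database throughput. All figures, the headroom factor, and the function names are assumptions for the example, not numbers from the platform.

```python
import math


def required_db_qps(peak_orders_per_second: float,
                    db_accesses_per_order: float,
                    headroom: float = 1.5) -> float:
    """Database queries per second needed at peak, with a safety margin."""
    return peak_orders_per_second * db_accesses_per_order * headroom


def instances_needed(required_qps: float, qps_per_instance: float) -> int:
    """Number of database instances, rounded up."""
    return math.ceil(required_qps / qps_per_instance)


if __name__ == "__main__":
    # Hypothetical inputs: 2,000 orders/s, 10 accesses per order.
    needed = required_db_qps(peak_orders_per_second=2000, db_accesses_per_order=10)
    print(f"required database QPS: {needed:.0f}")
    print(f"instances at 4,000 QPS each: {instances_needed(needed, 4000)}")
```

The per-order access count is the number worth measuring carefully, because any error in it is multiplied by the full peak rate.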
Reducing Release Risk
Strict release processes and gray‑deployment mechanisms are essential. Releases are rolled out in stages (10%, 30%, 50%, 100%) with monitoring to verify correctness before proceeding. Traffic‑based gray releases and easy rollback plans are also required.
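One way to implement the staged percentages is to hash a stable identifier into a bucket from 0 to 99 and compare it with the current rollout percentage. The sketch below assumes a string user ID as that identifier; the function names are hypothetical, and the article does not describe the team's actual mechanism.

```python
import hashlib

STAGES = (10, 30, 50, 100)  # rollout percentages from the release process


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def uses_new_version(user_id: str, rollout_percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket_for(user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10000)]
    for stage in STAGES:
        share = sum(uses_new_version(u, stage) for u in users) / len(users)
        print(f"stage {stage:3d}% -> {share:.1%} of sample users on new version")
```

Because the bucket is derived only from the ID, a user who sees the new version at 10% keeps seeing it at 30% and beyond, and setting the percentage back to 0 acts as an immediate rollback.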
Rapid Incident Response
Failures must be detected in under a minute and located within minutes. The team relies on sub‑minute alarms pushed to mobile devices for detection, and on real‑time dashboards with per‑service instrumentation for rapid diagnosis.
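A sub-minute alarm can be approximated with a sliding window over recent requests. The sketch below is one simple approach under assumed thresholds (a 60-second window, a 5% error rate, and a minimum sample size); the class and parameter names are hypothetical and do not describe the team's monitoring stack.

```python
from collections import deque


class ErrorRateAlarm:
    """Alert when the error rate within a sliding time window is too high."""

    def __init__(self, window_seconds=60.0, threshold=0.05, min_requests=20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self.events = deque()  # (timestamp, is_error), oldest first

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def record(self, timestamp, is_error):
        self.events.append((timestamp, is_error))
        self._evict(timestamp)

    def should_alert(self, now):
        self._evict(now)
        total = len(self.events)
        if total < self.min_requests:
            return False  # too little traffic to judge
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / total >= self.threshold
```

The minimum sample size keeps a single failed request during a quiet period from paging anyone, which matters if alarms go straight to mobile devices.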
Key Lessons
Treat every real traffic peak as valuable data and use it to build accurate traffic models.
Conduct thorough post‑mortems to deepen the understanding of each problem and improve its solution.
Availability is not only a technical issue; it requires cross‑functional collaboration.
Single points of failure and uncontrolled releases are the biggest enemies of availability.
21CTO
