Designing a Highly Available Transaction System: Real‑World Evolution

This article examines how a large-scale e-commerce transaction platform achieved high availability through iterative architectural evolution, from an early .NET monolith through vertical splits and microservices to horizontal sharding, and highlights practical strategies for fault detection, rapid recovery, scaling, and operations.

21CTO
High availability describes how a system keeps its service availability rate high and how it responds to failures: timely detection, failover, and rapid recovery. This article uses the evolution of Dianping's transaction system to illustrate how high availability is achieved. High availability is an outcome; the focus should be on iterative improvement in step with business development.

Understanding Availability

Industry targets for high availability are expressed in nines, and the right target varies per system. Engineers must know the user scale, usage scenarios, and availability target of their system. For example, five nines (99.999%) allows only about five minutes of downtime per year.
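The arithmetic behind the nines is easy to check. The sketch below converts a number of nines into a yearly downtime budget; the function name and the 365.25-day year are illustrative assumptions, not something the original article specifies.

```python
def downtime_budget_minutes(nines: int, days_per_year: float = 365.25) -> float:
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    minutes_per_year = days_per_year * 24 * 60
    return minutes_per_year * unavailability


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, which is why the target should be chosen per system rather than copied from another team.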

Goals can be decomposed into two sub‑goals: (1) low frequency – reduce the number of failures, and (2) fast recovery – shorten the time to restore service after a failure.
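One common way to see why both sub-goals matter is the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recover. The original article does not state this formula; the sketch below is an illustration with made-up numbers.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: one failure every 30 days (720 hours).
    slow = availability(720, 1.0)          # one hour to recover
    fast = availability(720, 5.0 / 60.0)   # five minutes to recover
    rarer = availability(1440, 1.0)        # half as many failures, one hour to recover
    print(f"slow recovery:  {slow:.5%}")
    print(f"fast recovery:  {fast:.5%}")
    print(f"fewer failures: {rarer:.5%}")
```

Failing less often raises MTBF and recovering faster lowers MTTR; either change improves the ratio.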

Low Frequency: Reducing Failure Occurrence

Design should evolve with business changes. The evolution of Dianping's transaction system illustrates this:

Infancy (before 2012)

Mission: meet business requirements and launch quickly. The first generation used .NET because the team was familiar with it. Simplicity was prioritized; failures were tolerable due to low traffic.

Childhood (vertical split, 2012‑2013)

Mission: improve development efficiency and isolate failures. As order volume grew to tens of thousands per day, services were vertically split to keep teams small and isolate domains such as product display, order, and payment. Some redundancy (e.g., dual data‑center payment service) was added, though early implementations lacked coordinated updates.

Youth (micro‑service refactor, 2014‑2015)

Mission: support rapid business growth with efficient, highly available technology. The monolithic Deal-service was broken into many small services (inventory, pricing, base data, etc.). The order and payment systems were also split into microservices, resulting in hundreds of services and dozens of databases that handled millions of orders per day.

Adulthood (horizontal split, 2015‑present)

Mission: sustain massive promotional events, supporting tens of thousands of QPS and tens of millions of daily orders. During the 2015 "917 Foodie Festival" peak, the order table was sharded into 1,024 tables across 32 databases, eliminating the primary data‑center bottleneck.
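A layout of 1,024 tables across 32 databases implies 32 tables per database. The sketch below shows one way a request could be routed to a shard. The article does not say which sharding key or hash function Dianping used, so the choice of `user_id` as the key, CRC32 as the hash, and the contiguous table-to-database mapping are all assumptions made for illustration.

```python
import zlib
from typing import Tuple

NUM_DATABASES = 32
NUM_TABLES = 1024
TABLES_PER_DB = NUM_TABLES // NUM_DATABASES  # 32 tables in each database


def route_order(user_id: int) -> Tuple[int, int]:
    """Return (database_index, table_index) for a sharding key.

    Hypothetical scheme: hash the key into one of 1,024 table slots, then
    place tables 0-31 in database 0, tables 32-63 in database 1, and so on.
    """
    slot = zlib.crc32(str(user_id).encode("utf-8")) % NUM_TABLES
    return slot // TABLES_PER_DB, slot


if __name__ == "__main__":
    db, table = route_order(123456789)
    print(f"user 123456789 -> database {db}, table {table}")
```

Because the mapping is a pure function of the key, any stateless service instance can compute the shard without consulting a central lookup.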

Remaining single points include message queues, network, and data‑center links. Real incidents: undetected NIC failures, cache server contention with monitoring infrastructure, and underestimated message‑queue capacity during the 917 promotion.

Future Direction

Continue the principle of "big systems become small, core channels become big, traffic is partitioned": break complex systems into single‑responsibility services, expand core communication frameworks, and segment user traffic into dedicated clusters.

Operational Practices for High Availability

Rate Limiting: Implement fast-fail mechanisms (e.g., when traffic spikes to 10,000 QPS, admit only 5,000 and reject the rest immediately) and show rejected users a friendly fallback message.
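A minimal sketch of the fast-fail idea, using a fixed one-second window counter. This is one simple technique among several (token bucket and sliding window are common alternatives) and is not necessarily what the original system used; the class name, the HTTP-style status codes, and the message text are illustrative assumptions.

```python
import threading
import time
from typing import Callable, Tuple


class FixedWindowLimiter:
    """Admit at most max_qps requests per one-second window; reject the rest at once."""

    def __init__(self, max_qps: int, clock: Callable[[], float] = time.monotonic):
        self.max_qps = max_qps
        self._clock = clock              # injectable so the limiter can be tested
        self._lock = threading.Lock()
        self._window = int(clock())      # the second currently being counted
        self._count = 0

    def allow(self) -> bool:
        with self._lock:
            now = int(self._clock())
            if now != self._window:      # a new second has started: reset the counter
                self._window = now
                self._count = 0
            if self._count >= self.max_qps:
                return False             # fail fast instead of queueing
            self._count += 1
            return True


def handle_request(limiter: FixedWindowLimiter) -> Tuple[int, str]:
    """Hypothetical handler: rejected requests get a friendly message, not a timeout."""
    if not limiter.allow():
        return 429, "The service is busy right now. Please try again in a moment."
    return 200, "order accepted"


if __name__ == "__main__":
    fake_now = [0.0]  # frozen clock: all 10,000 requests land in the same second
    limiter = FixedWindowLimiter(max_qps=5000, clock=lambda: fake_now[0])
    statuses = [handle_request(limiter)[0] for _ in range(10000)]
    print(statuses.count(200), "accepted,", statuses.count(429), "rejected")
```

A fixed window can admit a burst at the boundary between two seconds, so a production limiter would likely need something smoother. The point here is only that excess requests are rejected immediately rather than left to pile up.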

Statelessness: Keep services stateless to enable easy scaling and traffic routing.

Degradation Capability: Design graceful degradation with clear UI cues (e.g., disable a payment button and show alternative options when a channel fails).
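As a rough illustration of the payment example, the sketch below builds the list of payment options a page might render from the health of each channel. The channel names, field names, and message wording are hypothetical, and how channel health is detected is outside the scope of this sketch.

```python
from typing import Dict, List


def payment_options(channel_up: Dict[str, bool]) -> List[dict]:
    """Build the payment options to render, disabling channels that are down."""
    options = []
    for name, up in channel_up.items():
        hint = "" if up else f"{name} is temporarily unavailable. Please choose another payment method."
        options.append({"channel": name, "enabled": up, "hint": hint})
    # Stable sort: usable channels first, so the user sees an alternative immediately.
    options.sort(key=lambda option: not option["enabled"])
    return options


if __name__ == "__main__":
    for option in payment_options({"wallet": False, "card": True, "bank": True}):
        print(option)
```

The failed channel stays visible but disabled, so users understand what happened and are steered toward a working option instead of hitting an error after clicking.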

Testability: Estimate traffic for major events, test cluster capacity, and evaluate both peak and normal traffic models.

Reducing Release Risk:

Strict release process with developer‑owned deployment via a platform.

Gray‑release strategy (10%, 30%, 50%, 100%) with real‑time monitoring.

Rollback capability and predefined worst‑case plans.
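The gray-release stages listed above can be sketched as a deterministic bucketing function: each user is hashed into one of 100 buckets, and a stage of N% enables the new version for buckets below N. The hashing scheme and names are assumptions for illustration; the article does not describe how the release platform selects traffic.

```python
import zlib

STAGES = (10, 30, 50, 100)  # rollout percentages from the release process


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user should receive the new version at the given stage."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100  # stable bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10000)]
    for percent in STAGES:
        enabled = sum(in_rollout(u, percent) for u in users)
        print(f"stage {percent:>3}%: {enabled} of {len(users)} users on the new version")
```

Because a user's bucket never changes, anyone included at 10% is still included at 30% and beyond, which keeps the experience consistent while monitoring decides whether to advance or roll back.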

Fast Incident Response: Aim for 1-minute detection, 3-minute localization, and recovery within a total of 5 minutes. This requires effective monitoring visualization, cross-team communication, and automated alerting via mobile channels.

Recovery Mechanisms: Use rollback, restart, scaling, and server replacement; for high-traffic scenarios, focus on traffic control and graceful degradation.

Key Takeaways

Capture real peak‑traffic data to build accurate traffic models.

Conduct thorough post‑mortem analyses using the "5W" method.

High availability is not solely a technical issue; it requires collaboration among development, DBA, operations, and product teams.

Eliminate single points through vertical/horizontal splitting, redundancy, and robust release practices.