High‑Availability Architecture for a Billion‑Scale Membership System: Dual‑Center ES, Redis, and MySQL Solutions
This article details the design and implementation of a highly available, high‑performance membership system serving over a billion users, covering dual‑center Elasticsearch clusters, traffic‑isolated three‑cluster ES architecture, Redis dual‑center caching, MySQL partitioned clusters, migration strategies, and refined flow‑control and degradation mechanisms.
Introduction
The membership system is a core service tightly coupled with the order flow of all business lines; any failure blocks user ordering across the entire company, so it must guarantee high performance and high availability.
After the merger of Tongcheng and eLong, the system needed to unify member data across multiple platforms (apps, mini‑programs) and handle rapidly increasing request volumes, reaching over 20,000 TPS during peak holidays.
Elasticsearch High‑Availability Solution
Dual‑Center Primary‑Backup Cluster
Two data centers (A and B) host separate ES clusters; the primary cluster runs in A, the backup in B. Writes go to the primary, and data is synchronized to the backup via MQ. If the primary fails, traffic is switched to the backup within seconds. Once the primary recovers, the backup's data is synchronized back to it before traffic returns.
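The switch‑over flow can be sketched in a few lines. This is a minimal in‑memory illustration, not the production implementation: EsCluster, DualCenterRouter, and the pending list standing in for the MQ are hypothetical names, and it assumes failover is triggered explicitly.

```python
class EsCluster:
    """In-memory stand-in for an Elasticsearch cluster client (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.docs = {}

    def index(self, member_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.docs[member_id] = doc

    def get(self, member_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.docs.get(member_id)


class DualCenterRouter:
    """Serves traffic from the active cluster and replicates to the standby."""

    def __init__(self, primary, backup):
        self.primary, self.backup = primary, backup
        self.active = primary
        self.pending = []  # stands in for the MQ that carries changes across centers

    def write(self, member_id, doc):
        self.active.index(member_id, doc)
        self.pending.append((member_id, doc))

    def read(self, member_id):
        return self.active.get(member_id)

    def replicate(self):
        """Drain queued changes into whichever cluster is not active."""
        standby = self.backup if self.active is self.primary else self.primary
        while self.pending:
            member_id, doc = self.pending[0]
            standby.index(member_id, doc)  # raises if the standby is down; message stays queued
            self.pending.pop(0)

    def fail_over(self):
        self.active = self.backup

    def fail_back(self):
        self.replicate()  # catch the primary up before it takes traffic again
        self.active = self.primary
```

Writes made while the backup is active stay queued until the primary is reachable, so fail_back can replay them before moving traffic back.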
Traffic‑Isolation Three‑Cluster Architecture
Requests are classified into two priority groups: (1) order‑critical requests, which must always be served, and (2) marketing‑driven high‑TPS requests, which can be isolated. A dedicated ES cluster handles the marketing burst traffic, preventing it from affecting the primary order‑critical cluster. Together with the primary and backup clusters, this makes three clusters.
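A small routing table is enough to illustrate the idea. The cluster names and caller identifiers below are hypothetical, and treating unknown callers as order‑critical is an assumption rather than something the system is known to do.

```python
ORDER_CRITICAL = "order_critical"
MARKETING = "marketing"

# Hypothetical cluster names, one per traffic class.
CLUSTER_FOR_CLASS = {
    ORDER_CRITICAL: "es-primary",
    MARKETING: "es-marketing",
}

# Hypothetical caller IDs known to generate marketing bursts.
MARKETING_CALLERS = {"flash-sale", "blind-box"}


def classify(caller):
    """Assign a caller to a traffic class (unknown callers default to order-critical)."""
    return MARKETING if caller in MARKETING_CALLERS else ORDER_CRITICAL


def pick_cluster(caller):
    """Return the name of the ES cluster that should serve this caller."""
    return CLUSTER_FOR_CLASS[classify(caller)]
```

Because the decision depends only on the caller, a marketing burst can saturate its own cluster without ever reaching the one that serves orders.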
Deep ES Optimizations
Balanced shard distribution to avoid hotspot nodes.
Thread‑pool size limited to cpu_core * 3 / 2 + 1 to prevent CPU thrashing.
Shard size kept under 50 GB.
Removed unnecessary text fields, keeping only keyword for member lookups.
Used filter instead of query to skip relevance scoring.
Performed result sorting in the member service JVM to reduce ES load.
Added routing keys to target specific shards, cutting unnecessary broadcast queries.
These tweaks dramatically reduced CPU usage and improved query latency.
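Several of these optimizations are easy to show concretely. The sketch below builds request bodies as plain dictionaries instead of calling a client library; the field name mobile and the helper names are hypothetical.

```python
def search_pool_size(cpu_cores):
    """Thread-pool cap from the list above: cpu_core * 3 / 2 + 1, with integer division."""
    return cpu_cores * 3 // 2 + 1


def member_lookup(mobile):
    """Exact-match lookup on a keyword field, placed in filter context so no score is computed."""
    return {"query": {"bool": {"filter": [{"term": {"mobile": mobile}}]}}}


def search_params(member_id):
    """Request parameters; `routing` restricts the search to the shard that owns this key."""
    return {"routing": str(member_id)}


def sort_in_service(hits, key):
    """Sort hits in the member service rather than asking ES to sort them."""
    return sorted(hits, key=lambda hit: hit[key])
```

On an 8‑core node, search_pool_size(8) evaluates to 13. The routing value only helps if documents were indexed with the same routing key.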
Redis Caching Strategy
Initially the member system avoided caching due to real‑time consistency requirements, but a sudden spike from a ticket‑blind‑box event prompted the introduction of a cache.
Resolving ES‑to‑Redis Inconsistency
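Because ES is near‑real‑time (≈1 s delay), a read issued just after an update can fetch the old document from ES and write it back to Redis as if it were current. The solution takes a 2‑second distributed lock on the member ID when ES is updated and then deletes the cached entry. A read that finds the lock held returns the ES result without writing it to Redis, so concurrent reads cannot overwrite fresh data with stale data.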
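The locking scheme can be sketched with an in‑memory stand‑in for Redis. FakeRedis, the key prefixes, and the injected clock are hypothetical, and the sketch assumes that a reader which sees the lock simply skips the cache write.

```python
class FakeRedis:
    """In-memory stand-in for Redis with key expiry (hypothetical helper)."""

    def __init__(self, clock):
        self.clock = clock  # callable returning the current time in seconds
        self.data = {}      # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires = self.clock() + ttl if ttl is not None else None
        self.data[key] = (value, expires)

    def get(self, key):
        value, expires = self.data.get(key, (None, None))
        if expires is not None and self.clock() >= expires:
            del self.data[key]
            return None
        return value

    def delete(self, key):
        self.data.pop(key, None)


LOCK_TTL = 2.0  # seconds; longer than the ES refresh delay of about 1 s


def update_member(redis, es_index, member_id, doc):
    """Lock the member, write to ES, then drop the cached copy."""
    redis.set(f"lock:{member_id}", "1", ttl=LOCK_TTL)
    es_index(member_id, doc)
    redis.delete(f"member:{member_id}")


def read_member(redis, es_search, member_id):
    """Serve from cache; on a miss, cache the ES result only if no update is in flight."""
    cached = redis.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es_search(member_id)
    if redis.get(f"lock:{member_id}") is None:
        redis.set(f"member:{member_id}", doc)
    return doc
```

The lock expires on its own, so no unlock step is needed; caching resumes once ES has had time to refresh.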
Dual‑Center Multi‑Cluster Redis
Two Redis clusters are deployed in data centers A and B. Writes are performed to both (dual‑write) and reads are served locally, providing seamless failover if one center goes down.
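A minimal sketch of the dual‑write, local‑read pattern follows, using plain dictionaries in place of the two Redis clusters. DualCenterCache is a hypothetical name, and handling of a failed write to one side is deliberately left out because the article does not describe it.

```python
class DualCenterCache:
    """Cache client for a service instance running in one data center."""

    def __init__(self, local, remote):
        self.local = local    # Redis cluster in the same data center
        self.remote = remote  # Redis cluster in the other data center

    def set(self, key, value):
        # Dual-write: both centers receive every update.
        self.local[key] = value
        self.remote[key] = value

    def delete(self, key):
        # Invalidation must also reach both centers.
        self.local.pop(key, None)
        self.remote.pop(key, None)

    def get(self, key):
        # Reads never cross the data-center link.
        return self.local.get(key)
```

An instance in A is built as DualCenterCache(redis_a, redis_b) and one in B as DualCenterCache(redis_b, redis_a), so either center can keep serving reads from its own cluster if the other is lost.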
High‑Availability Member Primary Database
MySQL Dual‑Center Partitioned Cluster
Member data (over 1 billion rows) is sharded into more than 1,000 partitions, each holding roughly one million rows. The cluster follows a 1‑master‑3‑slave model, with the master in A and slaves in B, synchronized over a dedicated link with sub‑millisecond latency.
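Shard routing of this kind usually reduces to a pure function of the member ID. The sketch below assumes 1,024 shards, modulo hashing, and a member_NNNN table naming scheme; the article only says "more than 1,000" and does not give the hash or the naming.

```python
SHARD_COUNT = 1024  # assumption: the article only says "more than 1,000"


def shard_for(member_id):
    """Map a numeric member ID to a shard index (modulo hashing is an assumption)."""
    return member_id % SHARD_COUNT


def table_for(member_id):
    """Hypothetical table naming scheme: member_0000 .. member_1023."""
    return f"member_{shard_for(member_id):04d}"


def rows_per_shard(total_rows, shards=SHARD_COUNT):
    """Average rows per shard for a given total, rounded down."""
    return total_rows // shards
```

Every lookup by member ID therefore touches exactly one shard, which keeps each table small enough for fast index lookups.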
Smooth Migration from SQL Server to MySQL
The migration employs a three‑stage approach: full data sync, real‑time dual‑write, and gradual traffic gray‑release (A/B testing). Automatic retry, logging, and manual verification ensure data consistency throughout.
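The dual‑write and gray‑release stages can be illustrated as follows. Store, MigratingStore, and the CRC‑based bucketing are hypothetical; the sketch assumes the old database stays the source of truth until reads are fully switched.

```python
import zlib


class Store(dict):
    """Dict-backed stand-in for a database; set available = False to simulate an outage."""

    available = True

    def put(self, key, row):
        if not self.available:
            raise ConnectionError("store unavailable")
        self[key] = row


def in_gray_group(member_id, percent):
    """Deterministically place a member in the gray group for `percent` of traffic."""
    return zlib.crc32(str(member_id).encode()) % 100 < percent


class MigratingStore:
    def __init__(self, old, new, read_percent=0):
        self.old, self.new = old, new
        self.read_percent = read_percent  # share of reads served by the new store
        self.retry_queue = []             # failed writes to the new store

    def write(self, member_id, row):
        self.old.put(member_id, row)      # old store remains the source of truth
        try:
            self.new.put(member_id, row)
        except ConnectionError:
            self.retry_queue.append((member_id, row))

    def retry(self):
        """Replay queued writes; stops at the first failure and keeps the rest queued."""
        while self.retry_queue:
            member_id, row = self.retry_queue[0]
            self.new.put(member_id, row)
            self.retry_queue.pop(0)

    def read(self, member_id):
        store = self.new if in_gray_group(member_id, self.read_percent) else self.old
        return store.get(member_id)
```

Raising read_percent from 0 to 100 in steps moves reads over gradually, and because bucketing is deterministic a given member always sees the same store at a given percentage.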
MySQL + Elasticsearch Backup
To guard against failures of the DAL (data access layer) component, member data is also written to an ES backup cluster. If MySQL becomes unavailable, reads are switched to ES until MySQL recovers.
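The read switch can be sketched as a small wrapper around two read functions. MemberReader and its callbacks are hypothetical, and the sketch assumes the switch back is triggered by an external health check, which the article does not detail.

```python
class MemberReader:
    """Reads from MySQL and switches to the ES backup while MySQL is unavailable."""

    def __init__(self, mysql_read, es_read):
        self.mysql_read = mysql_read  # callable: member_id -> row, may raise ConnectionError
        self.es_read = es_read        # callable: member_id -> row
        self.use_backup = False

    def read(self, member_id):
        if not self.use_backup:
            try:
                return self.mysql_read(member_id)
            except ConnectionError:
                self.use_backup = True  # send this and all later reads to ES
        return self.es_read(member_id)

    def on_mysql_recovered(self):
        """Called by a health check (assumed) once MySQL is serving again."""
        self.use_backup = False
```

Keeping the switch sticky avoids paying a failed MySQL call on every request during an outage.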
Abnormal Member Relationship Governance
Complex bugs caused cross‑account binding errors, leading to privacy breaches and financial loss. The team implemented deep code‑level checks and automated remediation to prevent such anomalies.
Refined Flow‑Control and Degradation Strategies
Hotspot throttling limits excessive requests from a single member ID; per‑caller flow rules protect against buggy client loops; global thresholds reject traffic that would overwhelm the system. Degradation is triggered by average response time spikes or high error rates, automatically circuit‑breaking affected paths.
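Both mechanisms can be sketched compactly. The class names, the fixed‑window counting, and the thresholds below are assumptions chosen for illustration; the article does not say which algorithms or limits are used.

```python
from collections import defaultdict, deque


class WindowLimiter:
    """Fixed-window counter: at most `limit` requests per key per `window` seconds."""

    def __init__(self, limit, window, clock):
        self.limit, self.window, self.clock = limit, window, clock
        self.counts = defaultdict(int)  # old windows are never evicted in this sketch

    def allow(self, key):
        bucket = (key, int(self.clock() // self.window))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit


class CircuitBreaker:
    """Opens when error rate or mean latency over the last `size` calls crosses a threshold."""

    def __init__(self, size, max_error_rate, max_avg_latency):
        self.samples = deque(maxlen=size)  # (latency_seconds, succeeded)
        self.max_error_rate = max_error_rate
        self.max_avg_latency = max_avg_latency

    def record(self, latency, ok):
        self.samples.append((latency, ok))

    def is_open(self):
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data to judge yet
        errors = sum(1 for _, ok in self.samples if not ok)
        avg = sum(latency for latency, _ in self.samples) / len(self.samples)
        return errors / len(self.samples) > self.max_error_rate or avg > self.max_avg_latency
```

The same limiter covers all three rules by changing the key: a member ID for hotspot throttling, a caller name for per‑caller rules, and a constant for the global threshold. A production breaker would also need a half‑open state to probe for recovery, which is omitted here.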
Conclusion
The combination of dual‑center ES, Redis, and MySQL clusters, together with meticulous traffic isolation, optimization, and graceful migration techniques, delivers a resilient, scalable membership platform capable of handling billions of users and extreme traffic bursts.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn