
High‑Availability Architecture for a Billion‑Scale Membership System: Dual‑Center ES, Redis, and MySQL Solutions

This article details the design and implementation of a highly available, high‑performance membership system serving over a billion users, covering dual‑center Elasticsearch clusters, traffic‑isolated three‑cluster ES architecture, Redis dual‑center caching, MySQL partitioned clusters, migration strategies, and refined flow‑control and degradation mechanisms.

Code Ape Tech Column

Introduction

The membership system is a core service tightly coupled with the order flow of every business line; any failure blocks ordering across the entire company, so the system must guarantee both high performance and high availability.

After the merger of Tongcheng and eLong, the system needed to unify member data across multiple platforms (apps, mini‑programs) and handle rapidly increasing request volumes, reaching over 20,000 TPS during peak holidays.

Elasticsearch High‑Availability Solution

Dual‑Center Primary‑Backup Cluster

Two data centers (A and B) host separate ES clusters; the primary cluster runs in A, the backup in B. Writes go to the primary, and data is synchronized to the backup via MQ. If the primary fails, traffic is switched to the backup within seconds, and the primary is resynchronized from the backup once it recovers.
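The failover path can be sketched as a thin client wrapper that retries on the backup cluster when the primary is unreachable. This is a minimal model, not the production implementation; the class names, endpoint names, and health flag are illustrative.

```python
class ClusterClient:
    """Minimal stand-in for an ES client bound to one data center."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def search(self, query):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unreachable")
        return {"cluster": self.name, "hits": [query]}

class DualCenterSearch:
    """Reads hit the active cluster; on failure traffic flips to the backup."""
    def __init__(self, primary, backup):
        self.primary, self.backup = primary, backup
        self.active = primary

    def search(self, query):
        try:
            return self.active.search(query)
        except ConnectionError:
            # Second-level switch: fail over and retry once on the backup.
            self.active = self.backup
            return self.active.search(query)

primary = ClusterClient("dc-a-primary")
backup = ClusterClient("dc-b-backup")
client = DualCenterSearch(primary, backup)
print(client.search("member:42")["cluster"])   # dc-a-primary
primary.healthy = False
print(client.search("member:42")["cluster"])   # dc-b-backup
```

Once the switch fires, subsequent reads stay on the backup until an operator (or a recovery probe) flips `active` back.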

Traffic‑Isolation Three‑Cluster Architecture

Requests are classified into two priority groups: (1) order‑critical requests that must be high‑priority, and (2) marketing‑driven high‑TPS requests that can be isolated. A dedicated ES cluster handles the marketing burst traffic, preventing it from affecting the primary order‑critical cluster.
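The routing rule behind the split can be as simple as mapping the caller's identity to a cluster: order-critical callers stay on the protected cluster, everything else lands on the marketing cluster. The caller names and cluster names below are invented for illustration.

```python
# Hypothetical caller allow-list for the order-critical path.
ORDER_CRITICAL = {"order-service", "checkout", "refund"}

def pick_cluster(caller: str) -> str:
    """Route order-critical callers to the protected cluster;
    burst-prone marketing traffic gets its own isolated cluster."""
    if caller in ORDER_CRITICAL:
        return "es-order-cluster"      # protected, low-latency path
    return "es-marketing-cluster"      # absorbs high-TPS promo bursts
```

The point of the indirection is that a promo campaign saturating `es-marketing-cluster` cannot add a single query to the order path.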

Deep ES Optimizations

- Balanced shard distribution to avoid hotspot nodes.
- Search thread‑pool size capped at cpu_cores * 3 / 2 + 1 to prevent CPU thrashing.
- Shard size kept under 50 GB.
- Unnecessary text fields removed, keeping only keyword fields for member lookups.
- filter used instead of query to skip relevance scoring.
- Result sorting moved into the member service JVM to reduce ES load.
- Routing keys added to target specific shards, cutting unnecessary broadcast queries.

These tweaks dramatically reduced CPU usage and improved query latency.
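The routing-key optimization works because ES sends a routed request only to the shard at hash(routing) % number_of_shards, instead of fanning out to every shard. A rough model of the mechanics (real ES uses a murmur3 hash; md5 here is just a deterministic stand-in, and the shard count is invented):

```python
import hashlib

NUM_SHARDS = 8  # illustrative shard count, not the production value

def target_shard(routing_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing key to a single shard, as ES does internally."""
    h = int(hashlib.md5(routing_key.encode()).hexdigest(), 16)
    return h % num_shards
```

Because the same member ID always maps to the same shard, a member lookup with routing touches one shard instead of broadcasting to all of them.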

Redis Caching Strategy

Initially the member system avoided caching due to real‑time consistency requirements, but a sudden spike from a ticket‑blind‑box event prompted the introduction of a cache.

Resolving ES‑to‑Redis Inconsistency

Because ES is near‑real‑time (≈1 s indexing delay), a race could cause a concurrent reader to fetch stale data from ES and write it back to Redis. The solution adds a 2‑second distributed lock around ES updates plus deferred cache deletion, ensuring that concurrent reads cannot re‑cache stale data while an update is in flight.
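The protocol can be sketched with an in-memory dict standing in for Redis. The 2-second TTL follows the text; the function names and lock representation are illustrative assumptions.

```python
import time

store, locks = {}, {}   # fake Redis: cache namespace + lock namespace

def acquire_update_lock(member_id, ttl=2.0):
    """SET key NX EX-style lock covering the ES near-real-time window."""
    now = time.monotonic()
    exp = locks.get(member_id)
    if exp is not None and exp > now:
        return False
    locks[member_id] = now + ttl
    return True

def on_member_update(member_id, write_to_es):
    # Take the lock, write ES, then delete the cache entry so no
    # reader keeps serving (or re-caching) the pre-update value.
    acquire_update_lock(member_id)
    write_to_es(member_id)
    store.pop(member_id, None)        # deferred cache deletion

def read_member(member_id, read_from_es):
    if member_id in store:
        return store[member_id]
    value = read_from_es(member_id)
    # Cache only if no update is in flight; inside the 2 s window
    # the value might be stale, so serve it without caching.
    if locks.get(member_id, 0) <= time.monotonic():
        store[member_id] = value
    return value
```

A reader arriving inside the lock window still gets an answer, but the potentially stale result is never written back to Redis, so the next read after ES catches up sees fresh data.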

Dual‑Center Multi‑Cluster Redis

Two Redis clusters are deployed in data centers A and B. Writes are performed to both (dual‑write) and reads are served locally, providing seamless failover if one center goes down.
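A minimal model of the dual-write / local-read pattern, with two dicts standing in for the Redis clusters in data centers A and B; the failure tracking is simplified for illustration.

```python
class DualCenterCache:
    def __init__(self):
        self.clusters = {"A": {}, "B": {}}
        self.down = set()   # centers currently unavailable

    def write(self, key, value):
        # Dual-write: every write goes to all reachable centers.
        for name, cluster in self.clusters.items():
            if name not in self.down:
                cluster[key] = value

    def read(self, key, local):
        # Reads stay local; if the local center is down, fall back
        # to the other center transparently.
        order = [local] + [n for n in self.clusters if n != local]
        for name in order:
            if name not in self.down:
                return self.clusters[name].get(key)
        return None
```

Because both centers hold a full copy, losing either one degrades nothing but read locality.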

High‑Availability Member Primary Database

MySQL Dual‑Center Partitioned Cluster

Member data (over a billion rows) is sharded into more than 1,000 partitions, each holding roughly one million rows. The cluster follows a 1‑master‑3‑slave model, with the master in A and slaves in B, synchronized over a dedicated link with sub‑millisecond latency.
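A partitioned layout like this needs a deterministic routing function from member ID to a physical database and table. The sketch below assumes a 32-database x 32-table split (1,024 partitions) and an md5-based hash; the modulus and naming convention are illustrative, not the production scheme.

```python
import hashlib

NUM_PARTITIONS = 1024  # assumed: 32 schemas x 32 tables

def locate(member_id: int) -> str:
    """Map a member ID to its database schema and table name."""
    h = int(hashlib.md5(str(member_id).encode()).hexdigest(), 16)
    p = h % NUM_PARTITIONS
    # e.g. partition 517 -> schema member_db_16, table member_05
    return f"member_db_{p // 32:02d}.member_{p % 32:02d}"
```

With ~1 million rows per partition, every query that carries the member ID stays on a single small table, keeping index depth and lock contention low.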

Smooth Migration from SQL Server to MySQL

The migration employs a three‑stage approach: full data sync, real‑time dual‑write, and gradual traffic gray‑release (A/B testing). Automatic retry, logging, and manual verification ensure data consistency throughout.
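The dual-write stage can be sketched as follows: the old SQL Server path stays authoritative while every write is mirrored to MySQL with retry and logging, and failed mirrors are flagged for the manual verification pass. The writer callables are placeholders for the real data-access layers.

```python
import logging

def dual_write(record, write_sqlserver, write_mysql, retries=3):
    """Mirror a write to MySQL while SQL Server remains source of truth."""
    write_sqlserver(record)            # old path, authoritative
    for attempt in range(1, retries + 1):
        try:
            write_mysql(record)        # new path, best-effort with retry
            return True
        except Exception as exc:
            logging.warning("mysql mirror attempt %d failed: %s", attempt, exc)
    # Exhausted retries: log for the manual reconciliation step.
    logging.error("record %r needs manual reconciliation", record)
    return False
```

Only after the mirrored data passes verification does the gray-release stage start shifting read traffic to MySQL.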

MySQL + Elasticsearch Backup

To guard against DAL component failures, member data is also written to an ES backup cluster. If MySQL becomes unavailable, reads are switched to ES until MySQL recovers.
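The fallback read is a small wrapper: prefer MySQL, degrade to the ES backup copy when the primary store is unreachable. The reader callables are stand-ins for the real DAOs.

```python
def fetch_member(member_id, read_mysql, read_es_backup):
    """Serve from MySQL; degrade to the ES backup if MySQL is down."""
    try:
        return read_mysql(member_id)
    except ConnectionError:
        # Degraded mode: serve the ES copy until MySQL recovers.
        return read_es_backup(member_id)
```

Since the ES backup is fed by the same write path, the degraded reads stay close to fresh, trading a little staleness for availability.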

Abnormal Member Relationship Governance

Complex bugs caused cross‑account binding errors, leading to privacy breaches and financial loss. The team implemented deep code‑level checks and automated remediation to prevent such anomalies.

Refined Flow‑Control and Degradation Strategies

Hotspot throttling limits excessive requests from a single member ID; per‑caller flow rules protect against buggy client loops; global thresholds reject traffic that would overwhelm the system. Degradation is triggered by average response time spikes or high error rates, automatically circuit‑breaking affected paths.
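A toy version of the hotspot rule: cap requests per member ID within a one-second sliding window. The limit of 100 req/s is an invented example, not the production threshold.

```python
import time
from collections import defaultdict, deque

class HotspotLimiter:
    def __init__(self, limit=100, window=1.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(deque)   # member_id -> timestamps

    def allow(self, member_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[member_id]
        while q and q[0] <= now - self.window:
            q.popleft()                  # drop hits outside the window
        if len(q) >= self.limit:
            return False                 # hot key: reject, protect backend
        q.append(now)
        return True
```

Per-caller and global rules compose the same way with coarser keys (caller ID, or a single global key), and the degradation layer sits behind this, circuit-breaking a path when response time or error rate crosses its threshold.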

Conclusion

The combination of dual‑center ES, Redis, and MySQL clusters, together with meticulous traffic isolation, optimization, and graceful migration techniques, delivers a resilient, scalable membership platform capable of handling billions of users and extreme traffic bursts.

Tags: Distributed Systems, Elasticsearch, High Availability, Redis, MySQL, Scaling
Written by

Code Ape Tech Column

Former Ant Group P8 engineer and pure technologist, sharing full‑stack Java content, interview preparation, and career advice through this column. Site: java-family.cn
