
High‑Availability Architecture for a Billion‑User Membership System: ES Dual‑Center Clusters, Traffic Isolation, Redis Caching, and MySQL Migration

The article describes how a large‑scale membership system serving over a billion users achieves high performance and availability through dual‑center Elasticsearch clusters, traffic‑isolated three‑cluster designs, Redis caching with distributed locks, and a seamless migration from SQL Server to sharded MySQL, while also detailing operational safeguards and fine‑grained flow‑control strategies.

Top Architect

The membership system is a core service that connects all business lines; any outage would block order placement across the company, so it must guarantee high performance, high availability, and stable service.

After the merger of two companies, the system needed to unify member data across multiple platforms (mobile apps, mini‑programs, etc.), handling cross‑marketing scenarios such as linking train‑ticket purchases to hotel red‑packets.

Because traffic spikes (e.g., holiday peaks) can exceed 20k TPS, the team designed a dual‑center Elasticsearch (ES) master‑slave cluster: the primary cluster runs in data center A, the standby in data center B, with data synchronized via MQ. Failover is achieved by switching reads/writes to the standby cluster.
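The write path of this dual-center design can be sketched as follows. This is a minimal illustration, not the team's actual implementation: the primary and standby clusters are stand-ins (plain dicts), and an in-process queue plays the role of MQ. Failover then reduces to flipping the active read/write target.

```python
import queue

class DualCenterWriter:
    """Writes go to the primary ES cluster (data center A); a change event
    is published to MQ so data center B's consumer can replay it on the
    standby cluster."""
    def __init__(self, primary_store, mq):
        self.primary = primary_store
        self.mq = mq

    def index_member(self, member_id, doc):
        self.primary[member_id] = doc               # synchronous write to primary
        self.mq.put({"id": member_id, "doc": doc})  # async replication event

def replicate(mq, standby_store):
    """Standby-side consumer: drain MQ events into the standby cluster."""
    while not mq.empty():
        event = mq.get()
        standby_store[event["id"]] = event["doc"]

primary, standby, mq = {}, {}, queue.Queue()
writer = DualCenterWriter(primary, mq)
writer.index_member("m1", {"level": "gold"})
replicate(mq, standby)
active = standby  # on failover, switch reads/writes to data center B
```

In production the replication consumer would run continuously in data center B and the switch would be driven by health checks, but the shape of the data flow is the same.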

To prevent marketing‑driven traffic from affecting the main order flow, a three‑cluster ES architecture was introduced: a dedicated high‑TPS cluster for marketing bursts, isolated from the primary ES cluster serving order‑critical queries.
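The isolation itself is just cluster selection at the routing layer: each business scene resolves to its own ES endpoint, so a marketing burst cannot exhaust the order cluster's capacity. A minimal sketch (the endpoint names and scene keys are hypothetical):

```python
# Hypothetical endpoints for the three isolated ES clusters.
CLUSTERS = {
    "order": "es-order.internal:9200",          # order-critical member queries
    "marketing": "es-marketing.internal:9200",  # high-TPS marketing bursts
    "default": "es-member.internal:9200",       # everything else
}

def cluster_for(scene: str) -> str:
    """Pick an ES endpoint by business scene; unknown scenes fall back
    to the default cluster rather than the order-critical one."""
    return CLUSTERS.get(scene, CLUSTERS["default"])
```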

Deep ES optimizations were applied, including balanced shard distribution, appropriate thread‑pool sizing (cpu * 1.5 + 1), limiting individual shard size to ≤50 GB, removing unnecessary text fields, using filters instead of queries, moving sorting to the application layer, and adding routing keys to reduce shard fan‑out.
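Two of these optimizations show up directly in the query body: `filter` clauses skip relevance scoring and are cacheable, and a `routing` parameter confines the search to one shard instead of fanning out to all of them. A sketch of what such a request might look like (field names are illustrative, not the system's actual mapping):

```python
def member_query(member_id, status="active"):
    """Build an ES search body that filters (no scoring) and supplies a
    routing key so only one shard is hit instead of fanning out."""
    body = {
        "query": {
            "bool": {
                # `filter` skips relevance scoring and is cacheable,
                # unlike scoring `must`/`query` clauses
                "filter": [
                    {"term": {"member_id": member_id}},
                    {"term": {"status": status}},
                ]
            }
        },
        # no "sort" clause: sorting is done in the application layer
        "size": 10,
    }
    params = {"routing": member_id}  # must match the key used at index time
    return body, params

body, params = member_query("m123")
```

The routing value only helps if documents were indexed with the same key; otherwise the search would silently miss data on other shards.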

Redis caching was initially avoided due to real‑time consistency concerns, but after a high‑traffic blind‑box event, a cache layer with a 2‑second distributed lock was added to avoid stale data caused by ES’s near‑real‑time delay.
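One common reading of this pattern: on a cache miss, only the caller that wins a short-lived distributed lock (SET NX PX in Redis terms) reloads from ES, so a stampede of misses cannot repopulate the cache with a document ES has not yet refreshed. The sketch below substitutes a tiny in-memory stand-in for Redis so it runs anywhere; the key names and TTLs are illustrative, not the article's values (except the 2-second lock).

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for the two Redis commands used here."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set(self, key, value, nx=False, px=None):
        now = time.monotonic()
        cur = self.store.get(key)
        if nx and cur and cur[1] > now:
            return None  # lock already held
        self.store[key] = (value, now + (px or 0) / 1000.0)
        return True

    def get(self, key):
        cur = self.store.get(key)
        return cur[0] if cur and cur[1] > time.monotonic() else None

def get_member(r, member_id, load_from_es):
    """Cache-aside read guarded by a 2-second distributed lock, absorbing
    ES's near-real-time delay instead of caching stale reads."""
    cached = r.get(f"member:{member_id}")
    if cached is not None:
        return cached
    if r.set(f"lock:{member_id}", "1", nx=True, px=2000):
        doc = load_from_es(member_id)
        r.set(f"member:{member_id}", doc, px=60_000)
        return doc
    time.sleep(0.05)  # lock held elsewhere: brief wait, then retry the cache
    return r.get(f"member:{member_id}") or load_from_es(member_id)

r = FakeRedis()
doc = get_member(r, "m1", lambda mid: {"id": mid, "level": "gold"})
```

With a real Redis client (e.g. redis-py), `set(..., nx=True, px=2000)` has the same semantics as this stub.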

The underlying relational database (originally SQL Server) could no longer scale beyond a billion rows, so a dual‑center MySQL sharding solution was adopted: >1,000 shards, 1‑master‑3‑slave per data center, with near‑zero replication lag and local‑read routing.
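Shard selection in such a design is typically a stable hash of the member id modulo the shard count, with reads routed to a replica in the caller's own data center. A sketch under assumed names (1,024 shards and the DSN format are illustrative; the article only says ">1,000 shards"):

```python
import hashlib

SHARD_COUNT = 1024  # assumption; the article says ">1,000 shards"

def shard_for(member_id: str) -> int:
    """Map a member id to a MySQL shard. A stable hash (not Python's
    salted built-in hash()) keeps routing consistent across processes."""
    digest = hashlib.md5(member_id.encode()).hexdigest()
    return int(digest, 16) % SHARD_COUNT

def read_dsn_for(member_id: str, local_dc: str) -> str:
    """Local-read routing: each data center has its own 1-master-3-slave
    group, so reads stay on a slave in the caller's own data center."""
    return f"mysql://{local_dc}-slave/member_{shard_for(member_id):04d}"
```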

Migration from SQL Server to MySQL employed a phased approach: full data sync, real‑time dual‑write, incremental sync, and a gradual gray release shifting read traffic to MySQL a slice at a time, while continuously verifying data consistency.
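The dual-write and gray-release phases can be captured in a single data-access wrapper: every write lands in both stores, a percentage knob decides which store serves each read, and a verify step compares the two copies. This is a sketch with stand-in stores, not the team's DAL; the bucketing function is an assumption.

```python
class DualWriteDAO:
    """During migration, every write lands in both SQL Server (old) and
    MySQL (new); reads are gray-released by member-id bucket so traffic
    can shift to MySQL a few percent at a time."""
    def __init__(self, old_db, new_db, read_from_new_pct=0):
        self.old, self.new = old_db, new_db
        self.read_from_new_pct = read_from_new_pct  # 0..100

    def write(self, member_id, doc):
        self.old[member_id] = doc
        self.new[member_id] = doc   # dual-write keeps both stores in step

    def read(self, member_id):
        bucket = sum(member_id.encode()) % 100  # stable per-member bucket
        store = self.new if bucket < self.read_from_new_pct else self.old
        return store.get(member_id)

    def verify(self, member_id):
        """Consistency check run continuously during the gray release."""
        return self.old.get(member_id) == self.new.get(member_id)

dao = DualWriteDAO({}, {}, read_from_new_pct=10)
dao.write("m1", {"level": "gold"})
```

Raising `read_from_new_pct` from 0 to 100 over time completes the cutover; any `verify` mismatch is a signal to pause the rollout.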

To further improve resilience, the system added a MySQL‑ES master‑slave fallback: if the DAL component or MySQL fails, reads/writes can be switched to ES and later synchronized back.
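The fallback can be pictured as a health flag in the data-access layer: while MySQL is down, ES serves both reads and writes, and writes made during the outage are queued for backfill once MySQL recovers. A minimal sketch with dict stand-ins for both stores (assumed structure, not the actual DAL):

```python
class ResilientDAL:
    """If MySQL (or the DAL in front of it) fails, flip reads/writes to
    the ES copy; writes made during the outage are replayed afterwards."""
    def __init__(self, mysql, es):
        self.mysql, self.es = mysql, es
        self.mysql_healthy = True
        self.pending = []  # writes to sync back to MySQL after recovery

    def write(self, member_id, doc):
        self.es[member_id] = doc
        if self.mysql_healthy:
            self.mysql[member_id] = doc
        else:
            self.pending.append((member_id, doc))  # ES is system of record

    def read(self, member_id):
        store = self.mysql if self.mysql_healthy else self.es
        return store.get(member_id)

    def recover(self):
        """MySQL is back: replay writes captured during the outage."""
        self.mysql_healthy = True
        for member_id, doc in self.pending:
            self.mysql[member_id] = doc
        self.pending.clear()

dal = ResilientDAL({}, {})
dal.mysql_healthy = False          # simulate a MySQL outage
dal.write("m1", {"level": "gold"})
dal.recover()                      # later synchronized back, as the article notes
```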

Additional safeguards include fine‑grained flow‑control (hotspot limiting, per‑caller limits, global TPS caps) and degradation strategies based on response time, error count, and error ratio.
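Both safeguards are simple to sketch: the flow control as sliding one-second windows with per-caller and global caps, and the degradation trigger as thresholds on the three signals the article names. All limit values below are illustrative, not the system's real configuration.

```python
import time
from collections import defaultdict, deque

class FlowControl:
    """Per-caller TPS limits plus a global cap, using 1 s sliding windows."""
    def __init__(self, global_tps=20000, per_caller_tps=2000):
        self.global_tps = global_tps
        self.per_caller_tps = per_caller_tps
        self.windows = defaultdict(deque)  # key -> request timestamps

    def allow(self, caller, now=None):
        now = time.monotonic() if now is None else now
        for key in (caller, "__global__"):
            win = self.windows[key]
            while win and win[0] <= now - 1.0:  # drop events older than 1 s
                win.popleft()
        if len(self.windows["__global__"]) >= self.global_tps:
            return False
        if len(self.windows[caller]) >= self.per_caller_tps:
            return False
        self.windows[caller].append(now)
        self.windows["__global__"].append(now)
        return True

def should_degrade(errors, total, avg_ms, max_ratio=0.5, max_ms=500):
    """Trip degradation on error ratio or response time, following the
    article's three signals (thresholds here are illustrative)."""
    return total > 0 and (errors / total >= max_ratio or avg_ms >= max_ms)
```

Hotspot limiting would add a per-key (e.g. per-member-id) window alongside the per-caller one; the mechanics are identical.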

Finally, the article outlines future work on more precise traffic control, account governance, and continuous reliability engineering.

Tags: System Architecture, Elasticsearch, High Availability, Redis, MySQL, Traffic Isolation
Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, as well as architecture evolution with internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
