
Building a Billion‑User Membership System: ES, Redis & MySQL High‑Availability

This article details how a large‑scale membership platform achieves high performance and near‑zero downtime by employing dual‑center Elasticsearch clusters, traffic‑isolated ES architectures, deep ES optimizations, Redis caching with distributed locks, and a seamless MySQL migration with partitioned, dual‑center databases.

21CTO

Nov 8, 2022


1. Background

The membership system is a foundational service tightly coupled with the order flow of all business lines; any failure prevents users from placing orders, so it must guarantee high performance, high availability, and stable service.

2. Elasticsearch High‑Availability Solution

2.1 Dual‑center primary‑backup ES cluster – Two data centers (A and B) host the primary and backup ES clusters respectively. Writes go to the primary cluster in A and are synchronized to the backup in B via MQ. If the primary fails, the system switches reads and writes to the backup with minimal downtime, then resynchronizes once the primary recovers.
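The switch-over described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class `DualCenterES`, its methods, and the use of a `deque` in place of the MQ are hypothetical stand-ins, not the platform's actual components.

```python
from collections import deque


class DualCenterES:
    """Toy model of a primary/backup cluster pair kept in sync through a queue."""

    def __init__(self):
        self.clusters = {"A": {}, "B": {}}  # dicts stand in for ES clusters
        self.active = "A"                   # cluster currently serving traffic
        self.queue = deque()                # stands in for the MQ

    @property
    def standby(self):
        return "B" if self.active == "A" else "A"

    def write(self, doc_id, doc):
        # Write to the active cluster, then enqueue the change for the other one.
        self.clusters[self.active][doc_id] = doc
        self.queue.append((self.standby, doc_id, doc))

    def read(self, doc_id):
        return self.clusters[self.active].get(doc_id)

    def drain(self, reachable=("A", "B")):
        # Apply queued changes; keep those whose target is still unreachable.
        pending = deque()
        while self.queue:
            target, doc_id, doc = self.queue.popleft()
            if target in reachable:
                self.clusters[target][doc_id] = doc
            else:
                pending.append((target, doc_id, doc))
        self.queue = pending

    def fail_over(self):
        # Point reads and writes at the former standby.
        self.active = self.standby
```

Changes written while one center is down stay queued and are replayed by a later `drain()` once it is reachable again, which mirrors the resynchronization step.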

2.2 Three‑cluster traffic isolation – Marketing activities can produce traffic spikes that would threaten the main ES cluster. A third ES cluster, separate from the primary‑backup pair, therefore handles high‑TPS marketing requests, isolating them from the primary cluster that serves order‑critical traffic.

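2.3 Deep ES optimizations – Several issues were addressed: uneven shard distribution, oversized thread pools, shards with too much memory, dual‑field mappings, and unnecessary query scoring (lookups that only need to match can run in filter context, which skips scoring). Routing keys were also added, so that a query goes only to the shard holding the member instead of fanning out to every shard. Together these changes dramatically reduced CPU usage and improved query latency.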
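Two of these optimizations, filter‑context queries and routing keys, show up directly in the shape of a search request. The sketch below only builds the request as a plain dictionary. The index name `member`, the field `member_id`, and the helper `build_member_lookup` are assumptions for illustration. The routing value must be the same one used when the document was indexed.

```python
def build_member_lookup(member_id: str, index: str = "member") -> dict:
    """Build a search request that skips scoring and targets a single shard."""
    return {
        "index": index,
        # Same routing value as at index time, so only one shard is queried.
        "routing": member_id,
        "body": {
            "query": {
                "bool": {
                    # Clauses under "filter" match without computing a score.
                    "filter": [{"term": {"member_id": member_id}}]
                }
            },
            "size": 1,
        },
    }
```

The `routing` entry corresponds to the search API's `routing` parameter, and `body` is the JSON request body.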

3. Membership Redis Caching Scheme

Initially the system avoided caching because ES is only near‑real‑time: an update becomes visible to searches after a short delay, so a cache filled right after an update could hold stale data. A sudden surge in traffic from a blind‑box promotion nevertheless prompted the introduction of a Redis cache. To keep it consistent, each ES update is guarded by a 2‑second distributed lock. The design also covers lock‑contention cases, so that a concurrent read cannot write a stale value back into the cache.

Cache hit rates exceed 90%, significantly relieving ES pressure and improving overall performance.
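One way such a lock can protect the cache is sketched below. This is an assumed interpretation, not a confirmed description of the platform's code: an update takes a 2‑second lock, writes to ES, and deletes the cached entry, and a read that finds the lock returns the ES result without caching it. `TinyRedis` is a hypothetical in‑memory stand‑in for Redis, a plain dict stands in for ES, and the key names are invented.

```python
import time

LOCK_TTL_SECONDS = 2  # lock duration stated in the article


class TinyRedis:
    """In-memory stand-in for Redis with per-key expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired keys
            return None
        return value

    def set(self, key, value, ex=None):
        expires_at = self._clock() + ex if ex is not None else None
        self._data[key] = (value, expires_at)

    def delete(self, key):
        self._data.pop(key, None)


class MemberCache:
    def __init__(self, redis, es):
        self.redis = redis
        self.es = es  # a dict standing in for the ES cluster

    def update(self, member_id, doc):
        # Take the lock first so concurrent readers stop filling the cache.
        self.redis.set(f"lock:{member_id}", "1", ex=LOCK_TTL_SECONDS)
        self.es[member_id] = doc
        self.redis.delete(f"member:{member_id}")

    def get(self, member_id):
        cached = self.redis.get(f"member:{member_id}")
        if cached is not None:
            return cached
        doc = self.es.get(member_id)
        # While the lock exists, ES may still lag, so skip the cache write.
        if doc is not None and self.redis.get(f"lock:{member_id}") is None:
            self.redis.set(f"member:{member_id}", doc)
        return doc
```

The injectable `clock` exists only so that expiry can be exercised deterministically.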

4. High‑Availability Membership Primary Database

The original SQL Server instance reached its physical limits with over a billion records. A dual‑center MySQL partitioned cluster was adopted instead, splitting the data into more than 1,000 shards (≈1 million rows each) with a 1‑master‑3‑slave architecture. The master resides in data center A and the slaves in data center B, synchronized over a dedicated link with sub‑millisecond latency. Reads are routed locally and writes go to the master, achieving sub‑10 ms latency and >20 k TPS under load.
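The shard arithmetic and the read/write routing can be sketched as follows. The shard count of 1024, the CRC32 hash, the table naming, and the node labels are all assumptions. The article only states "more than 1,000 shards" and does not say how keys are mapped or which node serves reads issued from data center A.

```python
import zlib

SHARD_COUNT = 1024  # assumption: the article only says "more than 1,000"

# One billion rows over 1024 shards is roughly a million rows per shard.
ROWS_PER_SHARD = 1_000_000_000 // SHARD_COUNT


def shard_for(member_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a member ID to a shard with a hash that is stable across processes."""
    return zlib.crc32(member_id.encode("utf-8")) % shard_count


def table_for(member_id: str) -> str:
    """Hypothetical naming scheme: member_0000 .. member_1023."""
    return f"member_{shard_for(member_id):04d}"


def node_for(operation: str, caller_dc: str) -> str:
    """Writes go to the master in A; reads stay in the caller's data center."""
    if operation == "write":
        return "A-master"
    # Assumption: A hosts only the master, so reads from A use it as well.
    return "A-master" if caller_dc == "A" else "B-replica"
```

Python's built‑in `hash()` is avoided here because it is randomized per process for strings, which would send the same member to different shards on different hosts.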

5. Migration and Dual‑Write Strategy

To migrate from SQL Server to MySQL without downtime, a real‑time dual‑write approach was used: each write goes to SQL Server and, asynchronously, to MySQL, with retries and logging on failure. An incremental sync fills the gap between the full data transfer and the moment dual‑write is activated. Read traffic is then shifted gradually from SQL Server to MySQL using A/B testing, with automated comparison of the two results to confirm consistency before the full cut‑over.
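A minimal sketch of the dual‑write and the compared, gradually shifted reads follows. For brevity the second write runs inline rather than asynchronously. The function names, the retry count of 3, and the percentage bucketing by CRC32 are assumptions, and dicts stand in for the two databases.

```python
import logging
import zlib

logger = logging.getLogger("dual_write")


def dual_write(record_id, record, old_store, write_new, retries=3):
    """Write to the old store first, then to the new one with retries."""
    old_store[record_id] = record  # the old database stays the source of truth
    for attempt in range(1, retries + 1):
        try:
            write_new(record_id, record)
            return True
        except Exception as exc:
            logger.warning("new-store write failed (attempt %d/%d) for %s: %s",
                           attempt, retries, record_id, exc)
    # Logged so the record can be repaired later from the old store.
    logger.error("giving up on %s after %d attempts", record_id, retries)
    return False


def read_gray(record_id, old_store, new_store, percent_on_new):
    """Read both stores, compare, and serve the new one for a share of IDs."""
    old_value = old_store.get(record_id)
    new_value = new_store.get(record_id)
    consistent = old_value == new_value
    # Stable bucket in [0, 100) so a given ID always lands on the same side.
    in_gray_bucket = zlib.crc32(record_id.encode("utf-8")) % 100 < percent_on_new
    value = new_value if (in_gray_bucket and consistent) else old_value
    return value, consistent
```

Serving the old value whenever the two results differ keeps a mismatch from reaching users while it is investigated. Raising `percent_on_new` step by step to 100 corresponds to the gradual cut‑over.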

6. ES as a Fallback for DAL Failures

If the DAL component or MySQL fails, reads and writes can be switched to the ES cluster, with later synchronization back to MySQL once it recovers.

7. Abnormal Membership Relationship Governance

Complex logic was implemented to detect and correct abnormal member bindings that could cause cross‑account data exposure, ensuring data correctness and preventing severe customer complaints.

8. Future: Fine‑Grained Flow Control and Degradation

Plans include hotspot throttling for abusive accounts, per‑caller flow‑control rules to prevent accidental traffic bursts, and global limits to protect the system from traffic beyond its 30 k TPS capacity. Degradation strategies based on average response time and error rates will automatically trigger circuit‑breakers when thresholds are exceeded.
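Since these are plans rather than a finished design, the following is only a rough sketch of what per‑caller limits, a global limit, and a threshold‑based breaker could look like. The class names, the one‑second fixed window, and all thresholds are assumptions. A production system would more likely rely on an existing flow‑control library.

```python
import time
from collections import deque


class WindowLimiter:
    """Fixed one-second window with a global cap and a per-caller cap."""

    def __init__(self, global_limit, per_caller_limit, clock=time.monotonic):
        self.global_limit = global_limit
        self.per_caller_limit = per_caller_limit
        self.clock = clock
        self.window = None
        self.total = 0
        self.per_caller = {}

    def allow(self, caller):
        window = int(self.clock())
        if window != self.window:  # a new second has started: reset counters
            self.window, self.total, self.per_caller = window, 0, {}
        used = self.per_caller.get(caller, 0)
        if self.total >= self.global_limit or used >= self.per_caller_limit:
            return False
        self.total += 1
        self.per_caller[caller] = used + 1
        return True


class Breaker:
    """Opens when average latency or error rate over recent calls is too high."""

    def __init__(self, max_avg_ms, max_error_rate, window=100, min_samples=10):
        self.max_avg_ms = max_avg_ms
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.samples = deque(maxlen=window)  # (elapsed_ms, succeeded) pairs

    def record(self, elapsed_ms, succeeded):
        self.samples.append((elapsed_ms, succeeded))

    def is_open(self):
        if len(self.samples) < self.min_samples:
            return False  # not enough data to judge
        count = len(self.samples)
        avg_ms = sum(ms for ms, _ in self.samples) / count
        error_rate = sum(1 for _, ok in self.samples if not ok) / count
        return avg_ms > self.max_avg_ms or error_rate > self.max_error_rate
```

With the capacity quoted above, `global_limit` would be set at or below 30,000 per second, and callers would check `is_open()` before sending a request and degrade when it returns `True`.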
