How We Built a Billion‑User High‑Availability Membership System with Dual‑Center ES, Redis, and MySQL
This article details the design and implementation of a high‑performance, highly available membership platform serving billions of users, covering dual‑center Elasticsearch clusters, traffic‑isolated three‑cluster ES architecture, Redis caching strategies, MySQL dual‑center partitioning, seamless migration, and fine‑grained flow‑control and degradation mechanisms.
Background
The membership system is a core service tightly coupled with the order flow of all business lines; any outage blocks user orders across the company. After the merger of Tongcheng and eLong, the system must support cross‑platform member relationships for multiple apps and mini‑programs, leading to massive request volumes (over 20k TPS during peak holidays).
ES High‑Availability Solution
We adopted a dual‑center primary‑backup Elasticsearch cluster: the primary cluster resides in data center A, the backup in data center B. Reads and writes go to the primary; data is synchronized to the backup via MQ. In case the primary fails, traffic is switched to the backup with minimal downtime, and later synchronized back.
ES Traffic‑Isolation Three‑Cluster Architecture
To protect the main order flow from traffic spikes (e.g., marketing flash‑sales), we classified request types into two groups: high‑priority order‑related requests and high‑TPS marketing requests. A dedicated ES cluster handles the marketing traffic, isolating it from the primary ES cluster.
Deep ES Optimization
Balanced shard distribution to avoid hot nodes.
Thread‑pool size limited to cpu_core * 3 / 2 + 1 to prevent CPU spikes.
Shard memory kept under 50 GB.
Removed duplicate text fields for keyword‑only queries.
Used filter instead of query for non‑scoring lookups.
Offloaded sorting to the JVM.
Added routing keys to target specific shards.
These optimizations dramatically reduced CPU usage and query latency.
Member Redis Cache Scheme
Initially the system avoided caching due to real‑time consistency concerns, but a sudden traffic surge during a ticket‑blind‑box event prompted the introduction of a Redis cache with a 2‑second distributed lock to prevent stale data writes. The lock ensures that if Elasticsearch has not yet reflected a recent update, the cache is not refreshed, avoiding inconsistency.
Redis Dual‑Center Multi‑Cluster Architecture
Both data centers host a full Redis cluster. Writes are performed in both clusters (dual‑write) and succeed only when both acknowledge; reads are served locally to minimize latency. This design guarantees service continuity even if one data center fails.
High‑Availability Member Master‑DB Scheme
Member registration data moved from a single SqlServer instance to a MySQL dual‑center partitioned cluster (over 1,000 shards, 1‑master‑3‑slave per center). The master resides in data center A, slaves in B, with sub‑millisecond replication. Reads are routed locally, writes to the master, and failover promotes a slave to master if needed.
Seamless Migration from SqlServer to MySQL
The migration follows a “full sync → incremental sync → real‑time gray‑switch” approach. Real‑time dual‑write keeps both databases in sync; failures trigger retries and logging. Traffic is gradually shifted from SqlServer to MySQL using A/B testing, with automated result comparison to ensure consistency before each increase.
MySQL & ES Master‑Backup Cluster Scheme
To mitigate DAL component failures, we added an Elasticsearch write path alongside MySQL. If MySQL or DAL fails, reads/writes can be switched to ES, then synchronized back once MySQL recovers.
Abnormal Member Relationship Governance
We identified and fixed rare bugs that caused cross‑account binding errors, preventing scenarios where a user could see or manipulate another user's orders.
Future Fine‑Grained Flow‑Control and Degradation Strategies
We plan to implement hotspot limits for abusive accounts, per‑caller flow rules, and global throttling to protect the system from extreme traffic bursts. Degradation will be triggered based on average response time, error count, or error ratio, automatically circuit‑breaking affected services.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
