How We Built a Billion‑User High‑Availability Membership System with Dual‑Center ES, Redis, and MySQL

This article details the design and implementation of a high‑performance, highly available membership platform serving billions of users, covering dual‑center Elasticsearch clusters, traffic‑isolated three‑cluster ES architecture, Redis caching strategies, MySQL dual‑center partitioning, seamless migration, and fine‑grained flow‑control and degradation mechanisms.

Java High-Performance Architecture
Java High-Performance Architecture
Java High-Performance Architecture
How We Built a Billion‑User High‑Availability Membership System with Dual‑Center ES, Redis, and MySQL

Background

The membership system is a core service tightly coupled with the order flow of all business lines; any outage blocks user orders across the company. After the merger of Tongcheng and eLong, the system must support cross‑platform member relationships for multiple apps and mini‑programs, leading to massive request volumes (over 20k TPS during peak holidays).

ES High‑Availability Solution

We adopted a dual‑center primary‑backup Elasticsearch cluster: the primary cluster resides in data center A, the backup in data center B. Reads and writes go to the primary; data is synchronized to the backup via MQ. In case the primary fails, traffic is switched to the backup with minimal downtime, and later synchronized back.

ES Traffic‑Isolation Three‑Cluster Architecture

To protect the main order flow from traffic spikes (e.g., marketing flash‑sales), we classified request types into two groups: high‑priority order‑related requests and high‑TPS marketing requests. A dedicated ES cluster handles the marketing traffic, isolating it from the primary ES cluster.

Deep ES Optimization

Balanced shard distribution to avoid hot nodes.

Thread‑pool size limited to cpu_core * 3 / 2 + 1 to prevent CPU spikes.

Shard memory kept under 50 GB.

Removed duplicate text fields for keyword‑only queries.

Used filter instead of query for non‑scoring lookups.

Offloaded sorting to the JVM.

Added routing keys to target specific shards.

These optimizations dramatically reduced CPU usage and query latency.

Member Redis Cache Scheme

Initially the system avoided caching due to real‑time consistency concerns, but a sudden traffic surge during a ticket‑blind‑box event prompted the introduction of a Redis cache with a 2‑second distributed lock to prevent stale data writes. The lock ensures that if Elasticsearch has not yet reflected a recent update, the cache is not refreshed, avoiding inconsistency.

Redis Dual‑Center Multi‑Cluster Architecture

Both data centers host a full Redis cluster. Writes are performed in both clusters (dual‑write) and succeed only when both acknowledge; reads are served locally to minimize latency. This design guarantees service continuity even if one data center fails.

High‑Availability Member Master‑DB Scheme

Member registration data moved from a single SqlServer instance to a MySQL dual‑center partitioned cluster (over 1,000 shards, 1‑master‑3‑slave per center). The master resides in data center A, slaves in B, with sub‑millisecond replication. Reads are routed locally, writes to the master, and failover promotes a slave to master if needed.

Seamless Migration from SqlServer to MySQL

The migration follows a “full sync → incremental sync → real‑time gray‑switch” approach. Real‑time dual‑write keeps both databases in sync; failures trigger retries and logging. Traffic is gradually shifted from SqlServer to MySQL using A/B testing, with automated result comparison to ensure consistency before each increase.

MySQL & ES Master‑Backup Cluster Scheme

To mitigate DAL component failures, we added an Elasticsearch write path alongside MySQL. If MySQL or DAL fails, reads/writes can be switched to ES, then synchronized back once MySQL recovers.

Abnormal Member Relationship Governance

We identified and fixed rare bugs that caused cross‑account binding errors, preventing scenarios where a user could see or manipulate another user's orders.

Future Fine‑Grained Flow‑Control and Degradation Strategies

We plan to implement hotspot limits for abusive accounts, per‑caller flow rules, and global throttling to protect the system from extreme traffic bursts. Degradation will be triggered based on average response time, error count, or error ratio, automatically circuit‑breaking affected services.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Backend EngineeringSystem ArchitectureScalable DesignElasticsearchhigh availabilityredismysql
Java High-Performance Architecture
Written by

Java High-Performance Architecture

Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.