High‑Availability Architecture for a Billion‑Scale Membership System: Elasticsearch Dual‑Center Cluster, Redis Caching, and MySQL Migration

This article describes how a membership platform serving more than a billion members achieves high performance and fault tolerance. It covers a dual‑center primary‑backup Elasticsearch cluster, a third ES cluster that isolates marketing traffic, multi‑center Redis caching, and a seamless migration from SQL Server to a partitioned MySQL architecture, along with the operational safeguards and fine‑grained flow‑control strategies around them.

Code Ape Tech Column

Hello everyone, I am Chen (the author). The membership system is a core service tightly coupled with the order flow of all business lines; any failure blocks ordering across the whole company, so it must be high‑performance, highly available, and stable.

After the merger of Tongcheng and eLong, the system must support multiple platforms (Tongcheng APP, eLong APP, WeChat mini‑programs, etc.) and provide unified member relationships for cross‑marketing scenarios such as sending hotel coupons to a train‑ticket buyer.

Request volume and concurrency have grown dramatically (TPS exceeded 20,000 during the Qingming holiday), so this article focuses on how the system maintains high performance and availability.

Elasticsearch High‑Availability Solution

Dual‑Center Primary‑Backup ES Cluster Architecture

Member data (over one billion records) is stored in Elasticsearch (ES) to support diverse query dimensions (phone, WeChat unionid, eLong card number, etc.).

ES itself provides HA: if a node fails, its replica shard is promoted to primary. However, a single‑datacenter failure (power outage, hardware crash, traffic spike) would still cause service disruption.

The proposed solution deploys a primary ES cluster in Data Center A and a backup ES cluster in Data Center B. All reads/writes go to the primary; data is synchronized to the backup via MQ. In case of primary failure, configuration switches reads/writes to the backup instantly, and after recovery the data is synced back.
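The switch‑over can be pictured as a small router whose active cluster is driven by configuration. The following is a minimal Python sketch under stated assumptions, not the team's implementation: `DualCenterRouter`, the cluster names, and the in‑memory `mq` list standing in for the message queue are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DualCenterRouter:
    """Sends all traffic to the active cluster and queues writes for the other."""
    active: str = "es-primary-dc-a"          # hypothetical cluster names
    standby: str = "es-backup-dc-b"
    mq: list = field(default_factory=list)   # stand-in for the real message queue

    def write(self, doc_id: str, doc: dict) -> str:
        # The write is served by the active cluster; the change is also
        # published so the standby cluster can replay it asynchronously.
        self.mq.append((self.standby, doc_id, doc))
        return self.active

    def read(self, doc_id: str) -> str:
        return self.active

    def fail_over(self) -> None:
        # A configuration change swaps the roles; callers are unaffected.
        self.active, self.standby = self.standby, self.active


router = DualCenterRouter()
served_by_before = router.write("member-1", {"phone": "13800000000"})
router.fail_over()
served_by_after = router.read("member-1")
```

After `fail_over()`, new writes are queued for the former primary, which is one way to picture the "sync back after recovery" step.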

Three‑Cluster ES Traffic Isolation Architecture

A large marketing campaign once caused each user request to call the member system more than 10 times, pushing the ES cluster close to its TPS limit. To protect the order‑critical path, callers are classified into two groups: high‑priority (order‑related) and low‑priority (marketing). A dedicated third ES cluster handles the high‑TPS marketing traffic, isolating it from the primary cluster so that a marketing surge cannot slow down ordering.
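A rough sketch of that classification in Python follows. The caller IDs and cluster names are invented for illustration; the article does not say how callers are actually identified.

```python
# Hypothetical caller IDs and cluster names, for illustration only.
HIGH_PRIORITY_CALLERS = {"order-service", "payment-service"}

PRIMARY_CLUSTER = "es-primary"       # order-critical traffic
MARKETING_CLUSTER = "es-marketing"   # high-TPS marketing traffic


def pick_cluster(caller_id: str) -> str:
    """Route order-related callers to the primary cluster, all others to marketing."""
    if caller_id in HIGH_PRIORITY_CALLERS:
        return PRIMARY_CLUSTER
    return MARKETING_CLUSTER
```

Defaulting unknown callers to the marketing cluster is a design choice of this sketch: a caller that has not been classified can never add load to the order path.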

Deep ES Cluster Optimizations

Key optimizations include:

Balancing shard distribution to avoid hot nodes.

Setting thread‑pool size to cpu_cores * 3 / 2 + 1 to prevent CPU thrashing.

Limiting shard size to ≤50 GB.

Dropping unnecessary text field mappings and keeping only keyword fields, since member lookups are exact matches.

Using filter context instead of query context for lookups that do not need relevance scoring.

Sorting results in the member service JVM instead of in ES.

Adding routing keys to target specific shards.
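Two of the items above are easy to make concrete. This Python sketch computes the thread‑pool size from the stated formula and builds an exact‑match lookup that uses filter context plus a routing value. It only constructs plain dictionaries and does not call any ES client; the `phone` field name and the helper names are assumptions.

```python
def search_pool_size(cpu_cores: int) -> int:
    """Thread-pool size from the rule above: cpu_cores * 3 / 2 + 1 (integer division)."""
    return cpu_cores * 3 // 2 + 1


def member_lookup_request(phone: str, routing_key: str) -> dict:
    """Build an exact-match lookup that skips scoring and targets one shard."""
    return {
        # `routing` travels as a request parameter, not inside the body.
        "params": {"routing": routing_key},
        "body": {
            "query": {
                "bool": {
                    # Filter context: no relevance score is computed.
                    # Assumes `phone` is mapped as a keyword field.
                    "filter": [{"term": {"phone": phone}}]
                }
            }
        },
    }


pool_16_cores = search_pool_size(16)
request = member_lookup_request("13800000000", routing_key="13800000000")
```

For a 16‑core node the formula gives 16 * 3 / 2 + 1 = 25 threads.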

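After these changes, CPU usage dropped dramatically and query performance improved: average latency is about 5 ms, and throughput exceeds 30k TPS (reported in the source as the "99‑line", i.e. 99th‑percentile, figure).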

Member Redis Caching Scheme

Initially the system avoided caching because ES performance was sufficient and data consistency was critical. However, a flash‑sale activity exposed the need for a cache to absorb sudden spikes.

Resolving ES‑to‑Redis Inconsistency (≈1 s Delay)

ES writes are near‑real‑time; a newly indexed document becomes searchable after ~1 s. During this window, a request may read stale data from ES and write it into Redis, causing inconsistency.

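The fix works as follows: when updating ES, the service takes a 2‑second distributed lock in Redis for that member and deletes the member’s cache entry. Until the lock expires, a concurrent read may still return what it fetched from ES, but it is not allowed to write that result into the cache.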

Additional concurrency analysis revealed a race condition between “delete cache” and “update cache”. Making these operations mutually exclusive eliminated the remaining inconsistency.
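The locking idea can be sketched in Python with an in‑memory stand‑in for Redis. Everything here is hypothetical (`FakeRedis`, the key names, the two dictionaries that imitate ES before and after a refresh), and the sketch is single‑threaded, so it shows the lock window but not the atomicity a real deployment would need.

```python
import time


class FakeRedis:
    """Tiny in-memory stand-in for Redis with key expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires = self.clock() + ttl if ttl is not None else None
        self.data[key] = (value, expires)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        if entry[1] is not None and entry[1] <= self.clock():
            del self.data[key]  # expired
            return None
        return entry[0]

    def delete(self, key):
        self.data.pop(key, None)


LOCK_TTL_SECONDS = 2  # longer than the ~1 s ES refresh delay


def update_member(redis, es_pending, member_id, doc):
    # Take the lock first so readers stop repopulating the cache,
    # then write ES and drop the now-stale cache entry.
    redis.set(f"lock:{member_id}", "1", ttl=LOCK_TTL_SECONDS)
    es_pending[member_id] = doc
    redis.delete(f"member:{member_id}")


def read_member(redis, es_visible, member_id):
    cached = redis.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es_visible.get(member_id)  # may still be the old version
    if redis.get(f"lock:{member_id}") is None:
        redis.set(f"member:{member_id}", doc)  # no update in flight: safe to cache
    return doc


now = [0.0]                              # fake clock so the demo needs no sleeping
redis = FakeRedis(clock=lambda: now[0])
es_visible = {"m1": {"level": "old"}}    # what a search sees right now
es_pending = {}                          # written but not yet searchable

update_member(redis, es_pending, "m1", {"level": "new"})
stale = read_member(redis, es_visible, "m1")   # inside the lock window
cached_during_lock = redis.get("member:m1")

now[0] = 2.5                                   # lock expired, ES refreshed
es_visible.update(es_pending)
fresh = read_member(redis, es_visible, "m1")
cached_after_lock = redis.get("member:m1")
```

During the lock window the stale document is returned but never cached; once the lock expires the fresh document is cached as usual. In production the check‑then‑set in `read_member` would have to be atomic with respect to the delete, which is the mutual exclusion mentioned above.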

Post‑implementation cache hit rate exceeds 90 %, dramatically reducing ES load.

Redis Dual‑Center Multi‑Cluster Architecture

Two Redis clusters are deployed in Data Centers A and B. Writes are performed on both; reads are served locally. If one datacenter fails, the other continues to provide full member services.
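A minimal Python sketch of "write both, read local" follows, with plain dictionaries standing in for the two Redis clusters. The class name and datacenter labels are assumptions, and error handling for a failed write to one side is deliberately left out.

```python
class DualCenterCache:
    """Writes to every datacenter's cluster, reads from the local one."""

    def __init__(self, clusters: dict, local_dc: str):
        self.clusters = clusters   # dc name -> dict standing in for a Redis cluster
        self.local_dc = local_dc

    def set(self, key, value):
        for store in self.clusters.values():
            store[key] = value     # dual write

    def get(self, key):
        return self.clusters[self.local_dc].get(key)


dc_a, dc_b = {}, {}
cache_in_a = DualCenterCache({"A": dc_a, "B": dc_b}, local_dc="A")
cache_in_b = DualCenterCache({"A": dc_a, "B": dc_b}, local_dc="B")
cache_in_a.set("member:1", "gold")
value_seen_in_b = cache_in_b.get("member:1")
```

Because every write lands in both clusters, a service instance in either datacenter can keep serving reads from its local copy if the other datacenter goes down.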

High‑Availability Member Primary‑Database Scheme

Member registration details reside in a relational database. SQL Server reached its physical limits once it held more than 1 billion rows, prompting the migration to MySQL.

MySQL Dual‑Center Partitioned Cluster

The data is sharded into >1,000 partitions (~1 million rows each). The cluster uses a 1‑master‑3‑slave topology: master in Data Center A, slaves in Data Center B, synchronized over a dedicated line with <1 ms latency. DBRoute routes writes to the master and reads to the local datacenter.
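The article does not say how rows are assigned to partitions or how DBRoute is configured, so the following Python sketch is purely illustrative: it assumes a modulo scheme on a numeric member ID, a shard count of 1,024, and made‑up node labels.

```python
SHARD_COUNT = 1024  # assumption: the article only says "more than 1,000"


def shard_for(member_id: int) -> int:
    """Assumed scheme: member_id modulo the shard count."""
    return member_id % SHARD_COUNT


def route(operation: str, caller_dc: str) -> str:
    """Writes go to the master in DC A; reads stay in the caller's datacenter."""
    if operation == "write":
        return "master@A"
    # Per the topology above, DC A hosts the master and DC B hosts the replicas.
    return "master@A" if caller_dc == "A" else "replica@B"


shard = shard_for(2049)
write_target = route("write", caller_dc="B")
read_target = route("read", caller_dc="B")
```

With roughly 1,000 partitions of about 1 million rows each, the scheme covers on the order of 1 billion rows, which matches the scale of the member table.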

Stress tests show >20 k TPS with average latency <10 ms.

Seamless Migration from SQL Server to MySQL

Migration challenges include zero‑downtime cut‑over, handling numerous legacy interfaces, and guaranteeing data consistency. The solution combines full data sync, incremental sync, and real‑time traffic gray‑release.

During the trial phase, writes go to SQL Server while an asynchronous thread writes to MySQL; failures are retried up to three times, logged, and manually investigated.
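The dual‑write step might look roughly like this in Python. The function names and the callables passed in are hypothetical, and "retried up to three times" is read here as one initial try plus three retries.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("dual_write")
MAX_RETRIES = 3  # one initial try plus up to three retries
executor = ThreadPoolExecutor(max_workers=4)


def write_mysql_with_retry(write_mysql, record) -> bool:
    for attempt in range(1 + MAX_RETRIES):
        try:
            write_mysql(record)
            return True
        except Exception as exc:
            log.warning("MySQL write failed (try %d of %d): %s",
                        attempt + 1, 1 + MAX_RETRIES, exc)
    # Out of retries: leave a log entry for manual investigation.
    log.error("giving up on record %r", record)
    return False


def dual_write(write_sqlserver, write_mysql, record):
    write_sqlserver(record)  # synchronous: SQL Server is still the source of truth
    # The MySQL write runs on a background thread and never blocks the caller.
    return executor.submit(write_mysql_with_retry, write_mysql, record)
```

Returning the future lets a caller or a test wait for the outcome, while the request path itself only depends on the SQL Server write.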

Read traffic is gradually shifted from SQL Server to MySQL using A/B testing (1 % → … → 100 %). A verification step compares query results from both databases; mismatches are logged for manual resolution before further traffic increase.
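A gray‑release read path with verification could be sketched as below. This is an assumption‑heavy illustration: bucketing by a CRC32 hash of the member ID, falling back to the SQL Server result on a mismatch, and the `mismatches` list standing in for the log are all choices of this sketch, not details from the article.

```python
import zlib

mismatches = []  # stand-in for the mismatch log


def in_mysql_bucket(member_id: str, percent: int) -> bool:
    """Stable bucketing: the same member always lands in the same bucket."""
    return zlib.crc32(member_id.encode()) % 100 < percent


def gray_read(member_id, read_sqlserver, read_mysql, percent):
    old = read_sqlserver(member_id)
    if not in_mysql_bucket(member_id, percent):
        return old
    new = read_mysql(member_id)
    if new != old:
        # Record the difference for manual resolution and serve the old value.
        mismatches.append((member_id, old, new))
        return old
    return new
```

Raising `percent` from 1 to 100 moves more members onto the MySQL path, and the mismatch log shows whether it is safe to take the next step.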

The overall flow includes a night‑time full data sync, enabling dual‑write, performing incremental sync, monitoring logs, and finally switching 100 % of reads/writes to MySQL.

MySQL & ES Primary‑Backup Cluster

To guard against failures of the data access layer (DAL) component, ES is kept in sync as a secondary data source. If the DAL or MySQL fails, reads and writes are switched to ES until MySQL recovers; the data is then resynced to MySQL and traffic is switched back.

Abnormal Member Relationship Governance

Beyond availability, data correctness is vital. A distributed concurrency bug once bound an APP account to another user’s WeChat mini‑program account, exposing orders across users and causing severe complaints.

Complex detection logic was implemented to identify such anomalies, and the member API was hardened to prevent future occurrences.

Future Directions: Finer‑Grained Flow‑Control and Degradation

More Precise Flow‑Control

Three kinds of limits are planned: hotspot limiting for accounts that generate massive duplicate requests, per‑caller flow rules to curb buggy client code, and a global threshold that rejects traffic beyond the system’s 30k TPS capacity.
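The three layers can be sketched with simple per‑second counters. This Python example is a hypothetical illustration (fixed one‑second windows, invented class and parameter names), not a description of the flow‑control component actually used.

```python
from collections import Counter


class FlowControl:
    """Fixed one-second window counters for account, caller, and global limits."""

    def __init__(self, account_limit, caller_limits, global_limit):
        self.account_limit = account_limit
        self.caller_limits = caller_limits   # caller id -> requests per second
        self.global_limit = global_limit
        self.window = None
        self.counts = Counter()

    def allow(self, now: float, caller: str, account: str) -> bool:
        window = int(now)
        if window != self.window:            # new second: reset all counters
            self.window, self.counts = window, Counter()
        checks = [
            (("account", account), self.account_limit),
            # Callers without a configured rule get a limit of 0 (rejected).
            (("caller", caller), self.caller_limits.get(caller, 0)),
            (("global",), self.global_limit),
        ]
        if any(self.counts[key] >= limit for key, limit in checks):
            return False
        for key, _ in checks:
            self.counts[key] += 1
        return True


fc = FlowControl(account_limit=2, caller_limits={"app": 5}, global_limit=30000)
hot_account = [fc.allow(0.1, "app", "u1") for _ in range(3)]
```

Here the third request from the same account within one second is rejected by the hotspot rule, while the global limit mirrors the 30k TPS capacity mentioned above.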

More Precise Degradation

Calls to dependent services will be degraded based on their average response time and on their exception count or rate. Account‑based governance is also being refined so that each caller’s usage scenario is known and properly throttled.
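One way to picture such a degradation rule is a sliding window of recent calls. The Python sketch below is illustrative only: the class name, thresholds, and window size are assumptions, and it covers the response‑time and exception‑rate triggers but not recovery.

```python
from collections import deque


class Degrader:
    """Trips when recent calls are too slow on average or fail too often."""

    def __init__(self, max_avg_ms, max_error_rate, window=100):
        self.max_avg_ms = max_avg_ms
        self.max_error_rate = max_error_rate
        self.samples = deque(maxlen=window)   # (latency_ms, failed)

    def record(self, latency_ms: float, failed: bool = False) -> None:
        self.samples.append((latency_ms, failed))

    def should_degrade(self) -> bool:
        if not self.samples:
            return False
        total = len(self.samples)
        avg_ms = sum(ms for ms, _ in self.samples) / total
        error_rate = sum(1 for _, failed in self.samples if failed) / total
        return avg_ms > self.max_avg_ms or error_rate > self.max_error_rate


slow = Degrader(max_avg_ms=50, max_error_rate=0.5, window=4)
for latency in (100, 100, 100, 100):
    slow.record(latency)
slow_trips = slow.should_degrade()
```

A caller would check `should_degrade()` before invoking the dependency and return a fallback response when it is true.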

Source: Tongcheng‑eLong

Written by Code Ape Tech Column: former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
