How We Built a High‑Availability Membership System for Billions of Users

This article details the design and implementation of a highly available, high‑performance membership platform serving over a billion users, covering Elasticsearch dual‑center clusters, traffic‑isolated clusters, deep ES optimizations, Redis caching strategies, MySQL dual‑center partitioning, seamless migration, and fine‑grained flow‑control and degradation mechanisms.

Open Source Linux
Open Source Linux
Open Source Linux
How We Built a High‑Availability Membership System for Billions of Users

Background

The membership system is a core service tightly coupled with the order flow of all business lines; any outage blocks user orders across the company. After the merger of Tongcheng and eLong, the system must support multiple platforms (apps, mini‑programs) and handle massive request volumes, reaching over 20,000 TPS during peak holidays.

Elasticsearch High‑Availability Solution

Dual‑Center Primary‑Backup Cluster

Two data centers (A and B) host an ES primary cluster in A and a backup cluster in B. All reads/writes go to the primary; data is synchronized to the backup via MQ. If the primary fails, traffic is switched to the backup with minimal downtime, then synchronized back once recovered.

Traffic‑Isolation Three‑Cluster Architecture

To protect the main ES cluster from marketing‑spike traffic, a separate ES cluster handles high‑TPS marketing requests, isolating them from the primary cluster that serves order‑critical queries.

Deep ES Optimizations

Balanced shard distribution to avoid hot nodes.

Thread‑pool size limited to cpu_core * 3 / 2 + 1 to prevent CPU thrashing.

Shard size kept under 50 GB.

Removed duplicate text fields, keeping only keyword to halve storage.

Used filter instead of query for non‑scoring lookups.

Moved result sorting to the member service JVM.

Added routing keys to direct queries to relevant shards.

These changes reduced CPU usage dramatically and cut query latency, as shown by the monitoring graphs.

Member Redis Cache Solution

Initially the system avoided caching due to real‑time consistency requirements. However, a sudden spike from a ticket‑blind‑box activity prompted the introduction of a Redis cache with a 2‑second distributed lock to avoid stale data.

When updating ES, the lock is acquired, the Redis entry is deleted, and concurrent queries respect the lock to prevent writing dirty data back to the cache.

Further analysis revealed a race condition where a query could update the cache after the lock was released; the fix was to make delete‑and‑update operations mutually exclusive.

After deployment, cache hit rates exceeded 90 % and ES load dropped significantly.

Redis Dual‑Center Multi‑Cluster Architecture

Two Redis clusters are deployed in data centers A and B. Writes are performed to both clusters; reads are served locally to minimize latency. This ensures service continuity even if one data center fails.

High‑Availability Member Master DB Solution

Member registration data resides in a relational database. The original SqlServer instance reached physical limits with >10 billion rows, prompting a migration to a dual‑center MySQL partitioned cluster (over 1,000 shards, 1‑master‑3‑slave per center). Data is routed via DBRoute to the master in center A for writes and to the local slave for reads.

Seamless Migration from SqlServer to MySQL

The migration follows a three‑stage approach: full data sync, real‑time dual‑write, and gray‑scale traffic shift. Dual‑write ensures both databases stay consistent; failures trigger retries and logging. Traffic is gradually shifted from SqlServer to MySQL using an A/B platform, with automated result comparison to detect inconsistencies.

Abnormal Member Relationship Governance

Complex bugs caused cross‑account binding errors (e.g., an APP account linked to another user’s WeChat account), leading to data leakage and erroneous order cancellations. A thorough audit identified these anomalies, and defensive logic was added at the code level to block such cases.

Fine‑Grained Flow‑Control and Degradation Strategies

Three layers of flow control are applied:

Hotspot control for accounts generating excessive requests (e.g., black‑market abuse).

Caller‑based limits to prevent a single service from overwhelming the member system.

Global limits that reject traffic exceeding the system’s 30k TPS capacity, preserving core functionality.

Degradation is triggered based on average response time, exception count, or exception ratio, automatically circuit‑breaking affected paths.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Elasticsearchhigh availabilityrediscachingmysql
Open Source Linux
Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.