
High‑Availability Architecture for a Large‑Scale Membership System

The article describes how a membership system serving more than a billion members across multiple platforms achieves high performance and high availability through dual‑center Elasticsearch clusters, a traffic‑isolated three‑cluster ES architecture, Redis caching with distributed locks, a dual‑center partitioned MySQL cluster, and fine‑grained flow‑control and degradation strategies.

Selected Java Interview Questions

Jul 15, 2023


1. Background

The membership system is a core service tightly coupled with the order flow of all business lines; any failure would block user ordering across the entire company, so it must guarantee high performance, high availability, and stable service.

After the merger of Tongcheng and eLong, the system had to unify member data across multiple platforms (Tongcheng APP, eLong APP, WeChat mini‑programs, etc.). Cross‑marketing, the order center, membership levels, mileage, coupons, and various marketing activities all query the unified member relationships, so request volume has grown rapidly, with peak TPS exceeding 20,000 during holidays. The article explains how the system maintains high performance and availability under such load.

2. Elasticsearch High‑Availability Solution

2.1 Dual‑Center Primary‑Backup ES Cluster Architecture

With over a billion members and multiple query dimensions (phone number, WeChat unionid, eLong card number, etc.), Elasticsearch (ES) is used to store unified member relationships. An ES cluster is highly available by design, but additional measures are needed to survive data‑center failures, hardware faults, and traffic spikes.

Two data centers (A and B) are deployed: the primary ES cluster in data center A and the backup ES cluster in data center B. Writes go to the primary cluster; data is synchronized to the backup via MQ. If the primary fails, configuration switches reads/writes to the backup cluster, achieving rapid failover. After recovery, data is synchronized back to the primary.

Within a single cluster, ES already handles a node crash by promoting a replica shard to primary, but that does not help when an entire data center goes down. The dual‑center architecture covers that case, since the backup cluster in data center B can take over reads and writes.
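
The primary/backup flow can be sketched in a few lines. This is a minimal illustration rather than the team's implementation: `EsCluster` is an in-memory stand-in for a real cluster, `queue.Queue` stands in for the MQ, and the `active` attribute plays the role of the configuration switch. All class and method names are hypothetical.

```python
import queue


class EsCluster:
    """In-memory stand-in for one ES cluster (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.docs[doc_id] = doc

    def get(self, doc_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.docs.get(doc_id)


class DualCenterStore:
    """Sends reads and writes to the active cluster and replicates via a queue."""

    def __init__(self, primary, backup):
        self.clusters = {"primary": primary, "backup": backup}
        self.active = "primary"   # the configuration switch
        self.mq = queue.Queue()   # stand-in for the message queue

    def _standby(self):
        return "backup" if self.active == "primary" else "primary"

    def write(self, doc_id, doc):
        self.clusters[self.active].index(doc_id, doc)
        # Replication to the other data center is asynchronous.
        self.mq.put((self._standby(), doc_id, doc))

    def read(self, doc_id):
        return self.clusters[self.active].get(doc_id)

    def drain_mq(self):
        """Apply queued changes to their target cluster; keep the ones that fail."""
        pending = []
        while not self.mq.empty():
            target, doc_id, doc = self.mq.get()
            try:
                self.clusters[target].index(doc_id, doc)
            except ConnectionError:
                pending.append((target, doc_id, doc))
        for item in pending:
            self.mq.put(item)

    def switch_to(self, name):
        self.active = name
```

Because changes written during an outage stay queued for the failed cluster, draining the queue after recovery is what synchronizes data back to the primary in this sketch.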

2.2 ES Traffic‑Isolation Three‑Cluster Architecture

After a massive traffic surge during a marketing event, the team realized that critical order‑flow requests had to be separated from high‑TPS marketing requests. Requests were classified into two priority groups: high‑priority order flow and lower‑priority marketing. A dedicated ES cluster was built to absorb marketing spikes, isolating them from the primary ES cluster that serves the order flow; together with the primary and backup clusters, this gives the three‑cluster layout.
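
A minimal sketch of the routing decision, assuming each request carries a caller name. The caller names, cluster names, and the rule that unknown callers are treated as marketing traffic are all assumptions made for illustration.

```python
from enum import Enum


class Priority(Enum):
    ORDER_FLOW = "order_flow"
    MARKETING = "marketing"


# Hypothetical caller-to-priority table; real entries would come from configuration.
CALLER_PRIORITY = {
    "order-center": Priority.ORDER_FLOW,
    "coupon-campaign": Priority.MARKETING,
}

# Hypothetical cluster names.
CLUSTER_FOR_PRIORITY = {
    Priority.ORDER_FLOW: "es-primary",
    Priority.MARKETING: "es-marketing",
}


def pick_cluster(caller: str) -> str:
    """Return the name of the ES cluster that should serve this caller's queries."""
    # Assumption: unknown callers are treated as low priority so that an
    # unclassified traffic spike cannot reach the order-flow cluster.
    priority = CALLER_PRIORITY.get(caller, Priority.MARKETING)
    return CLUSTER_FOR_PRIORITY[priority]
```

For example, `pick_cluster("order-center")` returns `"es-primary"`, while a campaign service is sent to `"es-marketing"`.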

2.3 Deep ES Cluster Optimizations

Performance issues during peak times were traced to the following factors, each of which was then corrected:

Uneven shard distribution causing hotspot nodes.

Thread‑pool size set too high, leading to CPU spikes.

Shard size exceeding 50 GB, slowing queries.

String fields stored as both text and keyword, doubling storage.

Using query instead of filter, incurring unnecessary scoring.

Sorting results in ES rather than in the member service JVM.

Missing routing keys, causing queries to hit all shards.

After these optimizations were applied, CPU usage dropped dramatically and query performance improved.

[Charts: ES cluster CPU usage and query performance after optimization]
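
Several of these points show up directly in how a mapping and a search request are written. The sketch below only builds plain dictionaries in the shape of the ES query DSL and makes no client calls. The index name, field names, and routing key are hypothetical.

```python
# Mapping fragment: identifiers are stored once, as `keyword`, with no `text` copy.
# Field names here are hypothetical.
MEMBER_MAPPING = {
    "properties": {
        "phone": {"type": "keyword"},
        "union_id": {"type": "keyword"},
    }
}


def build_search(index: str, field: str, value: str, routing_key: str) -> dict:
    """Describe a search that filters without scoring and targets one routing key."""
    return {
        "index": index,
        # `routing` limits the search to the shard(s) that key maps to; it only
        # helps if the documents were indexed with the same routing value.
        "routing": routing_key,
        "body": {
            # A `term` clause inside `bool.filter` runs in filter context,
            # so no relevance score is computed.
            "query": {"bool": {"filter": [{"term": {field: value}}]}},
        },
    }


def sort_in_service(hits: list, key: str) -> list:
    """Sort hits in the application rather than adding a `sort` to the query."""
    return sorted(hits, key=lambda hit: hit["_source"][key])
```

The returned dictionary keeps the request parameters (`index`, `routing`) separate from the query body, which is how most ES clients expect them to be passed.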

3. Member Redis Cache Scheme

Initially the system did not use caching because ES already provided sub‑5 ms latency at 30 k TPS. However, a blind‑box ticket promotion caused a sudden surge, prompting the introduction of a Redis cache.

3.1 Solving ES Near‑Real‑Time Delay Causing Redis Inconsistency

ES writes are near‑real‑time; a newly indexed document becomes searchable after about one second. During this window, a request may read stale data from ES and write it back to Redis, causing inconsistency.

The solution takes a 2‑second distributed lock in Redis for the member whenever its ES document is updated, then deletes the member’s cached entry. A read that misses the cache checks for this lock before writing its ES result back to Redis; while the lock is held, the result may still be stale, so the read returns it without caching it.

Further analysis revealed a remaining race: a query can check for the lock just before the update request takes it, find none, and then write its stale result to the cache after the update has already deleted the entry. Making the cache delete and the cache write mutually exclusive resolves this.

After deployment, cache hit rate exceeded 90 %, dramatically reducing ES load and improving overall performance.
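
The locking scheme can be sketched with in-memory stand-ins. This is one possible reading of the approach, not the team's code: `FakeRedis` imitates key expiry, a plain dict stands in for ES, and a `threading.Lock` stands in for whatever distributed mutual exclusion is used between the cache delete and the cache write. Key names are hypothetical.

```python
import threading
import time


class FakeRedis:
    """Tiny in-memory stand-in for Redis with key expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl=None):
        expires = self._clock() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and self._clock() >= expires:
            self._data.pop(key, None)
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)


class MemberCache:
    LOCK_TTL = 2.0  # seconds, longer than the roughly one-second ES refresh delay

    def __init__(self, redis, es):
        self.redis = redis
        self.es = es                    # dict standing in for ES
        self._mutex = threading.Lock()  # stand-in for a distributed mutex

    def update(self, member_id, doc):
        self.es[member_id] = doc
        with self._mutex:  # cache delete and cache write never interleave
            self.redis.set(f"lock:{member_id}", "1", ttl=self.LOCK_TTL)
            self.redis.delete(f"member:{member_id}")

    def read(self, member_id):
        cached = self.redis.get(f"member:{member_id}")
        if cached is not None:
            return cached
        doc = self.es.get(member_id)    # may be stale right after an update
        with self._mutex:
            # Only cache the result if no update is in its 2-second window.
            if doc is not None and self.redis.get(f"lock:{member_id}") is None:
                self.redis.set(f"member:{member_id}", doc)
        return doc
```

Because the lock check and the cache write happen under the same mutex as the lock set and the cache delete, a reader either sees the lock and skips the write, or finishes its write before the updater deletes the entry.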

3.2 Redis Dual‑Center Multi‑Cluster Architecture

Two Redis clusters are deployed, one in each data center. Writes are performed to both clusters (dual‑write); reads are served locally to reduce latency. If one data center fails, the other continues to provide full member services.

4. High‑Availability Member Primary‑DB Scheme

Member registration data originally lived in a single SQL Server instance, which reached its physical limits after storing more than a billion records. To avoid a catastrophic failure, the team migrated to a dual‑center partitioned MySQL cluster.

4.1 MySQL Dual‑Center Partition Cluster

More than a thousand shards were created, each holding roughly one million rows. The cluster uses a 1‑master‑3‑slave topology, with the master in data center A and slaves in data center B, synchronized over a dedicated line with sub‑millisecond latency. Reads are routed locally, writes go to the master, and failover promotes a slave to master if needed.

Stress tests showed >20 k TPS with average latency under 10 ms, meeting performance goals.
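
A sketch of how a request might be mapped to a shard and then to a node. The article does not state the sharding key or function, so the CRC32 modulo scheme, the shard count of 1024, and the host names below are assumptions chosen only to make the routing rules concrete.

```python
import zlib

SHARD_COUNT = 1024  # assumption: the article only says "more than a thousand"

# Hypothetical topology of one shard: master in data center A, slaves in B.
TOPOLOGY = {
    "master": {"host": "db-a1", "dc": "A"},
    "slaves": [
        {"host": "db-b1", "dc": "B"},
        {"host": "db-b2", "dc": "B"},
        {"host": "db-b3", "dc": "B"},
    ],
}


def shard_for(member_id: str) -> int:
    """Map a member ID to a shard index with a stable hash."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def pick_node(is_write: bool, local_dc: str, topology: dict) -> str:
    """Writes go to the master; reads go to a node in the caller's data center."""
    if is_write:
        return topology["master"]["host"]
    nodes = [topology["master"], *topology["slaves"]]
    local = [node for node in nodes if node["dc"] == local_dc]
    # Fall back to the master if the caller's data center has no node.
    return local[0]["host"] if local else topology["master"]["host"]
```

With this topology, a read issued from data center B is served by `db-b1`, while every write goes to `db-a1` regardless of where it originates.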

4.2 Smooth Migration from SQL Server to MySQL

The migration had three major challenges: zero‑downtime cutover, rewriting DAL code for numerous legacy interfaces, and ensuring real‑time synchronization of both historical and new data. The solution combined full data sync, incremental sync, and a gray‑scale traffic switch.

During the trial period, writes go to SQL Server while an asynchronous thread writes to MySQL; failures are retried three times, logged, and manually investigated if still failing. This guarantees MySQL eventually matches the authoritative SQL Server data.
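
The dual-write step might look roughly like this. The sketch assumes "retried three times" means one initial attempt plus up to three retries. The two writer callables, the pool size, and the logger name are hypothetical placeholders.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("member.migration")


def write_with_retry(write_mysql, record, retries=3):
    """Try the MySQL write once, then retry up to `retries` times."""
    last_error = None
    for _ in range(1 + retries):
        try:
            write_mysql(record)
            return True
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    # Still failing: leave a log entry for manual investigation.
    log.error("MySQL write failed after %d retries: %r (%s)", retries, record, last_error)
    return False


class DualWriter:
    def __init__(self, write_sqlserver, write_mysql):
        self.write_sqlserver = write_sqlserver
        self.write_mysql = write_mysql
        self.pool = ThreadPoolExecutor(max_workers=4)

    def save(self, record):
        # SQL Server stays authoritative, so its write is synchronous.
        self.write_sqlserver(record)
        # The MySQL write runs in the background and never blocks the caller.
        return self.pool.submit(write_with_retry, self.write_mysql, record)
```

`save` returns a future so a caller or test can wait for the background write, although the request path itself does not need to.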

Read traffic is gradually shifted from SQL Server to MySQL using an A/B platform, starting at 1 % and increasing after verification. Each request compares results from both databases; mismatches are logged for manual resolution before further traffic increase.
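
A sketch of the compare-and-switch read path. In the article the traffic split is handled by an A/B platform; here a hash bucket stands in for it. Returning the SQL Server result on a mismatch is an assumption, based on SQL Server being the authoritative store during the trial. Function and parameter names are hypothetical.

```python
import logging
import zlib

log = logging.getLogger("member.grayscale")


def in_gray_bucket(member_id: str, percent: int) -> bool:
    """Deterministically place a member in the first `percent` of 100 buckets."""
    return zlib.crc32(member_id.encode("utf-8")) % 100 < percent


def read_member(member_id, read_sqlserver, read_mysql, percent, mismatches):
    """Read from both stores, compare, and return (source, value)."""
    old = read_sqlserver(member_id)
    new = read_mysql(member_id)
    if old != new:
        # Record the mismatch for manual resolution and serve the
        # authoritative SQL Server value.
        mismatches.append(member_id)
        log.warning("mismatch for %s: sqlserver=%r mysql=%r", member_id, old, new)
        return ("sqlserver", old)
    if in_gray_bucket(member_id, percent):
        return ("mysql", new)
    return ("sqlserver", old)
```

Raising `percent` step by step, and only while `mismatches` stays empty, mirrors the verify-then-increase loop described above.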

[Figure: overall migration workflow]

4.3 MySQL and ES Primary‑Backup Cluster

To guard against DAL component failures, a dual‑write path to ES was added. If the DAL or MySQL becomes unavailable, reads/writes can be switched to ES, and once MySQL recovers, data is synchronized back.

5. Abnormal Member Relationship Governance

Incorrectly binding an APP account to another user's WeChat account can cause severe problems, such as one user viewing another user's orders or cancelling them without authorization. Complex logic was implemented to detect and remediate these anomalies, sealing the vulnerabilities at the code level.

6. Outlook: More Fine‑Grained Flow‑Control and Degradation Strategies

6.1 Fine‑Grained Flow‑Control

Hotspot control: limit requests from accounts that generate excessive traffic (e.g., black‑market abuse).

Per‑caller flow rules: prevent bugs in calling services from causing traffic spikes.

Global flow limits: cap total TPS at the system’s sustainable level (e.g., 30 k) and quickly reject excess traffic.
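
The three rule types above can be layered in one check. This sketch uses a simple fixed one-second window counter. The article does not say which limiting algorithm or component is planned, so the algorithm, the class names, and the limits in the example are assumptions.

```python
import time
from collections import defaultdict


class WindowLimiter:
    """Fixed one-second window counter keyed by an arbitrary string."""

    def __init__(self, limit, clock=time.monotonic):
        self.limit = limit
        self.clock = clock
        self.counts = defaultdict(int)
        self.window = None

    def allow(self, key="*"):
        now = int(self.clock())
        if now != self.window:  # a new second has started: reset all counters
            self.window = now
            self.counts.clear()
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


class FlowControl:
    def __init__(self, global_tps, per_caller_tps, per_account_tps, clock=time.monotonic):
        self.account = WindowLimiter(per_account_tps, clock)  # hotspot control
        self.caller = WindowLimiter(per_caller_tps, clock)    # per-caller rule
        self.total = WindowLimiter(global_tps, clock)         # global cap

    def allow(self, caller, account_id):
        # Most specific check first; reject as soon as one limit is hit.
        return (self.account.allow(account_id)
                and self.caller.allow(caller)
                and self.total.allow())
```

One simplification to be aware of: a request that passes the account check but is rejected by a later check still consumes account quota for that second.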

6.2 Fine‑Grained Degradation

Degrade based on the average response time of dependent services: if it stays above the threshold, trip the circuit breaker. Also degrade when the exception count or exception ratio exceeds the configured limit within a minute.
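
A minimal circuit breaker for the exception-ratio rule, as an illustration only. The thresholds, the minimum call count, and the cool-down period are hypothetical values. A response-time rule could be added the same way by recording call durations instead of failures.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, max_error_ratio=0.5, min_calls=10, window=60.0,
                 open_seconds=5.0, clock=time.monotonic):
        self.max_error_ratio = max_error_ratio
        self.min_calls = min_calls
        self.window = window            # statistics window in seconds
        self.open_seconds = open_seconds
        self.clock = clock
        self.events = deque()           # (timestamp, failed) pairs
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_seconds:
            self.opened_at = None       # cool-down over: let traffic through again
            self.events.clear()
            return True
        return False

    def record(self, failed):
        now = self.clock()
        self.events.append((now, failed))
        # Drop events that fell out of the statistics window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        total = len(self.events)
        errors = sum(1 for _, was_failure in self.events if was_failure)
        if total >= self.min_calls and errors / total > self.max_error_ratio:
            self.opened_at = now        # trip the breaker

    def call(self, func, fallback):
        if not self.allow():
            return fallback()           # degraded path while the breaker is open
        try:
            result = func()
        except Exception:
            self.record(True)
            return fallback()
        self.record(False)
        return result
```

A caller wraps each dependency call as `breaker.call(real_call, degraded_call)`, so the degraded path is used both for individual failures and while the breaker is open.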

Current pain point: managing the accounts that callers use to access the member system. Many developers keep using old accounts after moving to another department, so an account no longer maps to a single caller and precise flow‑control rules are hard to apply. The team plans to audit each account individually despite the large effort required.


