High‑Availability Architecture for a Billion‑Scale Membership System
This article details how a large‑scale membership platform achieves high performance and fault tolerance through dual‑center Elasticsearch clusters, traffic‑isolated architectures, Redis caching with distributed locks, sharded MySQL migration, and fine‑grained flow‑control and degradation strategies.
Background: the membership system is a core service for order processing; after the merger of two companies, it must provide a unified member relationship across multiple apps and mini‑programs, handling billions of users and peak TPS exceeding 20,000.
ES High‑Availability Solution
Introduces a dual‑center primary‑backup Elasticsearch cluster deployed in two data centers (A and B). The primary cluster serves reads/writes, while updates are synchronized to the backup via MQ; failover switches traffic to the backup instantly if the primary fails.
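The failover behavior described above can be sketched as a client-side router that trips over to the backup cluster on the first connection failure. This is a minimal illustration, not the article's actual implementation; the callables standing in for the two clusters and the health flag are assumptions.

```python
# Sketch of primary/backup failover routing for the dual-center ES
# clusters. `primary` and `backup` stand in for real search clients.
class FailoverRouter:
    def __init__(self, primary, backup):
        self.primary = primary        # callable: query -> result
        self.backup = backup
        self.primary_healthy = True

    def search(self, query):
        if self.primary_healthy:
            try:
                return self.primary(query)
            except ConnectionError:
                self.primary_healthy = False   # trip to backup instantly
        return self.backup(query)              # serve from data center B
```

A real deployment would also need a health-check loop to fail back once the primary recovers, and the MQ-based replication keeps the backup close enough to serve reads.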
Also presents a separate ES cluster dedicated to high‑TPS marketing activities to isolate traffic spikes.
ES Cluster Deep Optimization
Identifies several performance bottlenecks:
Uneven shard distribution causing hotspot nodes.
Thread‑pool size set too high, leading to CPU spikes.

Individual shards exceeding 100 GB; the recommendation is to keep each shard at or below 50 GB.
String fields indexed as both text and keyword, doubling storage.
Using query instead of filter for non‑scoring lookups.
Sorting result sets inside ES rather than in the application's JVM.
Missing routing keys causing unnecessary shard queries.
After applying these optimizations, CPU usage dropped dramatically and query latency improved.
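Two of the fixes above can be shown as ES request bodies: using a filter context (no relevance scoring, cacheable) instead of a query context, and passing a routing key so only one shard is searched. The index and field names here are illustrative, not taken from the article.

```python
# Build a non-scoring member lookup. "filter" inside a bool query skips
# relevance scoring and lets ES cache the clause, unlike "must".
def member_lookup(card_no: str) -> dict:
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"card_no": card_no}}]
            }
        }
    }

# The search call would then pass the same value as the routing key so
# ES queries a single shard instead of fanning out to all of them, e.g.:
#   es.search(index="member", routing=card_no, body=member_lookup(card_no))
```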
Redis Caching Strategy
Initially the system avoided caching due to near‑real‑time delay of ES (≈1 s) which could cause stale data in Redis. The solution adds a 2‑second distributed lock when updating ES, then deletes the related Redis entry; concurrent queries acquire the lock, detect the pending ES update, and skip cache writes, preventing inconsistency.
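The lock-then-delete scheme can be sketched as follows, with a plain dict standing in for Redis (a real setup would use SET NX PX for the lock). The key names and the 2-second TTL mirror the description above; everything else is illustrative.

```python
import time

cache, locks = {}, {}   # stand-ins for Redis cache keys and lock keys

def acquire_lock(key, ttl=2.0):
    """Best-effort expiring lock; a stand-in for Redis SET NX PX."""
    now = time.monotonic()
    if locks.get(key, 0) > now:
        return False
    locks[key] = now + ttl
    return True

def lock_held(key):
    return locks.get(key, 0) > time.monotonic()

def on_member_update(member_id, write_es):
    acquire_lock(f"lock:{member_id}")   # suppress cache fills for ~2 s
    write_es(member_id)                 # ES applies it within ~1 s
    cache.pop(member_id, None)          # drop the possibly stale entry

def read_member(member_id, query_es):
    if member_id in cache:
        return cache[member_id]
    value = query_es(member_id)
    # Only cache when no update is in flight; otherwise the ~1 s ES
    # refresh delay could freeze a stale value into Redis.
    if not lock_held(f"lock:{member_id}"):
        cache[member_id] = value
    return value
```

The 2-second window only needs to outlast ES's near-real-time refresh (~1 s), which is why such a short lock suffices.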
Redis Dual‑Center Multi‑Cluster Architecture
Deploys active‑active Redis clusters in data centers A and B. Writes are performed to both clusters; reads are served locally to minimize latency. If one data center fails, the other continues to provide full member services.
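In outline, the active-active pattern is dual writes with local reads. The two dicts below stand in for the A and B Redis clusters; a real deployment would hold two Redis clients and handle partial write failures.

```python
cluster_a, cluster_b = {}, {}   # stand-ins for the two Redis clusters

def write(key, value):
    for cluster in (cluster_a, cluster_b):   # dual write to both centers
        cluster[key] = value

def read(key, local=cluster_a):
    return local.get(key)                    # local read, minimal latency
```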
MySQL Dual‑Center Partition Cluster
Shards the >10 billion member records into over 1,000 shards (roughly 10 million rows each). Uses a 1‑master‑3‑slave topology with the master in data center A and slaves in B, synchronized over a dedicated link with <1 ms latency. DBRoute routes writes to the master and reads to the local data center, achieving >20,000 TPS and ~10 ms average latency.
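A DBRoute-style layer could pick a target along these lines: hash the member key onto one of ~1,000 partitions, then send writes to the master in data center A and reads to the local replica. The hash choice, shard count, and naming scheme here are assumptions; the article does not describe DBRoute's internals.

```python
import hashlib

SHARDS = 1000   # ~1,000 partitions, per the article

def shard_of(member_id: str) -> int:
    """Deterministically map a member key onto a shard number."""
    digest = hashlib.md5(member_id.encode()).hexdigest()
    return int(digest, 16) % SHARDS

def route(member_id: str, op: str, local_dc: str = "B") -> str:
    shard = shard_of(member_id)
    dc = "A" if op == "write" else local_dc   # writes go to the master
    return f"mysql-{dc}-shard-{shard:04d}"
```

Because reads go to a same-center replica and writes cross the <1 ms dedicated link at most once, the ~10 ms average latency figure is plausible even under cross-center operation.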
Seamless Migration from SqlServer to MySQL
Describes a zero‑downtime migration strategy comprising full data sync, incremental sync, and real‑time dual‑write. During trial runs, writes go to SqlServer while an asynchronous thread writes to MySQL with retries and logging. After validation, read traffic is gradually shifted from SqlServer to MySQL via gray release and A/B testing, with automated result comparison to ensure consistency.
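The trial-run dual-write stage can be sketched as: SqlServer remains the synchronous source of truth, while MySQL writes are queued and replayed asynchronously with retries and logging. The function names and queue-based shape are illustrative assumptions.

```python
import logging
import queue

log = logging.getLogger("migration")
mysql_queue: queue.Queue = queue.Queue()

def save_member(record, write_sqlserver):
    write_sqlserver(record)   # synchronous write; still the source of truth
    mysql_queue.put(record)   # enqueue the shadow write to MySQL

def drain_mysql_queue(write_mysql, retries=3):
    """Worker side: replay queued records into MySQL, retrying on failure."""
    while not mysql_queue.empty():
        record = mysql_queue.get()
        for attempt in range(retries + 1):
            try:
                write_mysql(record)
                break
            except Exception:
                log.warning("MySQL shadow write retry %d", attempt + 1)
```

Because the MySQL path is asynchronous, a MySQL outage during the trial run delays the shadow copy but never blocks the live SqlServer write, which is what makes the cutover risk-free to rehearse.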
MySQL and ES Primary‑Backup Cluster
Provides a fallback path: if the DAL component or MySQL fails, reads are switched to Elasticsearch; once MySQL recovers, data is resynced and traffic switched back.
Abnormal Member Relationship Governance
Identifies and rectifies cases where user accounts become incorrectly bound across platforms, preventing cross‑account data leakage and erroneous order cancellations.
Future Fine‑Grained Flow Control and Degradation
Proposes hotspot throttling for abusive accounts, per‑account flow rules to curb buggy client bursts, and global rate limiting to protect the system from extreme traffic spikes. Also outlines degradation strategies based on average response time and error rates, and discusses challenges in managing caller accounts for precise control.
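The proposed per-account flow rules amount to a rate limiter keyed by caller account. A token bucket is one common way to implement that; the rates and bucket sizes below are assumptions, not figures from the article.

```python
import time

class AccountLimiter:
    """Per-account token bucket: steady `rate` req/s with `burst` headroom."""

    def __init__(self, rate=100.0, burst=200.0):
        self.rate, self.burst = rate, burst
        self.buckets = {}   # account -> (tokens, last timestamp)

    def allow(self, account, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(account, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens < 1.0:
            self.buckets[account] = (tokens, now)
            return False          # throttle this caller only
        self.buckets[account] = (tokens - 1.0, now)
        return True
```

Keying the buckets by account is what makes the control "fine-grained": a buggy client or abusive hotspot exhausts only its own bucket, while other callers and the global limit stay unaffected.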