High‑Availability Architecture for a Billion‑Scale Membership System
This article details how a large‑scale membership platform achieves high performance and fault tolerance through dual‑center Elasticsearch clusters, traffic‑isolated architectures, Redis caching with distributed locks, sharded MySQL migration, and fine‑grained flow‑control and degradation strategies.
Background: the membership system is a core service for order processing; after the merger of two companies, it must provide a unified member relationship across multiple apps and mini‑programs, handling billions of users and peak TPS exceeding 20,000.
ES High‑Availability Solution
Introduces a dual‑center primary‑backup Elasticsearch cluster deployed in two data centers (A and B). The primary cluster serves reads/writes, while updates are synchronized to the backup via MQ; failover switches traffic to the backup instantly if the primary fails.
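The failover behavior described above can be sketched as a client-side router that trips over to the backup cluster on the first connection failure. This is a minimal illustration, not the article's actual implementation; the callables standing in for the two clusters and the health flag are assumptions.

```python
# Sketch of primary/backup failover routing for the dual-center ES
# clusters. `primary` and `backup` stand in for real search clients.
class FailoverRouter:
    def __init__(self, primary, backup):
        self.primary = primary        # callable: query -> result
        self.backup = backup
        self.primary_healthy = True

    def search(self, query):
        if self.primary_healthy:
            try:
                return self.primary(query)
            except ConnectionError:
                self.primary_healthy = False   # trip to backup instantly
        return self.backup(query)              # serve from data center B
```

A real deployment would also need a health-check loop to fail back once the primary recovers, and the MQ-based replication keeps the backup close enough to serve reads.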
Also presents a separate ES cluster dedicated to high‑TPS marketing activities to isolate traffic spikes.
ES Cluster Deep Optimization
Identifies several performance bottlenecks:
Uneven shard distribution causing hotspot nodes.
Thread‑pool size set too high, leading to CPU spikes.

Individual shards exceeding 100 GB; the recommendation is to keep each shard at or below 50 GB.
String fields indexed as both text and keyword, doubling storage.
Using query instead of filter for non‑scoring lookups.
Sorting result sets inside ES rather than in the application's JVM.
Missing routing keys causing unnecessary shard queries.
After applying these optimizations, CPU usage dropped dramatically and query latency improved.
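Two of the fixes above can be shown as ES request bodies: using a filter context (no relevance scoring, cacheable) instead of a query context, and passing a routing key so only one shard is searched. The index and field names here are illustrative, not taken from the article.

```python
# Build a non-scoring member lookup. "filter" inside a bool query skips
# relevance scoring and lets ES cache the clause, unlike "must".
def member_lookup(card_no: str) -> dict:
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"card_no": card_no}}]
            }
        }
    }

# The search call would then pass the same value as the routing key so
# ES queries a single shard instead of fanning out to all of them, e.g.:
#   es.search(index="member", routing=card_no, body=member_lookup(card_no))
```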
Redis Caching Strategy
Initially the system avoided caching due to near‑real‑time delay of ES (≈1 s) which could cause stale data in Redis. The solution adds a 2‑second distributed lock when updating ES, then deletes the related Redis entry; concurrent queries acquire the lock, detect the pending ES update, and skip cache writes, preventing inconsistency.
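The lock-then-delete scheme can be sketched as follows, with a plain dict standing in for Redis (a real setup would use SET NX PX for the lock). The key names and the 2-second TTL mirror the description above; everything else is illustrative.

```python
import time

cache, locks = {}, {}   # stand-ins for Redis cache keys and lock keys

def acquire_lock(key, ttl=2.0):
    """Best-effort expiring lock; a stand-in for Redis SET NX PX."""
    now = time.monotonic()
    if locks.get(key, 0) > now:
        return False
    locks[key] = now + ttl
    return True

def lock_held(key):
    return locks.get(key, 0) > time.monotonic()

def on_member_update(member_id, write_es):
    acquire_lock(f"lock:{member_id}")   # suppress cache fills for ~2 s
    write_es(member_id)                 # ES applies it within ~1 s
    cache.pop(member_id, None)          # drop the possibly stale entry

def read_member(member_id, query_es):
    if member_id in cache:
        return cache[member_id]
    value = query_es(member_id)
    # Only cache when no update is in flight; otherwise the ~1 s ES
    # refresh delay could freeze a stale value into Redis.
    if not lock_held(f"lock:{member_id}"):
        cache[member_id] = value
    return value
```

The 2-second window only needs to outlast ES's near-real-time refresh (~1 s), which is why such a short lock suffices.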
Redis Dual‑Center Multi‑Cluster Architecture
Deploys active‑active Redis clusters in data centers A and B. Writes are performed to both clusters; reads are served locally to minimize latency. If one data center fails, the other continues to provide full member services.
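In outline, the active-active pattern is dual writes with local reads. The two dicts below stand in for the A and B Redis clusters; a real deployment would hold two Redis clients and handle partial write failures.

```python
cluster_a, cluster_b = {}, {}   # stand-ins for the two Redis clusters

def write(key, value):
    for cluster in (cluster_a, cluster_b):   # dual write to both centers
        cluster[key] = value

def read(key, local=cluster_a):
    return local.get(key)                    # local read, minimal latency
```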
MySQL Dual‑Center Partition Cluster
Shards the >10 billion member records into over 1,000 shards (roughly 10 million rows each). Uses a 1‑master‑3‑slave topology with the master in data center A and slaves in B, synchronized over a dedicated link with <1 ms latency. DBRoute routes writes to the master and reads to the local data center, achieving >20,000 TPS and ~10 ms average latency.
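A DBRoute-style layer could pick a target along these lines: hash the member key onto one of ~1,000 partitions, then send writes to the master in data center A and reads to the local replica. The hash choice, shard count, and naming scheme here are assumptions; the article does not describe DBRoute's internals.

```python
import hashlib

SHARDS = 1000   # ~1,000 partitions, per the article

def shard_of(member_id: str) -> int:
    """Deterministically map a member key onto a shard number."""
    digest = hashlib.md5(member_id.encode()).hexdigest()
    return int(digest, 16) % SHARDS

def route(member_id: str, op: str, local_dc: str = "B") -> str:
    shard = shard_of(member_id)
    dc = "A" if op == "write" else local_dc   # writes go to the master
    return f"mysql-{dc}-shard-{shard:04d}"
```

Because reads go to a same-center replica and writes cross the <1 ms dedicated link at most once, the ~10 ms average latency figure is plausible even under cross-center operation.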
Seamless Migration from SqlServer to MySQL
Describes a zero‑downtime migration strategy comprising full data sync, incremental sync, and real‑time dual‑write. During trial runs, writes go to SqlServer while an asynchronous thread writes to MySQL with retries and logging. After validation, read traffic is gradually shifted from SqlServer to MySQL via gray release and A/B testing, with automated result comparison to ensure consistency.
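The trial-run dual-write stage can be sketched as: SqlServer remains the synchronous source of truth, while MySQL writes are queued and replayed asynchronously with retries and logging. The function names and queue-based shape are illustrative assumptions.

```python
import logging
import queue

log = logging.getLogger("migration")
mysql_queue: queue.Queue = queue.Queue()

def save_member(record, write_sqlserver):
    write_sqlserver(record)   # synchronous write; still the source of truth
    mysql_queue.put(record)   # enqueue the shadow write to MySQL

def drain_mysql_queue(write_mysql, retries=3):
    """Worker side: replay queued records into MySQL, retrying on failure."""
    while not mysql_queue.empty():
        record = mysql_queue.get()
        for attempt in range(retries + 1):
            try:
                write_mysql(record)
                break
            except Exception:
                log.warning("MySQL shadow write retry %d", attempt + 1)
```

Because the MySQL path is asynchronous, a MySQL outage during the trial run delays the shadow copy but never blocks the live SqlServer write, which is what makes the cutover risk-free to rehearse.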
MySQL and ES Primary‑Backup Cluster
Provides a fallback path: if the DAL component or MySQL fails, reads are switched to Elasticsearch; once MySQL recovers, data is resynced and traffic switched back.
Abnormal Member Relationship Governance
Identifies and rectifies cases where user accounts become incorrectly bound across platforms, preventing cross‑account data leakage and erroneous order cancellations.
Future Fine‑Grained Flow Control and Degradation
Proposes hotspot throttling for abusive accounts, per‑account flow rules to curb buggy client bursts, and global rate limiting to protect the system from extreme traffic spikes. Also outlines degradation strategies based on average response time and error rates, and discusses challenges in managing caller accounts for precise control.
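The proposed per-account flow rules amount to a rate limiter keyed by caller account. A token bucket is one common way to implement that; the rates and bucket sizes below are assumptions, not figures from the article.

```python
import time

class AccountLimiter:
    """Per-account token bucket: steady `rate` req/s with `burst` headroom."""

    def __init__(self, rate=100.0, burst=200.0):
        self.rate, self.burst = rate, burst
        self.buckets = {}   # account -> (tokens, last timestamp)

    def allow(self, account, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(account, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens < 1.0:
            self.buckets[account] = (tokens, now)
            return False          # throttle this caller only
        self.buckets[account] = (tokens - 1.0, now)
        return True
```

Keying the buckets by account is what makes the control "fine-grained": a buggy client or abusive hotspot exhausts only its own bucket, while other callers and the global limit stay unaffected.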