High‑Availability Architecture for a Membership System: Elasticsearch Dual‑Center Cluster, Redis Caching, MySQL Migration, and Flow‑Control Strategies
The article details a comprehensive high‑availability solution for a large‑scale membership system, covering Elasticsearch dual‑center master‑slave clusters, traffic‑isolated three‑cluster designs, deep ES optimizations, Redis caching with consistency safeguards, MySQL partitioned migration, and fine‑grained flow‑control and degradation mechanisms.
The membership system is a core service that must guarantee high performance and high availability across all business lines; failures would block order placement for the entire company.
To support billions of members and multi‑platform integration (e.g., Ctrip and eLong apps and mini‑programs), the team adopted Elasticsearch (ES) as the unified member store and designed a dual‑center master‑slave ES cluster spanning two data centers (A and B). Reads/writes go to the primary cluster in A, while data is asynchronously replicated to the backup cluster in B via MQ. In case of a primary‑cluster outage, traffic is switched to the backup cluster with minimal downtime.
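The write, replicate, and switch-over path described above can be sketched as follows. This is a minimal in-memory illustration, not the team's implementation: `InMemoryCluster`, `DualCenterStore`, and the `deque` standing in for the MQ are all hypothetical names introduced for this example.

```python
from collections import deque


class InMemoryCluster:
    """Hypothetical stand-in for one ES cluster: a dict keyed by member id."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, member_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.docs[member_id] = doc

    def get(self, member_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.docs.get(member_id)


class DualCenterStore:
    """Reads and writes hit the active cluster; writes are queued for the backup."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary
        self.queue = deque()  # stand-in for the MQ between data centers

    def write(self, member_id, doc):
        self.active.index(member_id, doc)
        if self.active is self.primary:
            self.queue.append((member_id, doc))

    def replicate(self):
        """Drain queued writes into the backup, as an MQ consumer would."""
        applied = 0
        while self.queue:
            member_id, doc = self.queue.popleft()
            self.backup.index(member_id, doc)
            applied += 1
        return applied

    def read(self, member_id):
        return self.active.get(member_id)

    def fail_over(self):
        """Point all traffic at the backup cluster."""
        self.active = self.backup
```

Because replication is asynchronous, any write still sitting in the queue at switch-over time is not yet visible on the backup; the sketch makes that window explicit through the separate `replicate()` step.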
Because a single marketing campaign once caused a traffic surge that threatened the ES cluster, the architecture was extended with three isolated ES clusters: a primary cluster for order‑critical requests, a dedicated high‑TPS cluster for marketing bursts, and a third cluster for other workloads, ensuring that spikes do not affect the main order flow.
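One way to picture the isolation is a routing table that maps a traffic class to a cluster. The cluster names and traffic-class labels below are assumptions made for illustration; the article does not say how requests are classified.

```python
# Hypothetical cluster names; anything not listed falls through to the third cluster.
CLUSTER_BY_TRAFFIC = {
    "order": "es-primary",       # order-critical requests
    "marketing": "es-high-tps",  # marketing bursts
}
DEFAULT_CLUSTER = "es-general"   # all other workloads


def pick_cluster(traffic_class: str) -> str:
    """Return the cluster that should serve this class of request."""
    return CLUSTER_BY_TRAFFIC.get(traffic_class, DEFAULT_CLUSTER)
```

With this shape, a burst of `"marketing"` requests can only saturate its own cluster, which is the property the three-cluster design is after.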
Further ES optimizations included balancing shard distribution across nodes, sizing the thread pool at cpu_core*3/2+1, keeping each shard at or below 50 GB, removing unnecessary text fields, using filter queries instead of scoring queries, and adding routing keys to reduce shard fan‑out.
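Two of these optimizations are easy to show concretely. The sketch below computes the thread-pool size from the formula in the text (using integer division, which is an assumption about how the fraction is rounded) and builds a lookup request that uses a non-scoring `filter` clause plus a `routing` value. The field name `member_id` and the choice of the member id as routing key are assumptions; the request is only built as plain dictionaries and no client is called.

```python
def search_pool_size(cpu_cores: int) -> int:
    """Thread-pool size from the formula cpu_core*3/2+1 (integer division assumed)."""
    return cpu_cores * 3 // 2 + 1


def member_lookup_request(member_id: str) -> dict:
    """Build a filter-only query with a routing value for one member.

    The bool/filter clause skips relevance scoring, and the routing value
    lets the search go to a single shard instead of fanning out to all.
    """
    return {
        "params": {"routing": member_id},
        "body": {
            "query": {
                "bool": {
                    "filter": [{"term": {"member_id": member_id}}]
                }
            }
        },
    }
```

Routing on read only helps if the same routing value was used when the document was indexed; otherwise the targeted shard may not hold the document.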
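Redis caching was introduced to relieve pressure on ES. A dual‑center Redis setup writes to both clusters and reads from the local one. Because ES is near‑real‑time, a read that lands just after an update can still fetch the old document and write it back to Redis; a distributed lock taken at update time, combined with delayed cache invalidation, prevents that stale entry.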
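The lock-based safeguard can be sketched with in-memory stand-ins. In this illustration an update takes a short-lived per-member lock and deletes the cache entry, and a read that misses the cache refuses to repopulate it while the lock is still held. `MemberCache`, the 2-second lock TTL, and the dictionaries standing in for ES and Redis are all assumptions, and the stand-in store is immediately consistent, so the sketch shows the control flow rather than a real stale read.

```python
import time


class MemberCache:
    """Cache-aside reads with a per-member lock that blocks cache refills."""

    def __init__(self, store, lock_ttl=2.0, clock=time.monotonic):
        self.store = store        # dict standing in for ES
        self.cache = {}           # dict standing in for Redis
        self.locks = {}           # member_id -> lock expiry timestamp
        self.lock_ttl = lock_ttl  # assumed value, longer than the ES refresh delay
        self.clock = clock

    def _locked(self, member_id):
        expiry = self.locks.get(member_id)
        return expiry is not None and self.clock() < expiry

    def update(self, member_id, doc):
        # Take the lock first so concurrent readers stop refilling the cache.
        self.locks[member_id] = self.clock() + self.lock_ttl
        self.store[member_id] = doc
        self.cache.pop(member_id, None)

    def get(self, member_id):
        if member_id in self.cache:
            return self.cache[member_id]
        doc = self.store.get(member_id)
        # While the lock is held the store may still return the old document,
        # so serve the value but do not write it back to the cache.
        if doc is not None and not self._locked(member_id):
            self.cache[member_id] = doc
        return doc
```

Once the lock expires, the next read repopulates the cache from a store that has had time to make the update visible.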
The original SQL Server member database reached physical limits; a MySQL dual‑center partitioned cluster (1000+ shards, 1‑master‑3‑slave per center) replaced it. Migration employed full data sync, real‑time dual‑write with retry logic, and gradual A/B traffic gray‑release, verifying consistency before full cut‑over.
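The dual-write and gray-release steps might look roughly like the sketch below. The stores are dict-like stand-ins, `max_attempts=3` and the CRC32-based bucketing are assumptions, and the retry queue is a plain list where a real system would more likely use a durable queue.

```python
import zlib


def reads_from_new_store(member_id: str, percent: int) -> bool:
    """Gray release: send a stable `percent` of members to the new store."""
    bucket = zlib.crc32(member_id.encode("utf-8")) % 100
    return bucket < percent


def dual_write(member_id, row, old_store, new_store, retry_queue, max_attempts=3):
    """Write to the old store, then to the new one with bounded retries.

    Returns True if the new store accepted the row; otherwise the write is
    parked in retry_queue so it can be replayed later.
    """
    old_store[member_id] = row
    for _ in range(max_attempts):
        try:
            new_store[member_id] = row
            return True
        except Exception:
            continue  # transient failure assumed; try again
    retry_queue.append((member_id, row))
    return False
```

Hashing the member id rather than picking randomly keeps each member on the same side of the split as `percent` is raised, which makes consistency checks between the two stores easier to interpret.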
To guard against failures in the DAL component, member data is also written to ES as a fallback path; if MySQL or the DAL fails, reads are switched to ES until recovery.
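A read path with this kind of fallback could be sketched as below. `MemberReader` and its two callables are hypothetical; the sketch flips to ES on the first MySQL or DAL error and stays there until `recover()` is called, which is one possible reading of "until recovery".

```python
class MemberReader:
    """Read from MySQL by default; switch to ES after a failure."""

    def __init__(self, read_mysql, read_es):
        self.read_mysql = read_mysql  # callable: member_id -> row
        self.read_es = read_es        # callable: member_id -> document
        self.use_es = False

    def get(self, member_id):
        if not self.use_es:
            try:
                return self.read_mysql(member_id)
            except Exception:
                # Stay on ES until recover() is called explicitly.
                self.use_es = True
        return self.read_es(member_id)

    def recover(self):
        """Switch reads back to MySQL once it is healthy again."""
        self.use_es = False
```

Keeping the switch sticky avoids hammering a failing MySQL or DAL with a probe on every request.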
Finally, the system implements fine‑grained flow‑control (hotspot limits, per‑caller rules, global caps) and degradation strategies based on response time, error count, and error ratio, ensuring graceful handling of overloads and failures.
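A compact sketch of two of these mechanisms follows: per-caller limits under a global cap, and degradation triggered by error ratio. The fixed one-second window, the thresholds, and the class names are assumptions; the article does not describe the actual algorithm or limits.

```python
import time


class CallerLimiter:
    """Fixed-window counter with per-caller limits and a global cap."""

    def __init__(self, per_caller, default_limit, global_cap,
                 window=1.0, clock=time.monotonic):
        self.per_caller = per_caller      # caller name -> requests per window
        self.default_limit = default_limit
        self.global_cap = global_cap
        self.window = window
        self.clock = clock
        self.window_start = clock()
        self.counts = {}
        self.total = 0

    def allow(self, caller):
        now = self.clock()
        if now - self.window_start >= self.window:
            # New window: reset all counters.
            self.window_start = now
            self.counts = {}
            self.total = 0
        limit = self.per_caller.get(caller, self.default_limit)
        if self.total >= self.global_cap or self.counts.get(caller, 0) >= limit:
            return False
        self.counts[caller] = self.counts.get(caller, 0) + 1
        self.total += 1
        return True


class ErrorRatioBreaker:
    """Signals degradation once enough calls have failed."""

    def __init__(self, min_calls=20, max_error_ratio=0.5):
        self.min_calls = min_calls
        self.max_error_ratio = max_error_ratio
        self.calls = 0
        self.errors = 0

    def record(self, ok):
        self.calls += 1
        if not ok:
            self.errors += 1

    def degraded(self):
        if self.calls < self.min_calls:
            return False  # too few samples to judge
        return self.errors / self.calls > self.max_error_ratio
```

The `min_calls` floor keeps a single early failure from tripping degradation; rules based on response time or raw error count would follow the same record-then-check shape.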
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
