High‑Availability Architecture for a Membership System: Elasticsearch Dual‑Center Cluster, Redis Caching, MySQL Migration, and Flow‑Control Strategies

The article details a comprehensive high‑availability solution for a large‑scale membership system, covering Elasticsearch dual‑center master‑slave clusters, traffic‑isolated three‑cluster designs, deep ES optimizations, Redis caching with consistency safeguards, MySQL partitioned migration, and fine‑grained flow‑control and degradation mechanisms.

Top Architect

The membership system is a core service on which every business line depends, so it must deliver both high performance and high availability; an outage would block order placement for the entire company.

To support billions of members and multi‑platform integration (e.g., Ctrip and eLong apps and mini‑programs), the team adopted Elasticsearch (ES) as the unified member store and designed a dual‑center master‑slave ES cluster spanning two data centers (A and B). Reads/writes go to the primary cluster in A, while data is asynchronously replicated to the backup cluster in B via MQ. In case of a primary‑cluster outage, traffic is switched to the backup cluster with minimal downtime.
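The read/write routing and asynchronous replication described above can be sketched as follows. This is a minimal in-memory illustration, not the team's implementation: `Cluster`, `DualCenterStore`, and the `Queue` standing in for the MQ are hypothetical names, and a real deployment would use ES clients, a message broker, and health checks.

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class Cluster:
    """In-memory stand-in for one ES cluster (hypothetical)."""
    name: str
    healthy: bool = True
    docs: dict = field(default_factory=dict)


class DualCenterStore:
    def __init__(self):
        self.primary = Cluster("A")   # data center A
        self.backup = Cluster("B")    # data center B
        self.mq = Queue()             # stands in for the cross-center MQ

    def active(self):
        # Serve from the primary while it is healthy, otherwise from the backup.
        return self.primary if self.primary.healthy else self.backup

    def write(self, member_id, doc):
        target = self.active()
        target.docs[member_id] = doc
        if target is self.primary:
            # Replication is asynchronous: enqueue and return immediately.
            self.mq.put((member_id, doc))

    def replicate_pending(self):
        # Consumer side: drain queued writes into the backup cluster.
        while not self.mq.empty():
            member_id, doc = self.mq.get()
            self.backup.docs[member_id] = doc

    def read(self, member_id):
        return self.active().docs.get(member_id)
```

Because replication is asynchronous, writes that are still queued when the primary fails are not yet visible on the backup; the sketch makes that window explicit through `replicate_pending`.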

Because a single marketing campaign once caused a traffic surge that threatened the ES cluster, the architecture was extended with three isolated ES clusters: a primary cluster for order‑critical requests, a dedicated high‑TPS cluster for marketing bursts, and a third cluster for other workloads, ensuring that spikes do not affect the main order flow.
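One way to picture the isolation is a routing table keyed by traffic class. The sketch below is an assumption-laden illustration: the class names and cluster names (`es-primary`, `es-marketing`, `es-general`) are invented for the example and do not come from the original system.

```python
from enum import Enum


class TrafficClass(Enum):
    ORDER = "order"          # order-critical requests
    MARKETING = "marketing"  # campaign bursts with high TPS
    OTHER = "other"          # everything else


# Hypothetical cluster names; each traffic class gets its own ES cluster.
CLUSTER_FOR = {
    TrafficClass.ORDER: "es-primary",
    TrafficClass.MARKETING: "es-marketing",
    TrafficClass.OTHER: "es-general",
}


def pick_cluster(traffic_class: TrafficClass) -> str:
    """Return the cluster that serves the given traffic class."""
    return CLUSTER_FOR[traffic_class]
```

With this split, a marketing spike can saturate only `es-marketing`, while order lookups keep hitting a cluster whose load is unaffected.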

Further ES optimizations included balancing shard distribution across nodes, sizing the thread pool as cpu_core * 3 / 2 + 1, keeping each shard at or below 50 GB, removing unnecessary text fields, using filter queries instead of scoring queries (filters skip relevance scoring), and adding routing keys so that a query reaches only the relevant shard instead of fanning out to all of them.
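Two of these optimizations are easy to show concretely. The sketch below computes the thread-pool size from the formula above and builds a lookup request that uses a `bool` filter plus a routing value. The field name `member_id` and the helper names are assumptions for illustration; the request is returned as plain dictionaries rather than sent through any particular client.

```python
def search_thread_pool_size(cpu_cores: int) -> int:
    """Thread-pool size from the formula cpu_core * 3 / 2 + 1 (integer division)."""
    return cpu_cores * 3 // 2 + 1


def member_lookup_request(member_id: str) -> dict:
    """Build a filter-only query with a routing key (field name is hypothetical)."""
    return {
        # Routing by member ID lets ES target one shard instead of all shards.
        "params": {"routing": member_id},
        "body": {
            "query": {
                "bool": {
                    # Filter context: no relevance scoring is computed.
                    "filter": [{"term": {"member_id": member_id}}]
                }
            }
        },
    }
```

For routing to help, documents must be indexed with the same routing value that queries use, here the member ID.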

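Redis caching was introduced to relieve pressure on ES. A dual-center Redis setup writes to both clusters and reads from the local one. Because ES is near-real-time, a write becomes searchable only after a short refresh delay, so a read issued right after an update could load the old document back into the cache; a distributed lock and delayed cache invalidation prevent this stale data.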
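One possible reading of the lock-based safeguard is sketched below: an update takes a short-lived lock on the member key, and reads that miss the cache refuse to repopulate it while the lock is held. Everything here is an assumption for illustration: plain dictionaries stand in for Redis and ES, the 2-second lock TTL is an invented value, and `MemberCache` is a hypothetical name.

```python
import time


class MemberCache:
    def __init__(self, es, lock_ttl=2.0, clock=time.monotonic):
        self.es = es              # dict standing in for ES
        self.cache = {}           # dict standing in for Redis
        self.locks = {}           # member_id -> lock expiry timestamp
        self.lock_ttl = lock_ttl  # assumed to exceed the ES refresh delay
        self.clock = clock

    def _locked(self, member_id):
        return self.locks.get(member_id, float("-inf")) > self.clock()

    def update(self, member_id, doc):
        # Take the lock first so concurrent readers stop filling the cache.
        self.locks[member_id] = self.clock() + self.lock_ttl
        self.es[member_id] = doc
        self.cache.pop(member_id, None)  # invalidate the cached copy

    def get(self, member_id):
        if member_id in self.cache:
            return self.cache[member_id]
        doc = self.es.get(member_id)
        # While the lock is held, ES may still return the old document,
        # so the result is served but not written back to the cache.
        if doc is not None and not self._locked(member_id):
            self.cache[member_id] = doc
        return doc
```

The in-memory `es` dict is immediately consistent, so the sketch shows only the control flow; against a real ES cluster the read inside the lock window might return the previous version, which is exactly why it is kept out of the cache.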

The original SQL Server member database reached its physical limits, so it was replaced by a MySQL dual-center partitioned cluster (1000+ shards, 1-master-3-slave per center). Migration proceeded in three steps: a full data sync, real-time dual-write with retry logic, and a gradual gray release that shifted traffic in stages, with consistency verified before the full cut-over.
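The dual-write and gray-release steps can be sketched as below. This is a simplified illustration under stated assumptions: the writer callables, the retry count of 3, the `failed_log` list used for later reconciliation, and the CRC32-based bucketing are all hypothetical choices, not details from the original migration.

```python
import zlib


def write_with_retry(write_fn, record, attempts=3):
    """Call write_fn(record), retrying on failure; return True on success."""
    for _ in range(attempts):
        try:
            write_fn(record)
            return True
        except Exception:
            continue
    return False


def dual_write(record, write_sqlserver, write_mysql, failed_log):
    """Write to the old store first, then mirror the write to MySQL."""
    write_sqlserver(record)  # old store stays the source of truth
    if not write_with_retry(write_mysql, record):
        failed_log.append(record)  # keep for later reconciliation


def reads_from_mysql(member_id: str, rollout_percent: int) -> bool:
    """Gray release: a stable hash places each member in one of 100 buckets."""
    bucket = zlib.crc32(member_id.encode("utf-8")) % 100
    return bucket < rollout_percent
```

Raising `rollout_percent` from 0 to 100 moves members to MySQL in stable groups, so a given member does not flip back and forth between stores as the rollout advances.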

To guard against failures of the data access layer (DAL) component, member data is also written to ES as a fallback path; if MySQL or the DAL fails, reads are switched to ES until recovery.
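A minimal sketch of that read switch, assuming the switch is sticky until an operator or health check declares recovery. `FallbackReader` and its callables are hypothetical names; the original does not describe how the switch is triggered or reset.

```python
class FallbackReader:
    def __init__(self, read_mysql, read_es):
        self.read_mysql = read_mysql
        self.read_es = read_es
        self.degraded = False  # True while reads are served from ES

    def read(self, member_id):
        if not self.degraded:
            try:
                return self.read_mysql(member_id)
            except Exception:
                # MySQL or the DAL failed: switch reads to ES.
                self.degraded = True
        return self.read_es(member_id)

    def mark_recovered(self):
        # Called once MySQL and the DAL are confirmed healthy again.
        self.degraded = False
```

Keeping the switch sticky avoids hammering a failing MySQL path with a probe on every request; the trade-off is that recovery must be signalled explicitly.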

Finally, the system implements fine‑grained flow‑control (hotspot limits, per‑caller rules, global caps) and degradation strategies based on response time, error count, and error ratio, ensuring graceful handling of overloads and failures.
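The per-caller rules, global cap, and degradation triggers can be sketched with a fixed-window counter and a simple threshold check. All thresholds and names below are illustrative assumptions; hotspot limits are omitted, and a production system would typically rely on a dedicated flow-control library rather than hand-rolled counters.

```python
from collections import defaultdict


class FlowController:
    """Fixed-window limiter with per-caller caps and a global cap."""

    def __init__(self, global_cap, per_caller_caps, default_caller_cap):
        self.global_cap = global_cap
        self.per_caller_caps = per_caller_caps
        self.default_caller_cap = default_caller_cap
        self.counts = defaultdict(int)
        self.total = 0

    def allow(self, caller):
        cap = self.per_caller_caps.get(caller, self.default_caller_cap)
        if self.total >= self.global_cap or self.counts[caller] >= cap:
            return False  # reject: caller or global limit reached
        self.counts[caller] += 1
        self.total += 1
        return True

    def reset_window(self):
        # Invoked at the start of each time window (e.g. every second).
        self.counts.clear()
        self.total = 0


def should_degrade(latencies_ms, error_count,
                   max_avg_ms=200.0, max_errors=50, max_error_ratio=0.5):
    """Degrade when average latency, error count, or error ratio is too high."""
    total = len(latencies_ms)
    if total == 0:
        return False
    avg_ms = sum(latencies_ms) / total
    return (avg_ms > max_avg_ms
            or error_count > max_errors
            or error_count / total > max_error_ratio)
```

The per-caller cap keeps one noisy caller from consuming the whole budget, while the global cap protects the service even when every caller stays within its own limit.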
