High‑Availability Architecture for a Membership System: Elasticsearch Dual‑Center Cluster, Redis Caching, and MySQL Migration
This article describes how a large‑scale membership system achieves high performance and fault tolerance by deploying a dual‑center Elasticsearch cluster, isolating traffic with multiple ES clusters, adding a Redis cache guarded by distributed locks, and migrating the primary relational store from SQL Server to a partitioned MySQL cluster. It closes with planned work on fine‑grained flow control and degradation strategies.
Background – The membership system is a core service that supports order processing across all business lines; any outage would block orders company‑wide. After merging two platforms, the system must handle billions of members and peak traffic exceeding 20k TPS.
1. Elasticsearch High‑Availability – A dual‑center primary‑backup ES cluster is deployed across two data centers (A and B). The primary cluster handles reads/writes; data is replicated to the backup via MQ. In case of a node or data‑center failure, traffic is switched to the backup with minimal downtime. To further isolate heavy marketing traffic, a separate ES cluster handles high‑TPS marketing requests, preventing them from affecting the main order‑flow cluster.
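The primary-to-backup switch can be pictured with a small routing sketch. The article does not say how the switch is triggered, so the consecutive-failure counter below is an assumption, as are the names `FailoverRouter`, `ClusterUnavailable`, and the cluster labels; the search functions are stand-ins for real ES clients.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class ClusterUnavailable(Exception):
    """Raised by a (hypothetical) cluster client that cannot serve a request."""


SearchFn = Callable[[str], dict]


@dataclass
class FailoverRouter:
    # Clusters in priority order: primary first, backup second.
    clusters: List[Tuple[str, SearchFn]]
    # Assumed policy: skip a cluster after this many consecutive failures.
    failure_threshold: int = 3
    failures: Dict[str, int] = field(default_factory=dict)

    def search(self, member_id: str) -> dict:
        last_error = None
        for name, search_fn in self.clusters:
            if self.failures.get(name, 0) >= self.failure_threshold:
                continue  # considered down; a real system would probe and reset
            try:
                result = search_fn(member_id)
            except ClusterUnavailable as exc:
                self.failures[name] = self.failures.get(name, 0) + 1
                last_error = exc
                continue
            self.failures[name] = 0
            return {"served_by": name, **result}
        raise ClusterUnavailable("no cluster could serve the request") from last_error
```

Once the primary crosses the threshold, requests go straight to the backup without paying for a failed attempt first. A production setup would also need a health probe that resets the counter when the primary recovers.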
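2. ES Cluster Optimizations – Several problems were identified and corrected: uneven shard distribution across nodes, oversized thread pools, shards that consumed too much memory, fields mapped redundantly as both text and keyword, and exact‑match conditions run as scoring queries rather than filters (filter clauses skip relevance scoring and can be cached). Further optimizations added routing keys, so that a lookup targets a single shard instead of every shard, and moved sorting out of ES into application memory. Together with the smaller thread pools, these changes dramatically lowered CPU usage and improved query latency.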
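Three of these optimizations (filter instead of query, routing keys, sorting in application memory) can be illustrated by how a lookup request is built. This is a minimal sketch that only constructs the request; the field names `mobile` and `updated_at` and the helper names are hypothetical, and the article does not show its actual mappings or queries.

```python
from typing import Dict, List


def build_lookup(field: str, value: str, routing_key: str) -> Dict:
    """Build an exact-match search request that runs in filter context."""
    return {
        # `routing` is sent as a request parameter so only one shard is searched.
        "params": {"routing": routing_key},
        "body": {
            # A `term` clause under `bool.filter` is not scored and can be cached.
            "query": {"bool": {"filter": [{"term": {field: value}}]}},
        },
    }


def sort_in_memory(hits: List[Dict], sort_field: str, descending: bool = True) -> List[Dict]:
    """Sort the (small) result set in the application instead of inside ES."""
    return sorted(hits, key=lambda hit: hit[sort_field], reverse=descending)


request = build_lookup("mobile", "13800000000", routing_key="13800000000")
hits = [{"id": 1, "updated_at": 5}, {"id": 2, "updated_at": 9}]
newest_first = sort_in_memory(hits, "updated_at")
```

Routing only helps when documents were indexed with the same routing value, so the key has to be chosen before data is written.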
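3. Redis Caching Strategy – Initially the system avoided caching because ES updates are only near‑real‑time and member data must stay consistent. After a traffic spike during a blind‑box promotion, a Redis cache was introduced. To keep stale data out of the cache, each update takes a 2‑second distributed lock and deletes the cached entry; a read that finds the lock held returns the ES result without caching it, since ES may take about 1 second to make the update searchable. The cache now achieves a hit rate above 90%, relieving pressure on ES.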
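The interaction between the 2‑second lock and the cache can be sketched as follows. `FakeRedis` is an in-memory stand-in so the example runs on its own, and the key names and function names are hypothetical; with a real Redis the lock would typically be a key written with an expiry (for example `SET key value EX 2`).

```python
import time
from typing import Callable, Dict, Optional, Tuple

UPDATE_LOCK_TTL = 2.0  # seconds; the lock duration given in the article


class FakeRedis:
    """In-memory stand-in for the few Redis operations this sketch needs."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # expired keys behave as if they were never set
            return None
        return value

    def set(self, key: str, value: str, ttl: Optional[float] = None) -> None:
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def update_member(redis: FakeRedis, es_write: Callable[[str, str], None],
                  member_id: str, doc: str) -> None:
    # Mark the member as "recently updated", then drop the cached copy.
    redis.set(f"lock:{member_id}", "1", ttl=UPDATE_LOCK_TTL)
    redis.delete(f"member:{member_id}")
    es_write(member_id, doc)


def read_member(redis: FakeRedis, es_search: Callable[[str], str],
                member_id: str) -> str:
    cached = redis.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es_search(member_id)
    # While the lock exists, ES may still return the old document: do not cache it.
    if redis.get(f"lock:{member_id}") is None:
        redis.set(f"member:{member_id}", doc)
    return doc
```

The check-then-set in `read_member` is not atomic, so a real implementation would also need to make cache deletion and cache population mutually exclusive; the sketch leaves that out.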
4. High‑Availability Redis – Two Redis clusters are deployed, one in each of data centers A and B. Updates are written to both clusters, and reads are served from the cluster in the caller's own data center, so the service keeps running if one site fails.
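A minimal sketch of the dual-write, local-read pattern is shown below. Plain dictionaries stand in for the two Redis clusters, and the class name `DualCenterCache` and the policy of tolerating a single failed write are assumptions; the article does not describe how a failed write to one site is handled.

```python
from typing import List, MutableMapping, Optional


class DualCenterCache:
    """Write to both sites, read only from the local one."""

    def __init__(self, local: MutableMapping, remote: MutableMapping):
        self.local = local
        self.remote = remote
        # Writes that reached only one site; a real system would retry or repair these.
        self.failed_writes: List[str] = []

    def set(self, key: str, value: str) -> None:
        succeeded = 0
        for name, store in (("local", self.local), ("remote", self.remote)):
            try:
                store[key] = value
                succeeded += 1
            except ConnectionError:
                self.failed_writes.append(f"{name}:{key}")
        if succeeded == 0:
            raise ConnectionError("both cache clusters rejected the write")

    def get(self, key: str) -> Optional[str]:
        # Reads never cross data centers.
        return self.local.get(key)


site_a, site_b = {}, {}
cache = DualCenterCache(local=site_a, remote=site_b)
cache.set("member:1", "alice")
```

Because each site already holds every entry, an application instance in either data center can keep serving reads when the other site is unreachable.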
5. Primary Database Migration – The original SQL Server instance reached physical limits with >10 billion rows. A partitioned MySQL cluster (1000+ shards, 1‑master‑3‑slave per shard) was built across two data centers, with DBRoute directing writes to the master in A and reads to the local replica. Performance testing showed >20k TPS with ~10 ms latency.
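The routing rule (writes to the master in A, reads to a replica in the caller's data center) can be sketched like this. The article does not describe DBRoute's interface or the sharding key, so the shard count of 1024, the CRC32 hash on the member ID, and the host naming scheme are all illustrative assumptions.

```python
import zlib

SHARD_COUNT = 1024  # assumption; the article only says "1000+ shards"


def shard_for(member_id: str) -> int:
    """Map a member ID to a shard with a stable hash (CRC32 here, as an example)."""
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def route(member_id: str, operation: str, local_dc: str) -> str:
    """Return a hypothetical host name for the given operation."""
    shard = shard_for(member_id)
    if operation == "write":
        # All writes go to the shard's master, which lives in data center A.
        return f"shard{shard:04d}.master.dc-a"
    # Reads stay inside the caller's own data center.
    return f"shard{shard:04d}.replica.dc-{local_dc}"


write_target = route("member-42", "write", local_dc="b")
read_target = route("member-42", "read", local_dc="b")
```

A stable hash matters here: Python's built-in `hash()` is randomized per process for strings, so it would send the same member to different shards on different application instances.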
6. Migration Process – A three‑phase approach: full data sync, real‑time dual‑write, and gradual traffic gray‑release (A/B testing). Consistency checks compare results from both databases; any discrepancy is logged and resolved before increasing traffic share.
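The gray-release phase with consistency checks can be sketched as a reader that compares both stores for a configurable share of members. The bucketing by CRC32, the class name `MigrationReader`, and the choice to return the old store's result on a mismatch are assumptions; the article only states that discrepancies are logged and resolved before the traffic share is raised.

```python
import zlib
from typing import Callable, List, Optional

ReadFn = Callable[[str], Optional[str]]


def in_gray_release(member_id: str, percent: int) -> bool:
    """Deterministically place a member into one of 100 buckets."""
    return zlib.crc32(member_id.encode("utf-8")) % 100 < percent


class MigrationReader:
    def __init__(self, old_read: ReadFn, new_read: ReadFn, percent: int):
        self.old_read = old_read      # stand-in for the SQL Server read path
        self.new_read = new_read      # stand-in for the MySQL read path
        self.percent = percent        # share of members served by the new store
        self.mismatches: List[str] = []

    def read(self, member_id: str) -> Optional[str]:
        old_value = self.old_read(member_id)
        if not in_gray_release(member_id, self.percent):
            return old_value
        new_value = self.new_read(member_id)
        if new_value != old_value:
            # Record the discrepancy and fall back to the old store (assumed policy).
            self.mismatches.append(member_id)
            return old_value
        return new_value
```

Raising `percent` step by step, and only while `mismatches` stays empty, mirrors the gradual traffic shift described above.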
7. Fallback to Elasticsearch – If the DAL component or MySQL fails, reads/writes can be switched to ES, with later synchronization back to MySQL once it recovers.
8. Exception Member Governance – Complex logic identifies abnormal member bindings (e.g., cross‑account issues) and patches code paths to prevent data leakage and order manipulation.
9. Future Directions – Implement finer‑grained flow‑control (hotspot limiting, per‑caller rules, global throttling) and degradation strategies based on response time, error rate, and exception counts, as well as a systematic audit of all caller accounts.
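Per-caller rules could take the form of one token bucket per calling account. The sketch below covers only that piece, not hotspot limiting or degradation, and the class names, rule format, and numbers are hypothetical since the article describes this as planned work without giving a design.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill according to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class PerCallerLimiter:
    """Keep a separate bucket per caller so one caller cannot starve the others."""

    def __init__(self, rules: Dict[str, Tuple[float, int]],
                 default: Tuple[float, int],
                 clock: Callable[[], float] = time.monotonic):
        self._rules = rules        # caller -> (rate per second, burst capacity)
        self._default = default    # rule for callers without an explicit entry
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, caller: str) -> bool:
        bucket = self._buckets.get(caller)
        if bucket is None:
            rate, capacity = self._rules.get(caller, self._default)
            bucket = TokenBucket(rate, capacity, self._clock)
            self._buckets[caller] = bucket
        return bucket.allow()
```

With a rule such as `{"marketing": (1.0, 2)}`, a burst from the marketing caller is rejected after two requests while other callers continue under the default rule.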
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
