High‑Availability Architecture for a Membership System: Elasticsearch Dual‑Center, Redis Caching, MySQL Migration and Fine‑Grained Flow Control
This article details the design and implementation of a high‑availability membership system, covering Elasticsearch dual‑center master‑slave clusters, traffic‑isolated three‑cluster ES architecture, Redis multi‑center caching, MySQL dual‑center partitioning, data migration strategies, and refined flow‑control and degradation mechanisms to ensure stable, low‑latency service under massive concurrent load.
Background
The membership system is a core service that supports all business lines; any failure blocks order placement across the company. After the merger of Tongcheng and eLong, the system must handle cross‑platform member queries and traffic spikes exceeding 20,000 TPS during holidays.
1. Elasticsearch High‑Availability Solution
We deploy a dual‑center ES master‑slave cluster: the primary cluster in Data Center A and the standby cluster in Data Center B. Data is synchronized via MQ, and read/write can be switched to the standby cluster instantly if the primary fails.
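The switch between the two clusters can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `FakeCluster`, `DualCenterEs`, and the `deque` standing in for the MQ are hypothetical names for illustration, not the system's real components.

```python
from collections import deque


class FakeCluster:
    """In-memory stand-in for one ES cluster (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, member_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.docs[member_id] = doc

    def get(self, member_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.docs.get(member_id)


class DualCenterEs:
    """Sends reads and writes to the active cluster; replicates through a queue."""

    def __init__(self, primary, standby):
        self.primary, self.standby = primary, standby
        self.active = primary
        self.mq = deque()  # stand-in for the real message queue

    def write(self, member_id, doc):
        self.active.index(member_id, doc)
        self.mq.append((member_id, doc))  # published for the other cluster

    def drain_mq(self):
        # Consumer side: apply queued writes to the cluster that is not active.
        passive = self.standby if self.active is self.primary else self.primary
        while self.mq:
            member_id, doc = self.mq.popleft()
            passive.index(member_id, doc)

    def read(self, member_id):
        return self.active.get(member_id)

    def switch_to_standby(self):
        self.active = self.standby
```

Because the standby receives every write through the queue, flipping `active` is enough to move traffic; the sketch leaves out health checking and what happens to messages still in flight at switch time.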
To isolate high‑TPS marketing traffic, we add a third ES cluster dedicated to flash‑sale requests, separating it from the main ES cluster.
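A minimal sketch of the isolation rule, assuming each request carries a caller tag. The cluster names and caller tags below are hypothetical placeholders, not values from the original system.

```python
MAIN_CLUSTER = "es-main"              # hypothetical cluster name
FLASH_SALE_CLUSTER = "es-flash-sale"  # hypothetical cluster name

# Assumed caller tags that identify high-TPS marketing traffic.
FLASH_SALE_CALLERS = frozenset({"flash-sale", "marketing-campaign"})


def pick_cluster(caller: str) -> str:
    """Send flash-sale callers to the dedicated cluster, everyone else to main."""
    return FLASH_SALE_CLUSTER if caller in FLASH_SALE_CALLERS else MAIN_CLUSTER
```

With this split, a spike from a marketing campaign can saturate only the dedicated cluster, while order-path queries continue to hit the main one.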
2. ES Deep Optimization
Balance shard distribution to avoid hot nodes.
Set thread‑pool size to cpu_core * 3 / 2 + 1 to prevent CPU spikes.
Keep each shard's size at or below 50 GB.
Remove unnecessary text fields; keep only keyword fields for member queries.
Prefer filter over query to skip relevance scoring.
Use routing keys to target specific shards.
These optimizations reduced CPU usage and improved query latency dramatically.
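Three of the items above can be made concrete in a few lines: the thread-pool formula, the filter-only query, and the routing key. The field name `member_id` and the helper names are assumptions for illustration; the query body uses the standard `bool`/`filter`/`term` Query DSL and the `routing` request parameter.

```python
def search_thread_pool_size(cpu_cores: int) -> int:
    """Thread-pool size from the rule cpu_core * 3 / 2 + 1 (integer division)."""
    return cpu_cores * 3 // 2 + 1


def member_lookup_request(member_id: str) -> dict:
    """Build a filter-only lookup routed to the shard that holds this member.

    Assumes documents were indexed with the same routing value (member_id).
    """
    return {
        # Routing lets ES query a single shard instead of fanning out to all.
        "params": {"routing": member_id},
        # A bool filter clause matches without computing relevance scores.
        "body": {"query": {"bool": {"filter": [{"term": {"member_id": member_id}}]}}},
    }
```

For example, `search_thread_pool_size(8)` gives 13 threads on an 8-core node.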
3. Redis Caching Scheme
We introduced a dual‑center multi‑cluster Redis architecture. Writes are performed to both data centers; reads are served locally. A 2‑second distributed lock prevents cache inconsistency caused by Elasticsearch’s near‑real‑time delay.
Cache hit rate exceeds 90 %, greatly relieving ES pressure.
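One plausible reading of the 2-second lock, sketched with an in-memory stand-in for Redis: an update takes a short-lived lock and deletes the cached entry, and a read that misses the cache refills it only when no lock is present, so a possibly stale ES result is never cached during the near-real-time window. The key names, `TtlStore`, and the dict standing in for ES are assumptions; dual-center writes are left out.

```python
import time

LOCK_TTL_SECONDS = 2  # lock duration stated in the text


class TtlStore:
    """In-memory stand-in for Redis with optional per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def set(self, key, value, ttl=None):
        expires = self._clock() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and self._clock() >= expires:
            del self._data[key]
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)


def update_member(store, es, member_id, doc):
    # Take the lock first, then write to ES, then drop the cached copy.
    store.set(f"lock:{member_id}", 1, ttl=LOCK_TTL_SECONDS)
    es[member_id] = doc
    store.delete(f"member:{member_id}")


def get_member(store, es, member_id):
    cached = store.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es.get(member_id)
    # While the lock is held, ES may not have refreshed yet: do not cache.
    if doc is not None and store.get(f"lock:{member_id}") is None:
        store.set(f"member:{member_id}", doc)
    return doc
```

The cost of this design is a brief period of cache misses for a freshly updated member, which is small next to the risk of pinning stale data in the cache.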
4. High‑Availability MySQL Primary Store
We migrated from a single‑instance SQL Server to a dual‑center MySQL partitioned cluster (over 1,000 shards, 1 master + 3 slaves). Data is routed via DBRoute: writes go to the master in Data Center A, reads are local. The cluster sustains >20,000 TPS with ~10 ms latency.
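The routing rule can be sketched as follows. The source does not describe DBRoute's internals, so the hash function, the shard count of 1024, and the node-naming scheme are all assumptions chosen for illustration.

```python
import zlib

SHARD_COUNT = 1024  # assumption: the text only says "over 1,000 shards"


def shard_for(member_id: str) -> int:
    """Map a member ID to a shard with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def pick_node(member_id: str, is_write: bool, local_dc: str) -> str:
    """Writes go to the master in Data Center A; reads stay in the local center."""
    shard = shard_for(member_id)
    if is_write:
        return f"shard{shard}-master-dcA"        # hypothetical node name
    return f"shard{shard}-replica-{local_dc}"    # hypothetical node name
```

A stable hash matters here: Python's built-in `hash()` is randomized per process for strings, so it would route the same member to different shards on different hosts.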
Migration followed a “full sync + incremental sync + real‑time gray switch” strategy: dual writes with retry logic kept the old and new stores in step, and A/B verification on gray (gradually shifted) traffic confirmed data consistency.
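A minimal sketch of the dual-write and gradual-switch ideas, assuming the old store remains the source of truth during migration. The function names, the retry count, and the modulo-based bucketing are hypothetical choices, not details from the source.

```python
def write_with_retry(write_fn, record, attempts=3):
    """Call write_fn(record) up to `attempts` times; report whether it succeeded."""
    for _ in range(attempts):
        try:
            write_fn(record)
            return True
        except Exception:
            continue  # transient failure: try again
    return False


def dual_write(write_old, write_new, record, repair_queue):
    """Write to the old store first, then mirror the record to the new store."""
    write_old(record)  # old store stays authoritative during migration
    if not write_with_retry(write_new, record):
        # Park the record so an incremental sync job can repair it later.
        repair_queue.append(record)


def in_gray_bucket(member_id: int, percent: int) -> bool:
    """Decide whether this member's traffic is already served by the new store."""
    return member_id % 100 < percent
```

Raising `percent` step by step moves a growing share of members onto the new store, and setting it back to 0 is an immediate rollback.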
After MySQL is fully operational, we add an ES master‑slave backup; if MySQL or DAL components fail, reads/writes can fall back to ES and later resynchronize.
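The fallback path might look like the sketch below. `Store` and `FallbackStore` are hypothetical in-memory stand-ins, and the normal MySQL-to-ES synchronization is out of scope; only the degraded path and the later resynchronization are modeled.

```python
class Store:
    """In-memory stand-in for MySQL or ES (hypothetical, for illustration)."""

    def __init__(self):
        self.rows = {}
        self.down = False

    def write(self, key, row):
        if self.down:
            raise ConnectionError("store unavailable")
        self.rows[key] = row

    def read(self, key):
        if self.down:
            raise ConnectionError("store unavailable")
        return self.rows.get(key)


class FallbackStore:
    """Prefer MySQL; fall back to ES and remember what must be replayed."""

    def __init__(self, mysql, es):
        self.mysql, self.es = mysql, es
        self.pending_resync = []

    def write(self, key, row):
        try:
            self.mysql.write(key, row)
        except ConnectionError:
            self.es.write(key, row)
            self.pending_resync.append(key)

    def read(self, key):
        try:
            return self.mysql.read(key)
        except ConnectionError:
            return self.es.read(key)

    def resync(self):
        # Once MySQL is back, replay the writes that only reached ES.
        while self.pending_resync:
            key = self.pending_resync.pop(0)
            self.mysql.write(key, self.es.read(key))
```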
5. Abnormal Member Relationship Governance
We identified and fixed rare cases where member accounts became incorrectly linked, preventing cross‑account data leakage and ensuring correct order visibility.
6. Future: Fine‑Grained Flow Control and Degradation
We plan to implement hotspot throttling, per‑caller flow rules, and global traffic caps to protect the system from extreme spikes, as well as response‑time‑based and error‑rate‑based circuit‑breakers.
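Since this part is still a plan, the following is only an illustrative sketch of two of the named mechanisms: per-caller rules combined with a global cap, and an error-rate circuit breaker. The fixed one-second window, the class names, and the thresholds are all assumptions.

```python
from collections import defaultdict


class FlowControl:
    """Fixed one-second window counters: one per caller plus a global cap."""

    def __init__(self, per_caller_limit, global_limit):
        self.per_caller_limit = per_caller_limit
        self.global_limit = global_limit
        self.window = None
        self.per_caller = defaultdict(int)
        self.total = 0

    def allow(self, caller, now):
        window = int(now)  # `now` is a timestamp in seconds
        if window != self.window:
            # A new second has started: reset all counters.
            self.window = window
            self.per_caller.clear()
            self.total = 0
        if self.total >= self.global_limit:
            return False
        if self.per_caller[caller] >= self.per_caller_limit:
            return False
        self.per_caller[caller] += 1
        self.total += 1
        return True


class ErrorRateBreaker:
    """Opens once enough calls were seen and the error ratio reaches the threshold."""

    def __init__(self, threshold, min_calls):
        self.threshold = threshold
        self.min_calls = min_calls
        self.calls = 0
        self.errors = 0

    def record(self, ok):
        self.calls += 1
        if not ok:
            self.errors += 1

    def is_open(self):
        if self.calls < self.min_calls:
            return False
        return self.errors / self.calls >= self.threshold
```

A production breaker would also need a sliding window and a half-open state to probe for recovery; both are omitted here to keep the sketch short.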
IT Architects Alliance
