How to Build a High‑Performance, Highly Available Membership System with ES, Redis, and MySQL
This article explains how a large‑scale membership system achieves high performance and high availability by using a dual‑center Elasticsearch cluster, traffic‑isolated three‑cluster architecture, Redis caching with dual‑center clusters, and a MySQL partitioned dual‑center setup, while also detailing optimization, migration, and fine‑grained flow‑control strategies.
Background
The membership system is a core service tightly coupled with all business lines' order processes; any failure blocks ordering across the entire company, so it must be high‑performance and highly available.
After the merger of Tongcheng and eLong, the system must serve multiple platforms (apps, mini‑programs) and support cross‑marketing scenarios, leading to massive request volumes (over 20,000 TPS during peak holidays).
ES High‑Availability Solution
1. Dual‑Center Primary‑Backup ES Cluster
Because the unified member data runs to more than a billion records, Elasticsearch (ES) is used for storage. The cluster is deployed in two data centers (A and B): the primary cluster runs in A and the backup in B. Writes go to the primary and are synchronized to the backup via MQ. If the primary fails, traffic is switched to the backup immediately; once the primary recovers, the data is resynced to it.
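The write path and failover described above can be sketched roughly as follows. This is only an illustration under stated assumptions: FakeCluster, DualCenterEsStore, and the in-memory queue are hypothetical stand-ins for the real ES clusters and the MQ, and none of these names come from the original system.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical in-memory stand-in for an ES cluster; not a real ES client.
class FakeCluster {
    final Map<String, String> docs = new HashMap<>();
    boolean healthy = true;
}

class DualCenterEsStore {
    private final FakeCluster primary = new FakeCluster(); // data center A
    private final FakeCluster backup = new FakeCluster();  // data center B
    private final Queue<String[]> syncQueue = new ArrayDeque<>(); // stands in for the MQ
    private boolean usePrimary = true;

    // Writes go to the primary, then a sync message is queued for the backup.
    void write(String memberId, String doc) {
        primary.docs.put(memberId, doc);
        syncQueue.add(new String[] {memberId, doc});
    }

    // An MQ consumer would apply queued writes to the backup cluster.
    void drainSyncQueue() {
        String[] msg;
        while ((msg = syncQueue.poll()) != null) {
            backup.docs.put(msg[0], msg[1]);
        }
    }

    // Reads follow the active cluster and switch to the backup if the primary is down.
    String read(String memberId) {
        if (usePrimary && !primary.healthy) {
            usePrimary = false;
        }
        return (usePrimary ? primary : backup).docs.get(memberId);
    }

    void markPrimaryDown() { primary.healthy = false; }

    boolean isUsingPrimary() { return usePrimary; }
}
```

Resyncing the primary after recovery is left out of the sketch; the article does not describe how that step is carried out.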
2. ES Traffic‑Isolation Three‑Cluster Architecture
Requests are classified into two priority groups: order‑critical queries (high priority) and marketing‑driven high‑TPS queries (lower priority). A separate ES cluster handles the high‑TPS marketing traffic, isolating it from the primary cluster to protect the order flow.
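A minimal sketch of this routing decision, assuming each request can be tagged with its priority group at the entry point. The cluster labels (es-primary, es-backup, es-marketing) and the RequestKind enum are hypothetical names used for illustration only.

```java
enum RequestKind { ORDER_CRITICAL, MARKETING }

class EsTrafficRouter {
    // Cluster names are hypothetical labels, not names from the article.
    static String clusterFor(RequestKind kind, boolean primaryHealthy) {
        if (kind == RequestKind.MARKETING) {
            return "es-marketing"; // isolated cluster for high-TPS marketing queries
        }
        // Order-critical traffic stays on the primary, or the backup on failure.
        return primaryHealthy ? "es-primary" : "es-backup";
    }
}
```

With this split, a marketing burst can saturate only its own cluster, while order-critical queries keep the primary and backup to themselves.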
3. Deep ES Optimizations
Uneven shard distribution caused hot nodes; rebalancing reduced hotspots.
Thread‑pool size was set too high; adjusted to cpu_core * 3 / 2 + 1 to lower CPU usage.
Shard memory size limited to 50 GB to avoid slow queries.
Removed duplicate text fields, keeping only keyword for member lookups.
Used filter instead of query to skip relevance scoring.
Moved result sorting to the member service JVM.
Added routing keys to target specific shards.
These changes dramatically reduced CPU usage and improved query latency.
After Redis caching was introduced (see the next section), the cache hit rate exceeded 90%, which further relieved pressure on ES.
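Three of the optimizations above (the thread-pool formula, filter instead of query, and routing keys) can be illustrated with a short sketch. The class name EsTuning, the field name, and the index name used in the usage note are assumptions for illustration; the request body and the routing parameter follow the standard ES search API, but the article does not show the actual queries.

```java
class EsTuning {
    // Formula quoted in the article: cpu_core * 3 / 2 + 1 (integer arithmetic).
    static int searchThreadPoolSize(int cpuCores) {
        return cpuCores * 3 / 2 + 1;
    }

    // A term clause under bool/filter runs in filter context, so no relevance
    // score is computed. Values are not JSON-escaped here; a real client
    // library should take care of that.
    static String memberLookupBody(String keywordField, String value) {
        return "{\"query\":{\"bool\":{\"filter\":[{\"term\":{\""
                + keywordField + "\":\"" + value + "\"}}]}}}";
    }

    // The routing parameter restricts the search to the shard(s) for that key.
    static String searchPath(String index, String routingKey) {
        return "/" + index + "/_search?routing=" + routingKey;
    }
}
```

For example, on an 8-core node the formula gives a pool size of 13, and a lookup on a hypothetical keyword field "mobile" in a hypothetical index "member" would be sent to searchPath("member", routingKey) with the body from memberLookupBody.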
Member Redis Cache Solution
Initially the system did not cache because ES performance was sufficient and data consistency was critical. However, a sudden spike during a ticket blind‑box event prompted the introduction of a cache.
1. Redis Dual‑Center Multi‑Cluster Architecture
Two Redis clusters are deployed in data centers A and B. Writes are performed on both (dual‑write) and succeed only when both succeed. Reads are served locally to minimize latency. This ensures service continuity even if one data center fails.
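The dual-write and local-read rule can be sketched as below, using in-memory maps in place of the two Redis clusters. The class DualCenterCache and the rollback on partial failure are assumptions: the article only states that a write succeeds when both clusters succeed, not how a half-completed write is handled.

```java
import java.util.HashMap;
import java.util.Map;

class DualCenterCache {
    // Hypothetical stand-ins for the Redis clusters in data centers A and B.
    private final Map<String, String> clusterA = new HashMap<>();
    private final Map<String, String> clusterB = new HashMap<>();
    boolean clusterBWritable = true; // toggled to simulate a failed write to B

    // Dual-write: report success only if both clusters accepted the value.
    boolean put(String key, String value) {
        String previousA = clusterA.put(key, value);
        if (!clusterBWritable) {
            // Assumption: undo the write to A so the two clusters do not diverge.
            if (previousA == null) {
                clusterA.remove(key);
            } else {
                clusterA.put(key, previousA);
            }
            return false;
        }
        clusterB.put(key, value);
        return true;
    }

    // Reads are served from the cluster in the caller's own data center.
    String get(String key, char localDataCenter) {
        return (localDataCenter == 'A' ? clusterA : clusterB).get(key);
    }
}
```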
High‑Availability Member Primary‑DB Solution
The member detail data originally lived in a single SQL Server instance that reached physical limits with over a billion rows.
1. MySQL Dual‑Center Partitioned Cluster
Data is sharded into more than 1,000 partitions (~1 million rows each). The cluster uses a 1‑master‑3‑slave topology, with the master in data center A and slaves in B, synchronized over a dedicated link with sub‑millisecond latency. Reads are routed locally, writes go to the master.
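A rough sketch of how a member ID might map to a partition and how reads and writes might be routed. The article gives neither the sharding key nor the exact partition count, so the modulo rule, the count of 1024, and the node labels are all assumptions for illustration.

```java
class MemberShardRouter {
    // Assumption: the article only says "more than 1,000" partitions.
    static final int PARTITIONS = 1024;

    // Assumption: partition chosen by member ID modulo the partition count.
    static int partitionFor(long memberId) {
        return (int) Math.floorMod(memberId, (long) PARTITIONS);
    }

    // Writes always go to the master in A; reads stay in the local data center.
    static String nodeFor(boolean isWrite, char localDataCenter) {
        if (isWrite) {
            return "master-A";
        }
        return localDataCenter == 'A' ? "master-A" : "slave-B";
    }
}
```

Math.floorMod is used instead of the % operator so that a negative ID still maps to a valid partition index.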
2. Smooth Migration from SQL Server to MySQL
The migration is performed without downtime using a three‑stage approach: full data sync, real‑time dual‑write, and gradual traffic gray‑release (A/B testing). Consistency checks compare results from both databases; any discrepancy is logged and resolved before increasing traffic.
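The dual-write and gray-release stages can be sketched as follows, with maps standing in for the two databases. MigrationReader, the percentage-by-ID bucketing, and the choice to serve the old store's result on a mismatch are assumptions; the article only says discrepancies are logged and resolved before traffic is increased.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class MigrationReader {
    final Map<Long, String> sqlServer = new HashMap<>(); // stand-in for the old store
    final Map<Long, String> mysql = new HashMap<>();     // stand-in for the new store
    int mismatches = 0;
    private final int mysqlPercent; // share of reads served from MySQL, 0..100

    MigrationReader(int mysqlPercent) {
        this.mysqlPercent = mysqlPercent;
    }

    // Dual-write stage: every write goes to both stores.
    void write(long memberId, String row) {
        sqlServer.put(memberId, row);
        mysql.put(memberId, row);
    }

    // Gray-release stage: compare both results, count discrepancies,
    // and serve from MySQL only for the configured share of member IDs.
    String read(long memberId) {
        String oldRow = sqlServer.get(memberId);
        String newRow = mysql.get(memberId);
        if (!Objects.equals(oldRow, newRow)) {
            mismatches++;
            return oldRow; // assumption: trust the old store on disagreement
        }
        boolean useMysql = Math.floorMod(memberId, 100L) < mysqlPercent;
        return useMysql ? newRow : oldRow;
    }
}
```

Raising mysqlPercent step by step while mismatches stays at zero corresponds to the gradual traffic shift; the initial full data sync is outside the scope of this sketch.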
3. MySQL & ES Primary‑Backup Cluster
To guard against DAL component failures, data is also written to ES. If MySQL or the DAL fails, the system can switch reads to ES and later resynchronize once MySQL recovers.
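A minimal sketch of the read fallback, assuming a MySQL or DAL failure surfaces as a runtime exception. FallbackReader and its two function parameters are hypothetical; the article does not describe how the switch is actually triggered (it may well be a manual or configuration-driven switch rather than per-request).

```java
import java.util.function.Function;

class FallbackReader {
    private final Function<Long, String> mysqlRead; // read through the DAL
    private final Function<Long, String> esRead;    // read from the ES copy

    FallbackReader(Function<Long, String> mysqlRead, Function<Long, String> esRead) {
        this.mysqlRead = mysqlRead;
        this.esRead = esRead;
    }

    // Try MySQL first; if the DAL or MySQL throws, fall back to ES.
    String read(long memberId) {
        try {
            return mysqlRead.apply(memberId);
        } catch (RuntimeException e) {
            return esRead.apply(memberId);
        }
    }
}
```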
Abnormal Member Relationship Governance
Complex bugs can cause cross-account binding errors, which lead to privacy breaches and financial loss. Detailed detection logic and code-level safeguards were implemented to prevent such anomalies.
Future: Fine‑Grained Flow Control and Degradation
More precise flow‑control rules will target hot‑spot accounts, per‑caller thresholds, and global limits to protect the system from extreme traffic spikes.
1. Fine‑Grained Flow‑Control Strategies
Three kinds of rules are planned: hot-spot throttling for abusive accounts, per-caller limits to catch buggy loops, and a global ceiling that fails fast on any traffic beyond the system's 30k TPS capacity.
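The three levels of limits could be combined along these lines. This is a simplified fixed-window counter, not thread-safe, and is not the system's actual design; FlowController and its parameters are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class FlowController {
    private final int accountLimit; // per account, per window
    private final int callerLimit;  // per calling service, per window
    private final int globalLimit;  // whole system, per window
    private final Map<String, Integer> perAccount = new HashMap<>();
    private final Map<String, Integer> perCaller = new HashMap<>();
    private int global = 0;

    FlowController(int accountLimit, int callerLimit, int globalLimit) {
        this.accountLimit = accountLimit;
        this.callerLimit = callerLimit;
        this.globalLimit = globalLimit;
    }

    // Returns true if the request is admitted in the current window.
    boolean tryAcquire(String accountId, String callerId) {
        if (global >= globalLimit) return false;
        if (perAccount.getOrDefault(accountId, 0) >= accountLimit) return false;
        if (perCaller.getOrDefault(callerId, 0) >= callerLimit) return false;
        global++;
        perAccount.merge(accountId, 1, Integer::sum);
        perCaller.merge(callerId, 1, Integer::sum);
        return true;
    }

    // Would be called by a timer at the start of each window, e.g. every second.
    void resetWindow() {
        perAccount.clear();
        perCaller.clear();
        global = 0;
    }
}
```

Rejected requests do not consume any quota, so a throttled hot account cannot use up the global ceiling for everyone else.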
2. Fine‑Grained Degradation Strategies
Degradation will be triggered by the average response time or error rate of downstream services, with automatic circuit breaking when thresholds are exceeded.
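An error-rate circuit breaker in its simplest form might look like the sketch below. The class ErrorRateBreaker, the minimum-call guard, and the manual reset are assumptions; a production breaker would normally add a sliding window and a half-open probing state, and a response-time rule would follow the same pattern.

```java
class ErrorRateBreaker {
    private final double maxErrorRate; // e.g. 0.5 means 50% errors
    private final int minCalls;        // do not trip on a tiny sample
    private int calls = 0;
    private int errors = 0;
    private boolean open = false;

    ErrorRateBreaker(double maxErrorRate, int minCalls) {
        this.maxErrorRate = maxErrorRate;
        this.minCalls = minCalls;
    }

    // Record the outcome of a downstream call and trip if the rate is too high.
    void record(boolean success) {
        calls++;
        if (!success) {
            errors++;
        }
        if (calls >= minCalls && (double) errors / calls > maxErrorRate) {
            open = true;
        }
    }

    // While open, callers should skip the downstream call and degrade.
    boolean allowRequest() {
        return !open;
    }

    // Closes the breaker and clears the statistics.
    void reset() {
        calls = 0;
        errors = 0;
        open = false;
    }
}
```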
Source: DBAPlus
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
