Designing a High‑Performance Membership System Using ES, Redis, and MySQL
This article details how a large‑scale membership platform achieves high performance and high availability by employing a dual‑center Elasticsearch cluster, traffic‑isolated ES clusters, deep ES optimizations, Redis caching with distributed locks, dual‑center MySQL partitioning, seamless data migration, and fine‑grained flow‑control and degradation strategies.
1. Background
The membership system is a core service tightly coupled with the order flow of all business lines; any failure prevents users from placing orders, affecting the entire company. Therefore, it must provide high performance, high availability, and stable, efficient service.
After the merger of Tongcheng and eLong, multiple platforms (the Tongcheng app, the eLong app, and WeChat mini-programs) needed a unified member relationship for cross-marketing and other scenarios. Request volume and concurrency surged as a result, exceeding 20k TPS during peak holidays. This article explains how the system maintains high performance and availability under such load.
2. Elasticsearch High‑Availability Solution
1. Dual‑Center Primary‑Backup ES Cluster Architecture
With over a billion members across platforms, Elasticsearch stores unified member relationships. Two data centers (A and B) host the primary and backup ES clusters respectively; writes go to the primary cluster and are synchronized to the backup via MQ. If the primary fails, traffic is switched to the backup cluster, and after recovery the data is synced back.
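The primary-backup flow can be sketched as follows. This is a minimal in-memory illustration, not the platform's implementation: `FakeCluster`, `DualCenterWriter`, and the `deque` standing in for the MQ are all hypothetical names introduced here.

```python
from collections import deque


class FakeCluster:
    """In-memory stand-in for an ES cluster (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.docs[doc_id] = doc


class DualCenterWriter:
    """Writes to the active cluster and queues each write for the other one."""

    def __init__(self, primary, backup):
        self.active, self.standby = primary, backup
        self.queue = deque()  # stands in for the MQ between data centers

    def write(self, doc_id, doc):
        self.active.index(doc_id, doc)
        self.queue.append((doc_id, doc))

    def drain(self):
        """Replay queued writes onto the standby; stop at the first failure."""
        while self.queue:
            doc_id, doc = self.queue[0]
            try:
                self.standby.index(doc_id, doc)
            except ConnectionError:
                return False
            self.queue.popleft()
        return True

    def fail_over(self):
        """Swap roles, e.g. when the primary data center becomes unavailable."""
        self.active, self.standby = self.standby, self.active


primary, backup = FakeCluster("A"), FakeCluster("B")
writer = DualCenterWriter(primary, backup)
writer.write("m1", {"level": "gold"})
writer.drain()                          # backup now holds m1
primary.healthy = False
writer.fail_over()                      # traffic moves to cluster B
writer.write("m2", {"level": "silver"})
synced_while_down = writer.drain()      # A is down, so m2 stays queued
primary.healthy = True
synced_after_recovery = writer.drain()  # m2 is replayed onto A
```

Because writes are keyed by document ID, replaying a queued write onto a cluster that already has it is harmless in this sketch.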
2. Traffic‑Isolated Three‑Cluster ES Architecture
To protect the primary ES cluster from marketing‑spike traffic, a separate ES cluster handles high‑TPS marketing requests, isolating them from the main cluster that serves order‑critical traffic.
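One way to picture the isolation is a routing function in front of the clusters. The cluster names and request sources below are hypothetical; the source does not say how requests are classified.

```python
# Hypothetical cluster names and request sources, for illustration only.
MAIN_CLUSTER = "es-main"
MARKETING_CLUSTER = "es-marketing"

MARKETING_SOURCES = {"coupon_campaign", "flash_sale", "push_promotion"}


def pick_cluster(request_source: str) -> str:
    """Route marketing traffic away from the cluster that serves order flows."""
    if request_source in MARKETING_SOURCES:
        return MARKETING_CLUSTER
    return MAIN_CLUSTER
```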
3. Deep ES Cluster Optimizations
Balance shard distribution to avoid hotspot nodes.
Cap the thread-pool size at (CPU cores × 3 / 2) + 1; more threads than that mostly add context-switching overhead.
Keep each shard's size at or below 50 GB.
Remove unnecessary "text" fields, keeping only "keyword" for member queries.
Use filter context instead of query context so that ES skips relevance scoring.
Perform result sorting in the member service JVM to reduce ES load.
Add routing keys to direct queries to specific shards.
These optimizations dramatically reduced CPU usage and improved query latency; the source article illustrates this with performance charts.
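Three of the items above (the thread-pool bound, filter context, and routing) can be sketched in a few lines. The field name `member_id` and its use as the routing key are assumptions for illustration; the source does not name the actual fields.

```python
def max_search_threads(cpu_cores: int) -> int:
    """Upper bound quoted above: cpu cores * 3 / 2 + 1 (integer division)."""
    return cpu_cores * 3 // 2 + 1


def member_lookup(member_id: str) -> dict:
    """Build request parameters for a single-member lookup.

    `member_id` is a hypothetical keyword field that is also used as the
    routing key, so the search only touches the shard that holds the member.
    """
    return {
        "routing": member_id,
        "body": {
            "query": {
                "bool": {
                    # filter context: no relevance score is computed
                    "filter": [{"term": {"member_id": member_id}}]
                }
            }
        },
    }
```

For a 16-core node the bound works out to 25 threads. The `body` follows the standard ES bool/filter query shape, and `routing` would be passed as the search request's routing parameter.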
3. Member Redis Caching Scheme
1. Solving Redis Inconsistency Caused by ES Near‑Real‑Time Delay
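Because an ES update only becomes visible to searches after about one second, a race condition can arise: a request that misses the cache right after an update reads the old document from ES and writes that stale data back to Redis. The solution is to take a 2-second distributed lock in Redis when updating ES and then delete the cache entry. While the lock is held, a reader that misses the cache may still query ES, but it must not write the result back to Redis.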
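The lock protocol can be sketched with in-memory stand-ins. `FakeRedis`, `FakeES`, the key prefixes, and the injected clock are hypothetical helpers for illustration; a real deployment would use the Redis client's own expiring-key commands.

```python
import time


class FakeRedis:
    """Tiny in-memory stand-in supporting the few commands used here."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.data = {}  # key -> (value, expires_at or None)

    def _live(self, key):
        entry = self.data.get(key)
        if entry and entry[1] is not None and entry[1] <= self.clock():
            del self.data[key]  # lazily expire the key
            return None
        return entry

    def get(self, key):
        entry = self._live(key)
        return entry[0] if entry else None

    def set(self, key, value, ex=None):
        self.data[key] = (value, self.clock() + ex if ex else None)

    def delete(self, key):
        self.data.pop(key, None)

    def exists(self, key):
        return self._live(key) is not None


class FakeES:
    """Writes become searchable only after refresh(), mimicking the ~1 s delay."""

    def __init__(self):
        self.pending, self.visible = {}, {}

    def index(self, doc_id, doc):
        self.pending[doc_id] = doc

    def refresh(self):
        self.visible.update(self.pending)
        self.pending.clear()

    def search(self, doc_id):
        return self.visible.get(doc_id)


LOCK_TTL_SECONDS = 2  # longer than the ~1 s ES visibility delay


def update_member(redis, es, member_id, doc):
    """Writer: take the lock, update ES, then invalidate the cache."""
    redis.set(f"lock:{member_id}", "1", ex=LOCK_TTL_SECONDS)
    es.index(member_id, doc)
    redis.delete(f"member:{member_id}")


def read_member(redis, es, member_id):
    """Reader: fall back to ES on a miss, but skip the cache fill while locked."""
    cached = redis.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es.search(member_id)
    if doc is not None and not redis.exists(f"lock:{member_id}"):
        redis.set(f"member:{member_id}", doc)
    return doc


now = [0.0]  # controllable clock so the example does not need to sleep
redis, es = FakeRedis(clock=lambda: now[0]), FakeES()
es.index("m1", "v1")
es.refresh()
first = read_member(redis, es, "m1")          # miss, fills the cache with v1
update_member(redis, es, "m1", "v2")          # v2 is not searchable yet
stale = read_member(redis, es, "m1")          # still v1, but not cached
cached_during_lock = redis.get("member:m1")
es.refresh()
now[0] = 2.5                                  # lock has expired
fresh = read_member(redis, es, "m1")          # v2, now safe to cache
```

The reader in the middle still sees the old value once, but the stale value never reaches Redis, so the inconsistency is bounded by the ES visibility delay rather than by the cache lifetime.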
2. Dual‑Center Multi‑Cluster Redis Architecture
Both data centers deploy a Redis cluster. Writes are performed to both clusters (dual‑write) and succeed only when both succeed. Reads are served locally to minimize latency, ensuring continuous service even if one data center fails.
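A minimal sketch of the dual-write rule, using plain dictionaries as stand-ins for the two Redis clusters. `DualCenterCache` and `DownCluster` are hypothetical names; what the caller does when a write reports failure (retry, invalidate) is not described in the source and is left open here.

```python
class DownCluster(dict):
    """Stand-in for an unreachable Redis cluster: every write fails."""

    def __setitem__(self, key, value):
        raise ConnectionError("cluster unreachable")


class DualCenterCache:
    def __init__(self, local, remote):
        self.local, self.remote = local, remote

    def write(self, key, value) -> bool:
        """Report success only if both data centers accepted the write."""
        ok = True
        for cluster in (self.local, self.remote):
            try:
                cluster[key] = value
            except ConnectionError:
                ok = False
        return ok

    def read(self, key):
        """Reads never leave the local data center."""
        return self.local.get(key)


dc_a, dc_b = {}, {}
cache = DualCenterCache(local=dc_a, remote=dc_b)
both_ok = cache.write("member:1", "gold")

degraded = DualCenterCache(local=dc_a, remote=DownCluster())
one_failed = degraded.write("member:1", "platinum")
```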
4. High‑Availability Member Primary‑Database Scheme
1. MySQL Dual‑Center Partitioned Cluster
Member data (over a billion rows) is split into more than 1,000 shards, each holding about a million rows. The cluster uses a 1‑master‑3‑slave architecture with the master in data center A and slaves in data center B, synchronized via a dedicated link with sub‑millisecond latency. Reads are routed locally, writes go to the master.
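Shard selection and read/write routing might look like the sketch below. The shard count of 1024, the CRC32 hash, and the node names are assumptions; the source only states "more than 1,000" shards, one master in data center A, and three slaves in data center B.

```python
import zlib

SHARD_COUNT = 1024  # assumption: the source only says "more than 1,000"
MASTER_IN_A = "A/master"
SLAVES_IN_B = ["B/slave-1", "B/slave-2", "B/slave-3"]


def shard_for(member_id: str) -> int:
    """Map a member ID to a shard with a stable hash."""
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def pick_node(is_write: bool, caller_dc: str, member_id: str) -> str:
    """Writes always hit the master; reads stay in the caller's data center."""
    if is_write or caller_dc == "A":
        return MASTER_IN_A
    index = zlib.crc32(member_id.encode("utf-8")) % len(SLAVES_IN_B)
    return SLAVES_IN_B[index]
```

With roughly a billion rows, 1024 shards come to a little under a million rows each, which matches the per-shard figure above.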
2. Seamless Migration from SQL Server to MySQL
Zero‑downtime full data sync during low‑traffic periods.
Real‑time dual‑write to both databases with retry logic.
Gradual traffic gray‑release from SQL Server to MySQL (A/B testing), verifying result consistency before full cut‑over.
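The dual-write and gray-release steps can be sketched as below. The retry count, the percentage-based bucketing, and every name (`dual_write`, `reads_from_new`, `results_match`, `FlakyDB`) are hypothetical; dictionaries stand in for SQL Server and MySQL.

```python
import zlib


class FlakyDB(dict):
    """Stand-in database whose first `failures` writes raise an error."""

    def __init__(self, failures):
        super().__init__()
        self.failures = failures

    def __setitem__(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("transient write failure")
        super().__setitem__(key, value)


def dual_write(old_db, new_db, key, row, max_retries=3):
    """Write to the old database first, then retry the write to the new one."""
    old_db[key] = row
    for _ in range(max_retries):
        try:
            new_db[key] = row
            return True
        except ConnectionError:
            continue
    return False  # a caller would record this key for later reconciliation


def reads_from_new(member_id: str, percent: int) -> bool:
    """Gray release: send a stable `percent` of members to the new database."""
    return zlib.crc32(member_id.encode("utf-8")) % 100 < percent


def results_match(old_db, new_db, key) -> bool:
    """Consistency check run before widening the gray-release percentage."""
    return old_db.get(key) == new_db.get(key)


sql_server, mysql = {}, FlakyDB(failures=2)
written = dual_write(sql_server, mysql, "m1", {"level": "gold"})
```

Hashing the member ID keeps each member on the same side of the split as the percentage grows, which makes mismatches easier to trace.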
3. MySQL and ES Primary‑Backup Cluster Combination
If the DAL component or MySQL fails, reads/writes can be switched to Elasticsearch; once MySQL recovers, data is synchronized back and traffic is switched back.
5. Abnormal Member Relationship Governance
Complex bugs can cause cross‑account binding, leading to privacy breaches and financial loss. The team identified abnormal accounts through intricate logic and patched the vulnerabilities at the code level.
6. Outlook: Finer‑Grained Flow‑Control and Degradation Strategies
1. More Precise Flow‑Control
Hotspot control for accounts generating massive duplicate requests.
Per‑caller flow rules to limit traffic from buggy integrations.
Global flow limits to protect the system from traffic spikes beyond its capacity.
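The three scopes above can share one limiter. The following is a rough fixed-window sketch with made-up limits, not the team's planned design; `FlowLimiter` and its parameters are hypothetical, and the caller supplies the timestamp so the behavior is easy to follow.

```python
from collections import defaultdict


class FlowLimiter:
    """Fixed-window counters for three scopes: account, caller, and global."""

    def __init__(self, account_limit, caller_limit, global_limit, window=1.0):
        self.limits = {
            "account": account_limit,
            "caller": caller_limit,
            "global": global_limit,
        }
        self.window = window
        self.window_start = 0.0
        self.counts = defaultdict(int)

    def allow(self, account, caller, now):
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window:
            self.window_start = now
            self.counts.clear()
        keys = [("account", account), ("caller", caller), ("global", "*")]
        # Reject if any scope is already at its limit; rejected requests
        # do not consume quota.
        if any(self.counts[key] >= self.limits[key[0]] for key in keys):
            return False
        for key in keys:
            self.counts[key] += 1
        return True


limiter = FlowLimiter(account_limit=2, caller_limit=3, global_limit=5)
decisions = [
    limiter.allow("u1", "app", 0.0),  # allowed
    limiter.allow("u1", "app", 0.1),  # allowed
    limiter.allow("u1", "app", 0.2),  # rejected: hotspot account
    limiter.allow("u2", "app", 0.3),  # allowed
    limiter.allow("u3", "app", 0.4),  # rejected: caller limit
    limiter.allow("u3", "web", 0.5),  # allowed
    limiter.allow("u4", "web", 0.6),  # allowed
    limiter.allow("u5", "web", 0.7),  # rejected: global limit
    limiter.allow("u1", "app", 1.0),  # allowed: new window
]
```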
2. More Precise Degradation
Degrade based on average response time of dependent services.
Degrade based on error count or error ratio within a time window.
The team plans to audit all caller accounts to enable these fine‑grained controls.
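The two degradation triggers can be sketched as a sliding-window rule. The thresholds, the `min_calls` guard, and the `DegradeRule` class are illustrative assumptions; the source names the criteria but not how they would be implemented.

```python
from collections import deque


class DegradeRule:
    """Degrade a dependency on slow average response time or high error ratio."""

    def __init__(self, window_seconds, max_avg_ms, max_error_ratio, min_calls=5):
        self.window_seconds = window_seconds
        self.max_avg_ms = max_avg_ms
        self.max_error_ratio = max_error_ratio
        self.min_calls = min_calls  # avoid tripping on a handful of samples
        self.calls = deque()        # (timestamp, elapsed_ms, ok)

    def _trim(self, now):
        while self.calls and now - self.calls[0][0] > self.window_seconds:
            self.calls.popleft()

    def record(self, now, elapsed_ms, ok):
        self.calls.append((now, elapsed_ms, ok))
        self._trim(now)

    def should_degrade(self, now):
        self._trim(now)
        total = len(self.calls)
        if total < self.min_calls:
            return False
        avg_ms = sum(call[1] for call in self.calls) / total
        error_ratio = sum(1 for call in self.calls if not call[2]) / total
        return avg_ms > self.max_avg_ms or error_ratio > self.max_error_ratio


rule = DegradeRule(window_seconds=10, max_avg_ms=200, max_error_ratio=0.5)
for second in range(5):
    rule.record(second, elapsed_ms=20, ok=True)
healthy = rule.should_degrade(4)      # fast and error-free

for _ in range(6):
    rule.record(5, elapsed_ms=30, ok=False)
failing = rule.should_degrade(5)      # 6 of 11 calls failed

recovered = rule.should_degrade(100)  # the window has emptied
```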
