Building a High‑Performance, Highly Available Membership System with ES, Redis & MySQL
To ensure the massive, multi‑platform membership service remains fast and reliable, this article details a multi‑center architecture using Elasticsearch for unified member data, Redis caching, and MySQL partitioning, along with traffic isolation, fault‑tolerant syncing, and fine‑grained flow‑control and degradation strategies.
Background
The membership system is a core service tightly coupled with the order‑processing flow of all business lines. Any failure blocks user ordering across the entire company, so the system must guarantee high performance, high availability, and stable, efficient service.
After the merger of Tongcheng and eLong, the unified member data must be accessible from multiple platforms (Tongcheng APP, eLong APP, both WeChat mini‑programs, etc.). Cross‑marketing scenarios, such as sending a hotel coupon after a train ticket purchase, require querying a unified member relationship that spans different member schemas. The request volume has grown to over 20,000 TPS during peak holidays, prompting a deep dive into high‑performance, high‑availability design.
Elasticsearch High‑Availability方案
1. Dual‑center primary‑backup ES cluster
Both companies together hold more than ten hundred million members. To store the complex query dimensions (phone, WeChat unionid, eLong card number, etc.), Elasticsearch (ES) is used as the unified store.
When a node fails, its replica shard is promoted to primary, keeping the service alive. However, a single‑data‑center deployment cannot survive a whole‑data‑center outage, hardware failure, or massive traffic spikes. The solution is a dual‑center primary‑backup architecture:
ES primary cluster resides in Data Center A, the backup cluster in Data Center B. Writes go to the primary; data is synchronized to the backup via MQ. If the primary fails, configuration switches reads/writes to the backup instantly, and after recovery the data is synced back.
2. Traffic‑Isolation Three‑Cluster Architecture
During a holiday marketing event, a single request looped more than ten times against the member system, causing TPS to surge and nearly crash ES. To protect the core order‑flow, request sources are classified into two priority groups:
High‑priority: requests directly tied to user ordering; must be guaranteed.
Low‑priority: marketing‑related requests that can tolerate degradation.
A dedicated ES cluster handles the high‑TPS marketing traffic, isolating it from the primary cluster. This prevents marketing spikes from affecting order processing.
3. Deep ES Optimizations
Further tuning of the primary ES cluster addressed recurring alerts during peak periods:
Uneven shard distribution caused hot nodes; shards were re‑balanced to keep each node’s load similar.
Thread‑pool size was reduced to cpu_cores * 3 / 2 + 1 to avoid excessive context switching.
Shard memory was limited to ≤50 GB; oversized shards (≈100 GB) slowed queries.
String fields were stored only as keyword (no duplicated text) to halve storage.
Queries used filter instead of query because relevance scoring is unnecessary for member look‑ups.
Result sorting was moved to the member service JVM to reduce ES CPU load.
Routing keys were added so queries target only relevant shards, cutting unnecessary network traffic.
These changes dramatically lowered CPU usage and improved query latency, as shown in the following charts:
Redis Cache方案
Originally the member system did not use caching because ES already delivered sub‑5 ms latency at 30k TPS. However, a sudden “blind‑box” ticket promotion generated extreme spikes, prompting a cache layer for safety.
1. Solving ES Near‑Real‑Time Delay Inconsistency
ES’s near‑real‑time indexing means a newly added document becomes searchable only after ~1 second. During this window, a request may read stale data from ES and write it back to Redis, causing inconsistency. The fix adds a 2‑second distributed lock on the member ID when updating ES, then deletes the Redis entry. Subsequent reads acquire the lock; if it is held, they skip writing back to Redis and return the fresh ES result directly. Potential race conditions between “delete cache” and “write cache” are eliminated by making the two operations mutually exclusive.
2. Dual‑Center Multi‑Cluster Redis
Redis clusters are deployed in both Data Center A and B. Writes are performed to both clusters (dual‑write); reads are served locally to minimize latency. If one data center fails, the other continues to provide full member service.
High‑Availability Primary Database方案
Member registration details reside in a relational database. The original SQL Server reached physical limits after storing >10 billion rows.
1. MySQL Dual‑Center Partitioned Cluster
We split the member table into >1,000 shards (≈1 million rows each) and deployed a 1‑master‑3‑slave MySQL cluster. The master lives in Data Center A, slaves in Data Center B, synchronized over a dedicated line with <1 ms lag. Reads are routed locally; writes go to the master. Stress tests showed >20k TPS with average latency <10 ms.
2. Seamless Migration from SQL Server to MySQL
Migration had three challenges: zero‑downtime, complex legacy interfaces, and consistency of both historical and real‑time data. The solution combines:
Full data sync from SQL Server to MySQL.
Real‑time dual‑write (SQL Server primary, MySQL secondary) with retry logic and logging.
Gradual traffic gray‑scale using A/B routing: start with 1 % of reads on MySQL, monitor consistency, then ramp up to 100 %.
3. MySQL + ES Primary‑Backup Cluster
To guard against DAL component failures, we also replicate member data to an ES cluster. If the DAL or MySQL goes down, reads/writes can be switched to ES, and once MySQL recovers, data is synchronized back.
Abnormal Member Relationship Governance
Incorrect cross‑binding (e.g., an APP account linked to another user’s WeChat) can expose orders and allow unauthorized cancellations. Complex detection logic identifies such anomalies, and the service layer is hardened to prevent further occurrences.
Future: Fine‑Grained Flow Control & Degradation
1. Refined Flow‑Control
Hot‑spot limiting for accounts that generate massive duplicate requests.
Per‑caller flow rules to curb bugs that cause request loops.
Global thresholds to reject traffic that exceeds the system’s 30k TPS capacity, preserving core functionality.
2. Refined Degradation
Average response‑time based degradation: if dependent services exceed latency thresholds, the member API enters a pre‑degrade state and may circuit‑break.
Exception‑count and exception‑ratio based degradation: sustained high error rates trigger automatic circuit‑break.
Improving caller‑account governance is essential; each internal team must register a dedicated account with clear usage scenarios to enable precise flow‑control and degradation policies.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Linux Cloud Computing Practice
Welcome to Linux Cloud Computing Practice. We offer high-quality articles on Linux, cloud computing, DevOps, networking and related topics. Dive in and start your Linux cloud computing journey!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
