How We Built a 20k TPS Highly Available Membership System with ES, Redis, and MySQL
This article explains how a large‑scale membership platform achieved over 20,000 TPS with high performance and zero downtime by deploying dual‑center Elasticsearch clusters, traffic‑isolated three‑cluster ES architecture, Redis caching with distributed locks, and a seamless MySQL migration and HA strategy.
Background
The membership system is a core service for all business lines; any outage blocks order placement across the entire company. After the merger of Tongcheng and eLong, the unified member base exceeded one billion records and traffic surged to more than 20 k TPS during peak holidays, demanding a highly performant and highly available architecture.
Elasticsearch High‑Availability Architecture
Elasticsearch stores the unified member relationship because of its ability to handle massive data volumes and diverse query dimensions (phone, WeChat UnionID, eLong card number, etc.). A dual‑center primary‑backup ES cluster is deployed: the primary cluster resides in Data Center A, the backup in Data Center B. Writes go to the primary; data is replicated to the backup via MQ. If the primary fails, traffic is switched to the backup within seconds, and after recovery the data is synchronized back.
To protect against traffic spikes from marketing campaigns, a three‑cluster isolation model is added: a dedicated ES cluster handles high‑TPS marketing‑related queries, while the primary cluster serves order‑critical requests. This prevents a single burst from affecting the main order flow.
Deep ES optimizations include:
Balancing shard distribution to avoid hot nodes.
Setting thread‑pool size to cpu_cores * 3 / 2 + 1 to prevent CPU overload.
Limiting shard memory to ≤ 50 GB.
Removing duplicate text fields when keyword suffices.
Using filter instead of query for non‑scoring member lookups.
Moving result sorting to the member service JVM.
Adding routing keys to direct queries to specific shards.
These changes reduced CPU usage dramatically and cut query latency, as shown by the CPU‑usage and interface‑response charts.
Redis Caching Strategy
Initially the system avoided caching because ES already delivered sub‑5 ms latency and real‑time consistency was required. However, a sudden ticket‑box promotion generated extreme concurrency, prompting the introduction of a Redis cache.
Because ES is near‑real‑time (≈ 1 s delay), a write‑then‑read race could cause stale data to be written back to Redis. The solution adds a 2‑second distributed lock when updating ES, deletes the corresponding Redis entry, and only repopulates the cache after the lock expires, preventing inconsistent writes.
A dual‑center multi‑cluster Redis deployment mirrors the ES HA design: each data center runs a full Redis cluster, writes are performed to both clusters, and reads are served locally. This ensures that even if one data center fails, the other can continue serving cached member data.
Member Primary Database Migration
The original SQL Server instance reached physical limits after storing over a billion rows. A dual‑center MySQL partitioned cluster was built, splitting the member table into > 1 000 shards (≈ 1 M rows each) with a 1‑master‑3‑slave topology (master in Data Center A, slaves in Data Center B). Network latency between centers is < 1 ms.
Migration follows a three‑stage plan:
Full data sync from SQL Server to MySQL.
Real‑time dual‑write: every new write goes to both databases; failures trigger retries and alerting.
Gradual gray‑release of read traffic via an A/B platform, starting at 1 % and ramping to 100 % after consistency checks.
During the gray‑release, asynchronous comparison of query results from both databases logs any discrepancy for manual investigation.
Exception Member Governance
Abnormal member bindings (e.g., an APP account mistakenly linked to another user’s WeChat account) can cause cross‑account order visibility and cancellations, leading to severe customer complaints. Complex detection logic identifies such anomalies, and the member API layer was hardened to block the faulty paths.
Future Fine‑Grained Flow Control and Degradation
Recognizing that no system can be 100 % error‑free, the team proposes more granular flow‑control and degradation policies:
Hot‑account throttling for accounts generating excessive requests (e.g., black‑market abuse).
Per‑caller flow limits to guard against buggy client loops.
Global rate limiting to cap total TPS at the system’s safe threshold (≈ 30 k TPS), allowing excess traffic to fail fast.
Degradation based on average response time, exception count, and exception ratio, with automatic circuit‑breaker activation.
Additionally, a systematic audit of all internal caller accounts will enable precise rule assignment for each usage scenario.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
