Dual‑Center Elasticsearch & Multi‑Cluster Redis Power 20k+ TPS for Billion‑User Membership
This article explains how a large‑scale membership system serving over a billion users achieved high performance and availability by deploying dual‑center Elasticsearch clusters, traffic‑isolated ES clusters, deep ES optimizations, a Redis caching layer with dual‑center replication, and a seamless migration from SQL Server to sharded MySQL, while also outlining future fine‑grained flow‑control and degradation strategies.
Background
The membership system is a core service that supports order placement across all business lines; any failure blocks user orders company‑wide. After the merger of two companies, the system must handle unified member queries across multiple platforms (apps, mini‑programs) and cope with traffic spikes exceeding 20,000 TPS during holidays.
ES High‑Availability Solution
To store unified member relationships, Elasticsearch (ES) is used. A dual‑center primary‑backup ES cluster is deployed: the primary cluster resides in Data Center A, the backup in Data Center B. Reads/writes go to the primary; data is synchronized to the backup via MQ. If the primary fails, configuration switches reads/writes to the backup with minimal downtime, then syncs data back once the primary recovers.
Because a single marketing campaign once caused TPS to surge and threaten the ES cluster, traffic is further isolated into three ES clusters: a primary cluster for critical order‑flow requests, a dedicated high‑TPS cluster for marketing/flash‑sale traffic, and a backup cluster for redundancy.
Deep ES optimizations include:
Balancing shard distribution to avoid hotspot nodes.
Setting thread‑pool size to cpu_cores * 3 / 2 + 1 to prevent CPU spikes.
Limiting shard memory to ≤50 GB.
Removing unnecessary text fields, keeping only keyword for exact matches.
Using filter instead of query to skip relevance scoring.
Moving result sorting to the application JVM.
Adding routing keys to direct queries to specific shards.
These measures reduced CPU usage dramatically and improved query latency, as shown by the performance charts.
Member Redis Cache Solution
Initially the system avoided caching to keep data strictly consistent, but a flash‑sale event exposed the need for a cache. A Redis layer was added with a dual‑center multi‑cluster architecture: writes are performed in both data centers, and reads are served locally to minimize latency. In case of a data‑center outage, the other center continues serving.
To handle ES’s near‑real‑time delay (≈1 s) that caused cache inconsistency, a 2‑second distributed lock is acquired before deleting the Redis entry after an ES update. If a concurrent read obtains the lock, it skips updating the cache, preventing stale data from being written back.
Additional concurrency safeguards ensure that delete‑and‑update operations on the cache are mutually exclusive, eliminating race conditions.
High‑Availability Member Primary Database Solution
The original SQL Server instance reached physical limits with over a billion rows. The migration path chosen is a dual‑center MySQL partitioned cluster: >1,000 shards (≈1 M rows each), 1‑master‑3‑slave topology, with the master in Data Center A and slaves in Data Center B. Synchronous writes to both data centers guarantee availability; reads are routed locally.
Migration steps:
Full data sync from SQL Server to MySQL during low‑traffic windows.
Enable real‑time dual‑write: primary writes to SQL Server, asynchronous writes to MySQL with retry logic (3 attempts, then log).
Incremental sync to bridge the gap between full sync and dual‑write.
Gradual traffic gray‑release: start with 100 % reads from SQL Server, shift 1 % to MySQL, monitor consistency, then increase until 100 % reads come from MySQL.
If MySQL or its DAL component fails, reads/writes can be switched to ES, and once MySQL recovers, data is synchronized back.
Abnormal Member Relationship Governance
Complex bugs caused cross‑account binding errors, leading to users seeing or modifying each other's orders. A thorough analysis identified abnormal accounts, and the member APIs were hardened with additional validation logic to block such scenarios.
Future: More Fine‑Grained Flow‑Control and Degradation Strategies
Planned enhancements include:
Hotspot throttling for accounts that generate excessive duplicate requests.
Per‑caller flow‑control rules to limit bursts caused by buggy client code.
Global flow‑control to cap overall TPS at the system’s safe threshold (≈30k TPS), rejecting excess traffic early.
Degradation based on average response time and error rates of downstream services, with automatic circuit‑breaker activation.
Additionally, a systematic audit of all member‑service caller accounts will be performed to align them with appropriate flow‑control and degradation policies.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Senior Brother's Insights
A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
