Achieving 20k+ TPS with Dual‑Center ES, Redis & MySQL for a Membership System
This article details how a billion‑user membership platform achieved high performance and high availability by deploying dual‑center Elasticsearch clusters, traffic‑isolated ES architectures, extensive ES optimizations, a Redis caching layer with distributed locking, a dual‑center partitioned MySQL migration, and fine‑grained flow‑control and degradation strategies.
Background
The membership system is a core service that directly supports the order flow of all business lines; any failure blocks user ordering across the entire company. After the merger of two companies, multiple platforms (apps, mini‑programs) require a unified member relationship, pushing request volume to over 20,000 TPS during peak holidays.
ES High‑Availability Solution
1. Dual‑center primary‑backup ES cluster: the primary cluster runs in data center A, the backup in data center B, with data synchronized via MQ. If the primary fails, reads and writes are switched to the backup instantly, and after recovery the data is synced back.
2. Three‑cluster traffic isolation: a dedicated ES cluster handles high‑TPS marketing activities, keeping order‑critical traffic on the primary cluster.
3. Deep ES optimizations: balanced shard distribution, proper thread‑pool sizing (no more than cpu_core*3/2+1), shard memory limited to 50 GB, removal of unnecessary text fields, using filter queries instead of query, moving sorting to application memory, and adding routing keys. These changes dramatically reduced CPU usage and improved query latency.
Redis Caching Scheme
Initially the system did not use caching because ES performance was sufficient and data consistency was critical. However, a sudden traffic surge during a ticket blind‑box event prompted the introduction of a Redis cache. Near‑real‑time ES updates cause a 1‑second delay, leading to cache inconsistency; this is solved by applying a 2‑second distributed lock before deleting the cache entry, ensuring that stale data is not written back.
Redis is deployed in a dual‑center multi‑cluster mode with synchronous writes; reads are served locally to minimize latency, providing high availability even if one data center fails.
High‑Availability Member Primary Database
MySQL replaces the original SqlServer, which could no longer store the growing >1 billion member records. A dual‑center partitioned MySQL cluster with over 1,000 shards (≈1 million rows per shard) is used, employing a 1‑master‑3‑slave setup per data center. Writes go to the master in data center A, reads are served locally, achieving sub‑millisecond latency.
The migration follows a three‑stage approach: full data sync, incremental sync, and real‑time dual‑write. During the trial period, writes go to SqlServer while an asynchronous thread writes to MySQL with retries; after stability is confirmed, traffic is gradually shifted to MySQL using A/B gray‑release with validation of query results.
Failover strategy: if the DAL component or MySQL fails, reads/writes can be switched to Elasticsearch, and once the database recovers, data is synchronized back.
Abnormal Member Relationship Governance
Complex bugs could bind a user's app account to another user's WeChat account, exposing orders across accounts. A thorough analysis and code‑level safeguards were implemented to detect and correct such abnormal bindings.
Future Outlook: Fine‑Grained Flow Control and Degradation
Three levels of flow control are planned: hotspot control for accounts with excessive repeated requests, per‑caller rules to prevent bugs from causing traffic spikes, and global limits to reject traffic that exceeds the system’s capacity (e.g., >10 k TPS) while preserving normal traffic.
Degradation strategies include response‑time based circuit breaking, error‑rate based throttling, and stricter management of caller accounts to ensure appropriate flow‑control policies.
ITFLY8 Architecture Home
ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
