How We Achieved 20k+ TPS High Availability for a Billion‑User Membership System
This article details the design and implementation of a highly available, high‑performance membership system serving over a billion users, covering Elasticsearch dual‑center clusters, traffic isolation, Redis caching, MySQL migration, and fine‑grained flow‑control and degradation strategies.
1. Background
The membership system is a core service tightly linked to the order flow of all business lines; any failure blocks user orders across the entire company, so it must guarantee high performance and high availability.
After the merger of Tongcheng and eLong, multiple platforms (Tongcheng APP, eLong APP, WeChat mini‑programs) need a unified member relationship for cross‑marketing and other scenarios, leading to rapidly increasing request volume and peak TPS exceeding 20,000 during holidays.
2. Elasticsearch High‑Availability Solution
2.1 Dual‑Center Primary‑Backup Cluster
We deploy the primary ES cluster in Data Center A and a backup cluster in Data Center B. Reads and writes go to the primary; data is synchronized to the backup via MQ. In case of primary failure, the system switches reads/writes to the backup with minimal downtime, then syncs data back once the primary recovers.
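The switch-over flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: InMemoryCluster, DualCenterRouter, and the in-process deque standing in for the MQ are all hypothetical names, and real clusters would be reached through an ES client and a message broker.

```python
from collections import deque


class InMemoryCluster:
    """Hypothetical stand-in for one ES cluster."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.docs[doc_id] = doc

    def get(self, doc_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.docs.get(doc_id)


class DualCenterRouter:
    """Serves reads/writes from the active cluster and queues writes for the standby."""

    def __init__(self, primary, backup):
        self.active, self.standby = primary, backup
        self.sync_queue = deque()  # assumption: stands in for the MQ

    def write(self, doc_id, doc):
        self.active.index(doc_id, doc)
        self.sync_queue.append((doc_id, doc))

    def read(self, doc_id):
        return self.active.get(doc_id)

    def drain_sync_queue(self):
        """Replay queued writes onto the standby; do nothing while it is down."""
        while self.sync_queue and self.standby.healthy:
            doc_id, doc = self.sync_queue.popleft()
            self.standby.index(doc_id, doc)

    def fail_over(self):
        """Flush pending writes to the standby, then swap roles."""
        self.drain_sync_queue()
        self.active, self.standby = self.standby, self.active
```

After fail_over(), new writes are queued for the failed cluster and replayed by drain_sync_queue() once it is healthy again, which mirrors the "sync data back" step.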
2.2 Three‑Cluster Traffic Isolation
We categorize requests into two groups: high‑priority order‑related requests and high‑TPS marketing requests. A dedicated third ES cluster handles marketing traffic, isolating it from the primary and backup clusters so that marketing spikes cannot affect the order flow.
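A rough sketch of this routing rule, assuming requests are already tagged with a category by the caller. The cluster names and the pick_cluster helper are hypothetical and only illustrate the isolation idea.

```python
from enum import Enum


class RequestKind(Enum):
    ORDER = "order"          # high priority, on the order path
    MARKETING = "marketing"  # high TPS, e.g. promotions


# Hypothetical cluster names, used only for illustration.
PRIMARY, BACKUP, MARKETING_CLUSTER = "es-primary", "es-backup", "es-marketing"


def pick_cluster(kind, primary_healthy=True):
    """Marketing traffic never touches the clusters that serve orders."""
    if kind is RequestKind.MARKETING:
        return MARKETING_CLUSTER
    return PRIMARY if primary_healthy else BACKUP
```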
2.3 Deep ES Optimizations
Balance shard distribution to avoid hot nodes.
Set thread pool size to "cpu core * 3 / 2 + 1" to prevent CPU spikes.
Keep each shard's size at or below 50 GB.
Remove unnecessary text fields, keep only keyword for member queries.
Use filter instead of query to avoid scoring overhead.
Perform result sorting in the member service JVM.
Introduce routing keys to target specific shards.
After these optimizations, CPU usage dropped dramatically and query latency improved significantly.
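Several of the items above can be illustrated in a few lines. The sketch below assumes integer division in the thread pool formula and a keyword field named by the caller; the function names are hypothetical, and the returned dictionary only shows the shape of a filter-context search request with a routing value, not a call into any client library.

```python
def search_pool_size(cpu_cores):
    """Thread pool size from the formula "cpu core * 3 / 2 + 1" (integer division assumed)."""
    return cpu_cores * 3 // 2 + 1


def build_member_search(field, value, routing_key):
    """Build a search request that filters on a keyword field without scoring.

    A bool query with only a filter clause runs in filter context, so no
    relevance score is computed. The routing value limits the search to the
    shard that holds documents indexed with the same routing key.
    """
    return {
        "routing": routing_key,
        "body": {"query": {"bool": {"filter": [{"term": {field: value}}]}}},
    }


def sort_in_service(hits, key):
    """Sort results in the member service instead of asking ES to sort."""
    return sorted(hits, key=lambda hit: hit[key])
```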
3. Member Redis Caching Scheme
Initially, the system avoided caching due to near‑real‑time ES performance and the risk of stale data. However, a sudden spike from a blind‑box promotion forced us to introduce caching.
3.1 Solving ES‑to‑Redis Inconsistency
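Because ES updates only become visible after about 1 second, a query that runs right after an update can read the old document from ES and write it back to Redis, leaving stale data in the cache. To prevent this, every ES update first sets a distributed lock in Redis with a 2‑second expiry and deletes the cache entry. A query that finds the lock present still returns the ES result but skips the cache write, so stale data is never written back while the update is still propagating.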
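The locking rule can be sketched as below. FakeRedis is a tiny in-memory stand-in written for this example (it is not a real client), the key names are hypothetical, and a plain dict plays the role of ES; the order "set lock, delete cache, update ES" is an assumption about how the steps are sequenced.

```python
import time


class FakeRedis:
    """In-memory stand-in supporting set-with-expiry, get, and delete."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ex=None):
        expires_at = self._clock() + ex if ex is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily expire the key
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)


LOCK_TTL_SECONDS = 2  # longer than the ~1 s ES visibility delay


def update_member(redis, es, member_id, doc):
    redis.set(f"lock:member:{member_id}", "1", ex=LOCK_TTL_SECONDS)
    redis.delete(f"member:{member_id}")
    es[member_id] = doc


def query_member(redis, es, member_id):
    cached = redis.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es.get(member_id)
    # Skip the cache write while an update may not yet be visible in ES.
    if doc is not None and redis.get(f"lock:member:{member_id}") is None:
        redis.set(f"member:{member_id}", doc)
    return doc
```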
3.2 Dual‑Center Multi‑Cluster Redis
We deploy Redis clusters in both Data Center A and B. Writes are performed to both clusters; reads are served locally to reduce latency. This ensures service continuity even if one data center fails.
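A minimal sketch of the write-both, read-local pattern. CacheNode and DualCenterCache are hypothetical names, and the fallback read from the other data center when the local cluster is unreachable is an assumption added for illustration; the article only states that reads are served locally.

```python
class CacheNode:
    """Hypothetical stand-in for one data center's Redis cluster."""

    def __init__(self):
        self.data = {}
        self.up = True

    def set(self, key, value):
        if not self.up:
            raise ConnectionError("cache cluster unavailable")
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError("cache cluster unavailable")
        return self.data.get(key)


class DualCenterCache:
    def __init__(self, local, remote):
        self.local, self.remote = local, remote

    def set(self, key, value):
        """Write to both clusters and return how many writes succeeded."""
        succeeded = 0
        for node in (self.local, self.remote):
            try:
                node.set(key, value)
                succeeded += 1
            except ConnectionError:
                pass  # a real system would log and repair the missed write
        return succeeded

    def get(self, key):
        """Read locally; fall back to the other data center only on failure."""
        try:
            return self.local.get(key)
        except ConnectionError:
            return self.remote.get(key)
```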
4. High‑Availability Member Primary Database
4.1 MySQL Dual‑Center Partition Cluster
We migrated from a single SQL Server instance (over 1 billion rows) to a dual‑center MySQL cluster with >1,000 shards, each holding ~1 million rows. The cluster uses a 1‑master‑3‑slave architecture, with the master in Data Center A and the slaves in Data Center B, synchronized over a dedicated link with sub‑millisecond latency.
Stress tests showed >20k TPS with average latency under 10 ms.
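To make the partitioning concrete, here is one way a member ID could be mapped to a shard. The article does not say which shard key or hash is used, so the CRC32-modulo scheme, the shard count of 1024, and the table naming are all assumptions for illustration.

```python
import zlib

SHARD_COUNT = 1024  # assumption; the article only says ">1,000 shards"


def shard_for(member_id):
    """Map a member ID to a shard index with a hash that is stable across processes."""
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def table_name(member_id):
    """Hypothetical naming scheme: one table per shard, zero-padded."""
    return f"member_{shard_for(member_id):04d}"
```

A stable hash matters here: Python's built-in hash() is randomized per process for strings, so it would send the same member to different shards after a restart.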
4.2 Smooth Migration Strategy
We performed a full data sync from SQL Server to MySQL during low traffic, then enabled real‑time dual‑write (SQL Server + MySQL) with retry and logging mechanisms so that failed MySQL writes could be detected and repaired. A gradual gray release (A/B testing) shifted read traffic to MySQL in stages and validated consistency before reads were fully switched over.
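The dual-write and gray-release steps might look roughly like this. The function names, the retry count, and the percentage-bucket rollout are assumptions made for the sketch; the two writer callables stand in for the real data-access code.

```python
import logging
import zlib

log = logging.getLogger("member.migration")


def dual_write(write_old, write_new, record, retries=3):
    """Write to the old store first, then to the new store with retries.

    Returns True if the new-store write eventually succeeded. A failure is
    logged so the record can be repaired later instead of failing the request.
    """
    write_old(record)  # the old store remains the source of truth
    for attempt in range(1, retries + 1):
        try:
            write_new(record)
            return True
        except Exception as exc:
            log.warning("new-store write failed (attempt %d): %s", attempt, exc)
    log.error("giving up on new-store write for %r", record)
    return False


def read_from_new_store(member_id, rollout_percent):
    """Gray release: send a stable percentage of members to the new store."""
    bucket = zlib.crc32(member_id.encode("utf-8")) % 100
    return bucket < rollout_percent
```

Raising rollout_percent step by step moves more members to the new store, and each member stays in the same bucket, which makes comparison between the two stores easier.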
4.3 MySQL‑ES Primary‑Backup Cluster
To guard against DAL component failures, we also replicate member data to an ES cluster. If MySQL or DAL fails, reads/writes can be switched to ES, then synchronized back once MySQL recovers.
5. Abnormal Member Relationship Governance
We identified and fixed rare cases where user accounts became incorrectly bound across platforms, which could lead to privacy breaches and financial loss. Complex detection logic and code‑level safeguards were added to eliminate such anomalies.
6. Outlook: Finer‑Grained Flow‑Control and Degradation
6.1 Refined Flow‑Control
Hotspot limiting for accounts generating excessive duplicate requests.
Per‑caller flow rules to prevent bugs that cause request storms.
Global limits to reject traffic beyond the system’s 30k TPS capacity, protecting core services.
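The three layers above could be combined as in the following sketch. It uses a simple fixed one-second window counter rather than any particular rate-limiting library, and the class names and limits are hypothetical.

```python
import time
from collections import defaultdict


class WindowLimiter:
    """Fixed one-second window counter, keyed by an arbitrary string."""

    def __init__(self, limit, clock=time.monotonic):
        self.limit = limit
        self.clock = clock
        self.window = None
        self.counts = defaultdict(int)

    def allow(self, key):
        current = int(self.clock())
        if current != self.window:  # a new second started: reset all counters
            self.window = current
            self.counts.clear()
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


class FlowControl:
    """Checks the hotspot (per-account), per-caller, and global limits in order."""

    def __init__(self, global_limit, caller_limit, account_limit, clock=time.monotonic):
        self.by_account = WindowLimiter(account_limit, clock)
        self.by_caller = WindowLimiter(caller_limit, clock)
        self.overall = WindowLimiter(global_limit, clock)

    def allow(self, caller, account):
        # Note: a request rejected by a later layer has already consumed quota
        # in the earlier layers; that is acceptable for a sketch.
        return (
            self.by_account.allow(account)
            and self.by_caller.allow(caller)
            and self.overall.allow("all")
        )
```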
6.2 Refined Degradation
Degrade based on average response time of dependent services.
Degrade when exception count or ratio exceeds thresholds within a minute.
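Both degradation triggers can be evaluated from a sliding window of recent calls. The sketch below is an assumption-laden illustration: DegradeRule, its thresholds, and the minimum-sample guard are hypothetical, and only the exception ratio (not the raw count) is checked.

```python
import time
from collections import deque


class DegradeRule:
    """Decides whether to degrade a dependency based on the last window of calls."""

    def __init__(self, max_avg_ms, max_error_ratio, min_calls=5,
                 window_s=60, clock=time.monotonic):
        self.max_avg_ms = max_avg_ms
        self.max_error_ratio = max_error_ratio
        self.min_calls = min_calls  # avoid tripping on a handful of samples
        self.window_s = window_s
        self.clock = clock
        self.samples = deque()  # (timestamp, elapsed_ms, failed)

    def record(self, elapsed_ms, failed=False):
        self.samples.append((self.clock(), elapsed_ms, failed))

    def _trim(self):
        cutoff = self.clock() - self.window_s
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def should_degrade(self):
        self._trim()
        total = len(self.samples)
        if total < self.min_calls:
            return False
        avg_ms = sum(s[1] for s in self.samples) / total
        error_ratio = sum(1 for s in self.samples if s[2]) / total
        return avg_ms > self.max_avg_ms or error_ratio > self.max_error_ratio
```

A caller would record each dependency call and, while should_degrade() returns True, serve a fallback response instead of calling the dependency.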
We also plan to audit all member‑calling accounts to align them with appropriate flow‑control and degradation policies.
21CTO