How We Achieved 20k+ TPS High Availability for a Billion‑User Membership System

This article details the design and implementation of a highly available, high‑performance membership system serving over a billion users, covering Elasticsearch dual‑center clusters, traffic isolation, Redis caching, MySQL migration, and fine‑grained flow‑control and degradation strategies.

21CTO

1. Background

The membership system is a core service tightly linked to the order flow of all business lines; any failure blocks user orders across the entire company, so it must guarantee high performance and high availability.

After Tongcheng and eLong merged, their platforms (Tongcheng APP, eLong APP, WeChat mini-programs) needed a unified member relationship to support cross-marketing and similar scenarios. Request volume grew rapidly as a result, with peak TPS exceeding 20,000 during holidays.

2. Elasticsearch High‑Availability Solution

2.1 Dual‑Center Primary‑Backup Cluster

We deploy the primary ES cluster in Data Center A and a backup cluster in Data Center B. Reads and writes go to the primary; data is synchronized to the backup via MQ. In case of primary failure, the system switches reads/writes to the backup with minimal downtime, then syncs data back once the primary recovers.
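The switch-over flow can be sketched roughly as follows. This is a minimal in-memory illustration, not the production code: `FakeCluster`, `DualCenterStore`, and the `deque` standing in for the MQ are all hypothetical names introduced here.

```python
from collections import deque


class FakeCluster:
    """In-memory stand-in for an ES cluster (hypothetical, illustration only)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.docs[doc_id] = doc

    def get(self, doc_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.docs.get(doc_id)


class DualCenterStore:
    """Serves reads and writes from the active cluster and queues changes for the other."""

    def __init__(self, primary, backup):
        self.active, self.standby = primary, backup
        self.queue = deque()  # stands in for the MQ

    def write(self, doc_id, doc):
        self.active.index(doc_id, doc)
        self.queue.append((doc_id, doc))  # replicated asynchronously

    def read(self, doc_id):
        return self.active.get(doc_id)

    def drain(self):
        """Consumer side: apply queued writes to the standby while it is healthy."""
        while self.queue and self.standby.healthy:
            doc_id, doc = self.queue.popleft()
            self.standby.index(doc_id, doc)

    def fail_over(self):
        """Swap roles; later writes are queued for the failed cluster."""
        self.active, self.standby = self.standby, self.active
```

Because the queue keeps accumulating writes while one side is down, calling `drain()` after recovery is what "syncs data back" in this sketch.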

2.2 Three‑Cluster Traffic Isolation

We categorize requests into two groups: high-priority order-related requests and high-TPS marketing requests. A dedicated ES cluster handles marketing traffic, isolating it from the primary cluster to protect the order flow. Together with the primary and backup clusters, this gives three clusters in total.
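As a small illustration of the routing decision, the sketch below maps a request category to a cluster. The cluster names and the `RequestKind` enum are hypothetical; the article does not say how requests are classified in practice.

```python
from enum import Enum


class RequestKind(Enum):
    ORDER = "order"          # high-priority, on the order flow
    MARKETING = "marketing"  # high-TPS, tolerant of degradation


# Hypothetical cluster names, for illustration only.
CLUSTER_FOR = {
    RequestKind.ORDER: "es-primary",
    RequestKind.MARKETING: "es-marketing",
}


def pick_cluster(kind):
    """Return the cluster that should serve a request of the given kind."""
    return CLUSTER_FOR[kind]
```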

2.3 Deep ES Optimizations

Balance shard distribution to avoid hot nodes.

Set thread pool size to "cpu core * 3 / 2 + 1" to prevent CPU spikes.

Keep each shard's storage size at or below 50 GB.

Drop unneeded text fields and keep only keyword fields for member queries.

Use filter instead of query to avoid scoring overhead.

Perform result sorting in the member service JVM.

Introduce routing keys to target specific shards.

After these optimizations, CPU usage dropped dramatically and query latency improved significantly.
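Several of these items can be shown in a few lines. The sketch below assumes the thread pool formula uses integer division, and it builds a plain request body in the standard ES query DSL (a `bool` query with a `filter` clause) plus a `routing` query parameter. The function names and the field names passed in are hypothetical.

```python
def search_pool_size(cpu_cores):
    """Thread pool size from the rule above: cores * 3 / 2 + 1 (integer division assumed)."""
    return cpu_cores * 3 // 2 + 1


def build_member_search(field, value, routing_key):
    """Return (query params, request body) for a filtered, routed lookup."""
    body = {
        "query": {
            "bool": {
                # Filter context: documents either match or not, no scoring.
                "filter": [{"term": {field: value}}]
            }
        }
    }
    # The routing value limits the search to the shard(s) it maps to.
    params = {"routing": routing_key}
    return params, body


def sort_in_service(hits, key):
    """Sort results in the calling service instead of asking ES to sort."""
    return sorted(hits, key=lambda hit: hit[key])
```

For an 8-core node the formula gives 13 threads. Sorting in the service trades a little application CPU for less work on the ES nodes, which is reasonable when result sets are small.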

3. Member Redis Caching Scheme

Initially, the system avoided caching due to near‑real‑time ES performance and the risk of stale data. However, a sudden spike from a blind‑box promotion forced us to introduce caching.

3.1 Solving ES‑to‑Redis Inconsistency

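Because an ES update only becomes visible about one second after it is written, a concurrent query can read the old document and write it back to Redis, leaving stale data in the cache. To prevent this, each ES update takes a 2-second distributed lock in Redis for that member and deletes the member's cache entry. A query that finds the lock held returns the ES result without writing it to the cache, so stale data cannot re-enter Redis while the update is still becoming visible.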
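A minimal sketch of this lock-and-skip pattern follows. `FakeRedis` and `FakeEs` are in-memory stand-ins written for this example, the key names are hypothetical, and the order of steps inside `update_member` is an assumption. `FakeEs.refresh()` models the moment a write becomes visible to searches.

```python
import time


class FakeRedis:
    """Tiny in-memory stand-in for Redis with key expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = self._clock() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        value, expires_at = self._data.get(key, (None, None))
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]
            return None
        return value

    def delete(self, key):
        self._data.pop(key, None)


class FakeEs:
    """Stand-in for ES search: writes are invisible until refresh() is called."""

    def __init__(self):
        self._visible, self._pending = {}, {}

    def index(self, doc_id, doc):
        self._pending[doc_id] = doc

    def refresh(self):
        self._visible.update(self._pending)
        self._pending.clear()

    def search(self, doc_id):
        return self._visible.get(doc_id)


LOCK_TTL_SECONDS = 2  # from the text: longer than the ~1 s visibility delay


def update_member(redis, es, member_id, doc):
    redis.set(f"lock:{member_id}", "1", ttl=LOCK_TTL_SECONDS)  # mark update in flight
    es.index(member_id, doc)
    redis.delete(f"member:{member_id}")  # drop the now-outdated cache entry


def query_member(redis, es, member_id):
    cached = redis.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es.search(member_id)
    # Only repopulate the cache when no update is in flight for this member.
    if doc is not None and redis.get(f"lock:{member_id}") is None:
        redis.set(f"member:{member_id}", doc)
    return doc
```

The lock here is used as a marker rather than for mutual exclusion: readers never block, they just decline to cache while it exists.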

3.2 Dual‑Center Multi‑Cluster Redis

We deploy Redis clusters in both Data Center A and B. Writes are performed to both clusters; reads are served locally to reduce latency. This ensures service continuity even if one data center fails.
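The write-both, read-local pattern might look like the sketch below. `DictCache` and `DualCenterCache` are hypothetical names, and reporting failed data centers back to the caller is an assumption about how a partial write would be handled.

```python
class DictCache:
    """In-memory stand-in for one data center's Redis cluster (illustration only)."""

    def __init__(self):
        self.data = {}
        self.healthy = True

    def set(self, key, value):
        if not self.healthy:
            raise ConnectionError("cluster unavailable")
        self.data[key] = value

    def get(self, key):
        if not self.healthy:
            raise ConnectionError("cluster unavailable")
        return self.data.get(key)


class DualCenterCache:
    def __init__(self, clusters, local_dc):
        self.clusters = clusters  # data center name -> cache client
        self.local_dc = local_dc

    def write(self, key, value):
        """Write to every data center; return the names of those that failed."""
        failed = []
        for dc, client in self.clusters.items():
            try:
                client.set(key, value)
            except ConnectionError:
                failed.append(dc)
        return failed

    def read(self, key):
        """Read from the local data center only, to keep latency low."""
        return self.clusters[self.local_dc].get(key)
```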

4. High‑Availability Member Primary Database

4.1 MySQL Dual‑Center Partition Cluster

We migrated from a single SQL Server instance holding more than a billion rows to a dual-center MySQL cluster with >1,000 shards, each holding ~1 million rows. The cluster uses a 1-master-3-slave architecture, with the master in Data Center A and the slaves in Data Center B, synchronized over a dedicated link with sub-millisecond latency.
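To illustrate how a request finds its shard, the sketch below hashes a member ID to a shard index. The article does not state the sharding key or function, so the CRC32-modulo scheme and the shard count of 1,024 are assumptions chosen only to make the example concrete.

```python
import zlib

SHARD_COUNT = 1024  # assumption: the text only says ">1,000 shards"


def shard_for(member_id):
    """Map a member ID to a shard index with a stable hash (hypothetical scheme)."""
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT
```

A stable hash matters here: Python's built-in `hash()` is randomized per process for strings, so it would send the same member to different shards on different servers.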

Stress tests showed >20k TPS with average latency under 10 ms.

4.2 Smooth Migration Strategy

We first ran a full data sync from SQL Server to MySQL during a low-traffic period, then enabled real-time dual writes to both databases, backed by retry and logging mechanisms. Read traffic was then shifted to MySQL gradually through a gray release (A/B testing), which let us verify consistency between the two stores before switching all reads to MySQL.
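The dual-write and gray-release steps can be sketched as follows. Everything here is illustrative: the stores are dict-like objects, the retry count is arbitrary, and comparing both results during gray reads (falling back to the old store on a mismatch) is an assumption about how consistency was checked.

```python
import logging
import zlib

log = logging.getLogger("migration")


def dual_write(member_id, row, old_db, new_db, retries=3):
    """Write to the old store first, then to the new store with retries."""
    old_db[member_id] = row  # the old store stays the source of truth
    for attempt in range(1, retries + 1):
        try:
            new_db[member_id] = row
            return True
        except ConnectionError as exc:
            log.warning("new-store write failed (attempt %d): %s", attempt, exc)
    log.error("giving up on %s; needs manual repair", member_id)
    return False


def in_gray_group(member_id, percent):
    """Deterministically place a member in the first `percent` of 100 buckets."""
    return zlib.crc32(member_id.encode("utf-8")) % 100 < percent


def gray_read(member_id, old_db, new_db, percent, mismatches):
    """Serve from the new store for the gray group, but only when both stores agree."""
    old_row = old_db.get(member_id)
    if not in_gray_group(member_id, percent):
        return old_row
    new_row = new_db.get(member_id)
    if new_row != old_row:
        mismatches.append(member_id)  # record for later repair
        return old_row
    return new_row
```

Raising `percent` from 0 to 100 in steps, while watching the `mismatches` list stay empty, is the gradual switch in miniature.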

4.3 MySQL‑ES Primary‑Backup Cluster

To guard against failures of the data access layer (DAL) component, we also replicate member data to an ES cluster. If MySQL or the DAL fails, reads and writes can be switched to ES, and the data is synchronized back once MySQL recovers.

5. Abnormal Member Relationship Governance

We identified and fixed rare cases where user accounts became incorrectly bound across platforms, which could lead to privacy breaches and financial loss. Complex detection logic and code‑level safeguards were added to eliminate such anomalies.

6. Outlook: Finer‑Grained Flow‑Control and Degradation

6.1 Refined Flow‑Control

Hotspot limiting for accounts generating excessive duplicate requests.

Per‑caller flow rules to prevent bugs that cause request storms.

Global limits to reject traffic beyond the system’s 30k TPS capacity, protecting core services.
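The three levels of limiting can be layered as in the sketch below. It uses a simple fixed one-second window counter, which is a swapped-in technique for illustration; the article does not name the algorithm or framework. The per-account and per-caller limits are hypothetical defaults, while the 30,000 global limit comes from the text.

```python
import time
from collections import defaultdict


class WindowLimiter:
    """Fixed one-second window counter, keyed by an arbitrary string."""

    def __init__(self, limit_per_second, clock=time.monotonic):
        self.limit = limit_per_second
        self._clock = clock
        self._window = None
        self._counts = defaultdict(int)

    def allow(self, key="global"):
        window = int(self._clock())
        if window != self._window:  # a new second has started: reset counters
            self._window = window
            self._counts.clear()
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


class FlowControl:
    def __init__(self, account_limit=20, caller_limit=5000,
                 global_limit=30000, clock=time.monotonic):
        self.per_account = WindowLimiter(account_limit, clock)  # hotspot limiting
        self.per_caller = WindowLimiter(caller_limit, clock)    # per-caller rule
        self.overall = WindowLimiter(global_limit, clock)       # system capacity

    def allow(self, caller, account_id):
        # Checked narrowest first. A request rejected at a later level has
        # already been counted at the earlier ones, which is fine for a sketch.
        return (self.per_account.allow(account_id)
                and self.per_caller.allow(caller)
                and self.overall.allow())
```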

6.2 Refined Degradation

Degrade based on average response time of dependent services.

Degrade when exception count or ratio exceeds thresholds within a minute.
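Both degradation triggers can be evaluated over a sliding window, as in this sketch. The class name, the thresholds, and the `min_calls` guard (which avoids degrading on a handful of samples) are assumptions; only the one-minute window and the two trigger types come from the text.

```python
import time
from collections import deque


class DegradeRule:
    """Decides whether to degrade a dependency based on recent calls."""

    def __init__(self, max_avg_ms, max_error_ratio, window_seconds=60,
                 min_calls=20, clock=time.monotonic):
        self.max_avg_ms = max_avg_ms
        self.max_error_ratio = max_error_ratio
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self._clock = clock
        self._calls = deque()  # (timestamp, latency_ms, failed)

    def record(self, latency_ms, failed=False):
        self._calls.append((self._clock(), latency_ms, failed))

    def _trim(self):
        """Drop calls that have fallen out of the window."""
        cutoff = self._clock() - self.window_seconds
        while self._calls and self._calls[0][0] <= cutoff:
            self._calls.popleft()

    def should_degrade(self):
        self._trim()
        total = len(self._calls)
        if total < self.min_calls:
            return False
        avg_ms = sum(call[1] for call in self._calls) / total
        error_ratio = sum(1 for call in self._calls if call[2]) / total
        return avg_ms > self.max_avg_ms or error_ratio > self.max_error_ratio
```

A caller would check `should_degrade()` before invoking the dependency and return a fallback (for example, a cached or default response) when it is true.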

We also plan to audit every account that calls the member system and assign each one appropriate flow-control and degradation policies.
