How We Achieved 20k TPS High‑Availability for a Billion‑User Membership System

This article details the design and implementation of a highly available, high‑performance membership system serving billions of users, covering Elasticsearch dual‑center clusters, traffic‑isolated architectures, deep ES optimizations, Redis caching with distributed locks, dual‑center MySQL partitioning, migration strategies, abnormal account handling, and future fine‑grained flow‑control and degradation policies.

Java Interview Crash Guide
Java Interview Crash Guide
Java Interview Crash Guide
How We Achieved 20k TPS High‑Availability for a Billion‑User Membership System

Background

The membership system is a core service tightly coupled with the order flow of all business lines; any failure blocks user orders across the company. After the merger of Tongcheng and eLong, the system must support cross‑platform member queries and handle peak traffic exceeding 20,000 TPS.

ES High‑Availability Solution

1. Dual‑center master‑slave ES cluster

Two data centers (A and B) host the primary and backup ES clusters. Writes go to the primary cluster in A and are synchronized to the backup in B via MQ. In case of primary failure, reads/writes switch to the backup with minimal downtime, then sync back after recovery.

2. ES traffic‑isolation three‑cluster architecture

To protect the main ES cluster from marketing‑spike traffic, a separate ES cluster handles high‑TPS marketing requests, isolating them from the primary cluster that serves order‑critical queries.

3. Deep ES optimization

Uneven shard distribution causing hot nodes.

Thread‑pool size set too high, leading to CPU spikes.

Shard memory >100 GB, slowing queries; keep each shard ≤50 GB.

Dual field (text + keyword) doubling storage; use keyword only.

Use filter instead of query for non‑scoring member lookups.

Move result sorting to the member service JVM.

Add routing keys to limit shard queries.

Member Redis Cache Solution

Initially, the system avoided caching due to real‑time consistency concerns. After a high‑traffic blind‑box event, a Redis cache was introduced with a 2‑second distributed lock to prevent stale data during Elasticsearch’s near‑real‑time delay.

Potential race conditions between cache deletion and update are resolved by making the lock and cache‑update mutually exclusive.

Redis high‑availability uses a dual‑center multi‑cluster setup: each data center runs a Redis cluster; writes are performed to both, reads are served locally to reduce latency. If one center fails, the other continues serving.

High‑Availability Member Primary Database

1. MySQL dual‑center partition cluster

Member data (over 10 billion rows) is sharded into >1,000 partitions (~1 million rows each). The cluster follows a 1‑master‑3‑slave architecture, with the master in data center A and slaves in B, synchronized over a dedicated link with sub‑millisecond latency.

Performance tests show >20 k TPS with average latency under 10 ms.

2. Smooth migration from SQL Server to MySQL

The migration adopts full‑sync, incremental sync, and real‑time gray‑scale traffic switching. Dual‑write ensures data is written to both databases; failures trigger retries and logging. Gray‑scale shifts traffic from SQL Server to MySQL gradually, verifying consistency at each step.

3. MySQL and ES master‑backup clusters

If the DAL component fails or MySQL is down, reads/writes can be switched to the ES cluster, with later synchronization back to MySQL once it recovers.

Abnormal Member Relationship Governance

Complex bugs caused cross‑account binding errors, leading to privacy and financial issues. A thorough analysis identified abnormal accounts, and deep code‑level safeguards were added to prevent such mismatches.

Future: Fine‑Grained Flow Control and Degradation Strategies

1. More precise flow‑control

Hot‑spot limiting for accounts generating excessive requests (e.g., fraud).

Per‑caller flow rules to curb buggy client loops.

Global throttling to reject traffic beyond the system’s 30 k TPS capacity, preserving core functionality.

2. More precise degradation

Degrade based on average response time of dependent services.

Degrade when exception count or ratio exceeds thresholds within a minute.

Improving caller‑account management will further enable targeted flow‑control and degradation.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsScalabilityElasticsearchhigh availabilityrediscachingmysql
Java Interview Crash Guide
Written by

Java Interview Crash Guide

Dedicated to sharing Java interview Q&A; follow and reply "java" to receive a free premium Java interview guide.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.