How We Achieved 20k TPS High‑Availability for a Billion‑User Membership System

This article details the design and implementation of a highly available, high‑performance membership system that serves over a billion users, covering Elasticsearch dual‑center HA, traffic‑isolated clusters, Redis caching, MySQL dual‑center partitioning, seamless migration, and refined flow‑control and degradation strategies.

MaGe Linux Operations

Background

The membership system is a core service for all business lines; any failure blocks order placement across the company. After the merger of two companies, the system must support cross‑platform member queries and handle peak traffic exceeding 20,000 TPS.

Elasticsearch High‑Availability

We deployed a dual‑center primary‑backup ES cluster: the primary cluster runs in data center A and the backup in data center B. Writes go to the primary, and the data is synchronized to the backup via MQ. If the primary fails, traffic is switched to the backup with minimal downtime; once the primary recovers, the data written to the backup in the meantime is synchronized back to it.
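
The primary/backup flow above can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: the names FakeCluster and DualCenterRouter, the deque that plays the role of the MQ, and the manual fail_over switch are all hypothetical, since the article does not describe the actual implementation.

```python
from collections import deque


class FakeCluster:
    """In-memory stand-in for an ES cluster (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.docs[doc_id] = doc

    def get(self, doc_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.docs.get(doc_id)


class DualCenterRouter:
    """Writes go to the active cluster; changes reach the standby via a queue."""

    def __init__(self, primary, backup):
        self.active, self.standby = primary, backup
        self.mq = deque()  # stands in for the MQ between the data centers

    def write(self, doc_id, doc):
        self.active.index(doc_id, doc)
        self.mq.append((doc_id, doc))  # published for asynchronous replication

    def drain_mq(self):
        """Consumer side: apply queued changes to the standby cluster."""
        while self.mq and self.standby.healthy:
            doc_id, doc = self.mq.popleft()
            self.standby.index(doc_id, doc)

    def read(self, doc_id):
        return self.active.get(doc_id)

    def fail_over(self):
        """Swap roles; the old primary catches up from the queue once it recovers."""
        self.active, self.standby = self.standby, self.active


# Usage: replicate one write, lose data center A, keep serving from B.
primary, backup = FakeCluster("dc-a"), FakeCluster("dc-b")
router = DualCenterRouter(primary, backup)
router.write("m1", {"level": "gold"})
router.drain_mq()
primary.healthy = False
router.fail_over()
```

Anything still sitting in the queue at the moment of failover is replication lag, which a real system would have to reconcile separately.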

Traffic Isolation with Three‑Cluster Architecture

To protect the main ES cluster from marketing spikes, we added a third ES cluster dedicated to high‑TPS promotional traffic, isolating that traffic from the primary and backup clusters that serve regular member queries.
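
One way to picture the isolation is a small routing function that picks a cluster by caller. The endpoint URLs and caller names below are made up for illustration; the article does not say how requests are classified.

```python
# Hypothetical cluster endpoints; the article does not name them.
CLUSTERS = {
    "primary": "http://es-main.dc-a.internal:9200",
    "backup": "http://es-main.dc-b.internal:9200",
    "promo": "http://es-promo.internal:9200",
}

# Callers assumed to generate marketing spikes (illustrative names).
PROMO_CALLERS = {"flash-sale", "blind-box", "coupon-campaign"}


def pick_cluster(caller: str, primary_healthy: bool = True) -> str:
    """Return the endpoint a request should use, based on who is calling."""
    if caller in PROMO_CALLERS:
        return CLUSTERS["promo"]  # spikes never touch the main cluster
    return CLUSTERS["primary" if primary_healthy else "backup"]
```

Because the decision depends only on the caller, a promotional burst can saturate its own cluster without affecting order-path queries.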

Deep ES Optimizations

Balanced shard distribution to avoid hotspot nodes.

Adjusted thread‑pool size to cpu_core * 3 / 2 + 1 to prevent CPU spikes.

Limited shard size to ≤50 GB.

Removed unnecessary text fields, keeping only keyword fields, which are all that exact‑match member lookups need.

Used filter context instead of query context for searches that do not need relevance scores, so ES skips score calculation.

Off‑loaded sorting to the application JVM.

Added routing keys so that a query goes only to the shard holding the member's data instead of fanning out to every shard.

These changes reduced CPU usage and improved query latency dramatically.
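
A few of the optimizations above can be shown in a short sketch: the thread-pool formula, a filter-context lookup with a routing key, and application-side sorting. Python is used here only for brevity (the article's application runs on the JVM), and the field name member_id and the choice of the member id as routing key are assumptions.

```python
def pool_size(cpu_cores: int) -> int:
    """Thread-pool size from the rule quoted above: cpu_core * 3 / 2 + 1."""
    return cpu_cores * 3 // 2 + 1


def member_lookup(member_id: str) -> dict:
    """Build search arguments: filter context (no scoring) plus a routing key."""
    return {
        "routing": member_id,  # assumption: documents are routed by member id
        "body": {
            "query": {
                "bool": {
                    # a `term` clause inside `filter` matches without scoring
                    "filter": [{"term": {"member_id": member_id}}]
                }
            }
        },
    }


def sort_in_app(hits: list, key: str) -> list:
    """Sort hits in the application instead of asking ES to sort them."""
    return sorted(hits, key=lambda h: h[key], reverse=True)
```

For example, an 8-core node gets 8 * 3 // 2 + 1 = 13 threads under this rule.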

Redis Caching Solution

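Initially we avoided caching because member data must be consistent in real time, but a high‑traffic blind‑box event forced us to adopt a cache. To keep the cache consistent, each ES update first takes a 2‑second distributed lock and then deletes the Redis entry for that member. ES is near‑real‑time, so a query issued right after an update can still return the old document; while the lock is held, such a result is not written back to the cache, which prevents stale data from reappearing in Redis.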
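
The locking scheme can be sketched as follows. FakeRedis is a tiny in-memory stand-in with key expiry, a plain dict stands in for ES, and the key names lock:<id> and member:<id> are assumptions made for this example rather than details from the article.

```python
import time


class FakeRedis:
    """Tiny in-memory stand-in for Redis with key expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires = self.clock() + ttl if ttl is not None else None
        self.data[key] = (value, expires)

    def get(self, key):
        value, expires = self.data.get(key, (None, None))
        if expires is not None and self.clock() >= expires:
            self.data.pop(key, None)
            return None
        return value

    def delete(self, key):
        self.data.pop(key, None)


LOCK_TTL = 2  # seconds, the lock duration given in the article


def update_member(redis, es, member_id, doc):
    redis.set(f"lock:{member_id}", 1, ttl=LOCK_TTL)  # block cache refills
    es[member_id] = doc                              # update the data in ES
    redis.delete(f"member:{member_id}")              # drop the stale entry


def get_member(redis, es, member_id):
    cached = redis.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es.get(member_id)
    # Refill the cache only when no update lock exists for this member.
    if doc is not None and redis.get(f"lock:{member_id}") is None:
        redis.set(f"member:{member_id}", doc)
    return doc
```

Reads during the lock window still succeed; they simply bypass the cache until the lock expires.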

Redis High‑Availability

A dual‑center multi‑cluster Redis setup replicates writes to both data centers; reads are served locally to minimize latency. If one data center fails, the other continues serving member data.
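
A minimal sketch of "write to both, read locally" might look like this. The class name, the data-center labels, and the dicts standing in for Redis clusters are all hypothetical.

```python
class DualCenterCache:
    """Write to every data center, read from the local one (sketch)."""

    def __init__(self, local: str, stores: dict):
        self.local = local    # e.g. "dc-a"
        self.stores = stores  # data-center name -> dict standing in for a Redis cluster
        self.down = set()     # data centers currently marked unavailable

    def set(self, key, value):
        for dc, store in self.stores.items():
            if dc not in self.down:
                store[key] = value

    def get(self, key):
        # Prefer the local data center; fall back to any surviving one.
        order = [self.local] + [dc for dc in self.stores if dc != self.local]
        for dc in order:
            if dc not in self.down:
                return self.stores[dc].get(key)
        raise ConnectionError("no Redis data center available")
```

The sketch skips what happens to writes missed by a data center while it was down; a real deployment needs a catch-up step for that.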

High‑Availability Member Master Database

Member registration data moved from SQL Server to a dual‑center partitioned MySQL cluster (over 1,000 shards, 1 million rows each). The primary resides in data center A, with three replicas in data center B, synchronized over a dedicated line with sub‑millisecond latency.
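
With about a million rows per shard, a thousand shards hold on the order of a billion rows, which matches the user base. The sketch below shows one common way to pick a shard from a member id. The shard count of 1024, the CRC32 hash, and the table naming scheme are assumptions; the article only says "over 1,000 shards".

```python
import zlib

SHARD_COUNT = 1024  # assumption: the article only says "over 1,000 shards"


def shard_for(member_id: str) -> int:
    """Map a member id to a shard with a stable hash (CRC32, as an example)."""
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def table_for(member_id: str) -> str:
    """Hypothetical naming scheme: one table per shard."""
    return f"member_{shard_for(member_id):04d}"
```

A stable hash matters here: the same member id must map to the same shard on every application server and across restarts.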

Seamless Migration from SQL Server to MySQL

We ran a full data sync during a low‑traffic period, enabled real‑time dual‑write, and used incremental sync to cover the changes made between the full sync and the start of dual‑write. A gray (gradual) rollout then shifted read traffic from SQL Server to MySQL step by step, with automated verification that both databases returned consistent query results.
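
The dual-write and gray-rollout steps can be sketched as below. This is an illustration only: the class name, the percentage-based bucketing, and the choice to compare both stores on every read are assumptions, and plain dicts stand in for the two databases.

```python
import zlib


class MigratingStore:
    """Dual-write to both stores; shift a percentage of reads to the new one."""

    def __init__(self, old: dict, new: dict, read_percent_new: int = 0):
        self.old, self.new = old, new
        self.read_percent_new = read_percent_new  # 0..100, raised step by step
        self.mismatches = []                      # ids whose two copies differ

    def write(self, member_id: str, row: dict):
        self.old[member_id] = row  # old store remains the source of truth
        self.new[member_id] = row  # real-time dual-write

    def _use_new(self, member_id: str) -> bool:
        # Stable bucketing so a given member always takes the same path.
        return zlib.crc32(member_id.encode("utf-8")) % 100 < self.read_percent_new

    def read(self, member_id: str):
        old_row, new_row = self.old.get(member_id), self.new.get(member_id)
        if old_row != new_row:
            self.mismatches.append(member_id)  # feeds the consistency report
            return old_row                     # never serve an unverified row
        return new_row if self._use_new(member_id) else old_row
```

Raising read_percent_new from 0 to 100 in small steps, while watching the mismatch list stay empty, corresponds to the gradual cutover described above.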

Fallback to Elasticsearch on DAL Failure

If the DAL (data access layer) component or MySQL fails, reads and writes can be switched to Elasticsearch; once the primary database recovers, the data written in the meantime is synchronized back to MySQL.
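
A rough sketch of this fallback, with dicts standing in for MySQL and ES. The FallbackStore name, the mysql_up flag, and the in-memory list of pending writes are assumptions; the article does not explain how the switch or the sync-back is implemented.

```python
class FallbackStore:
    """Use MySQL normally; switch to ES on failure and replay writes later."""

    def __init__(self, mysql: dict, es: dict):
        self.mysql, self.es = mysql, es
        self.mysql_up = True
        self.pending = []  # writes accepted by ES while MySQL was unavailable

    def write(self, member_id, row):
        self.es[member_id] = row  # assumption: ES always receives the write
        if self.mysql_up:
            self.mysql[member_id] = row
        else:
            self.pending.append((member_id, row))

    def read(self, member_id):
        source = self.mysql if self.mysql_up else self.es
        return source.get(member_id)

    def recover(self):
        """Replay buffered writes in order, then route traffic back to MySQL."""
        for member_id, row in self.pending:
            self.mysql[member_id] = row
        self.pending.clear()
        self.mysql_up = True
```

Replaying in the original order keeps the last write for each member as the final value in MySQL.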

Abnormal Member Account Governance

Complex bugs caused cross‑account binding errors; we identified and patched these at the code level, preventing severe customer impact.

Refined Flow‑Control and Degradation Strategies

We implemented hotspot limiting, per‑caller flow rules, and global throttling to protect the system from extreme traffic bursts. Degradation mechanisms trigger based on average response time, error count, or error ratio, automatically circuit‑breaking affected services.
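
Two of these mechanisms, a per-caller rate limit and an error-ratio circuit breaker, can be sketched as follows. The token-bucket algorithm and every threshold shown are assumptions picked for illustration; the article does not state which algorithms or limits are used.

```python
import time


class TokenBucket:
    """Per-caller rate limit: `rate` requests per second, bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens, self.last = float(burst), clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ErrorRatioBreaker:
    """Open the circuit when the error ratio since the last reset is too high."""

    def __init__(self, max_ratio=0.5, min_calls=10):
        self.max_ratio, self.min_calls = max_ratio, min_calls
        self.calls = self.errors = 0

    def record(self, ok: bool):
        self.calls += 1
        self.errors += 0 if ok else 1

    def reset(self):
        self.calls = self.errors = 0

    @property
    def open(self) -> bool:
        if self.calls < self.min_calls:
            return False  # too few samples to judge
        return self.errors / self.calls > self.max_ratio
```

Keeping one bucket per caller (for example in a dict keyed by caller name) gives the per-caller rules, while a single shared bucket in front of them acts as the global throttle.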

Future Outlook

Further work will focus on even finer‑grained flow control and degradation policies to handle evolving traffic patterns and ensure resilient operation.
