Operations 20 min read

Achieving 20k+ TPS with Dual‑Center ES, Redis & MySQL for a Membership System

This article details how a billion‑user membership platform achieved high performance and high availability by deploying dual‑center Elasticsearch clusters, traffic‑isolated ES architectures, extensive ES optimizations, a Redis caching layer with distributed locking, a dual‑center partitioned MySQL migration, and fine‑grained flow‑control and degradation strategies.

ITFLY8 Architecture Home

Sep 28, 2022

Achieving 20k+ TPS with Dual‑Center ES, Redis & MySQL for a Membership System

Background

The membership system is a core service that directly supports the order flow of all business lines; any failure blocks user ordering across the entire company. After the merger of two companies, multiple platforms (apps, mini‑programs) require a unified member relationship, pushing request volume to over 20,000 TPS during peak holidays.

ES High‑Availability Solution

1. Dual‑center primary‑backup ES cluster: the primary cluster runs in data center A, the backup in data center B, with data synchronized via MQ. If the primary fails, reads and writes are switched to the backup instantly, and after recovery the data is synced back.

2. Three‑cluster traffic isolation: a dedicated ES cluster handles high‑TPS marketing activities, keeping order‑critical traffic on the primary cluster.

3. Deep ES optimizations: balanced shard distribution, proper thread‑pool sizing (no more than cpu_core*3/2+1), shard memory limited to 50 GB, removal of unnecessary text fields, using filter queries instead of query, moving sorting to application memory, and adding routing keys. These changes dramatically reduced CPU usage and improved query latency.

Redis Caching Scheme

Initially the system did not use caching because ES performance was sufficient and data consistency was critical. However, a sudden traffic surge during a ticket blind‑box event prompted the introduction of a Redis cache. Near‑real‑time ES updates cause a 1‑second delay, leading to cache inconsistency; this is solved by applying a 2‑second distributed lock before deleting the cache entry, ensuring that stale data is not written back.

Redis is deployed in a dual‑center multi‑cluster mode with synchronous writes; reads are served locally to minimize latency, providing high availability even if one data center fails.

High‑Availability Member Primary Database

MySQL replaces the original SqlServer, which could no longer store the growing >1 billion member records. A dual‑center partitioned MySQL cluster with over 1,000 shards (≈1 million rows per shard) is used, employing a 1‑master‑3‑slave setup per data center. Writes go to the master in data center A, reads are served locally, achieving sub‑millisecond latency.

The migration follows a three‑stage approach: full data sync, incremental sync, and real‑time dual‑write. During the trial period, writes go to SqlServer while an asynchronous thread writes to MySQL with retries; after stability is confirmed, traffic is gradually shifted to MySQL using A/B gray‑release with validation of query results.

Failover strategy: if the DAL component or MySQL fails, reads/writes can be switched to Elasticsearch, and once the database recovers, data is synchronized back.

Abnormal Member Relationship Governance

Complex bugs could bind a user's app account to another user's WeChat account, exposing orders across accounts. A thorough analysis and code‑level safeguards were implemented to detect and correct such abnormal bindings.

Future Outlook: Fine‑Grained Flow Control and Degradation

Three levels of flow control are planned: hotspot control for accounts with excessive repeated requests, per‑caller rules to prevent bugs from causing traffic spikes, and global limits to reject traffic that exceeds the system’s capacity (e.g., >10 k TPS) while preserving normal traffic.

Degradation strategies include response‑time based circuit breaking, error‑rate based throttling, and stricter management of caller accounts to ensure appropriate flow‑control policies.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

system architecture Scalability

Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.