How to Build a High‑Performance, Highly Available Membership System with ES, Redis, and MySQL

This article explains how a large‑scale membership system achieves high performance and high availability by using a dual‑center Elasticsearch cluster, traffic‑isolated three‑cluster architecture, Redis caching with dual‑center clusters, and a MySQL partitioned dual‑center setup, while also detailing optimization, migration, and fine‑grained flow‑control strategies.

Java High-Performance Architecture

Background

The membership system is a core service tightly coupled with all business lines' order processes; any failure blocks ordering across the entire company, so it must be high‑performance and highly available.

After the merger of Tongcheng and eLong, the system must serve multiple platforms (apps, mini‑programs) and support cross‑marketing scenarios, leading to massive request volumes (over 20,000 TPS during peak holidays).

ES High‑Availability Solution

1. Dual‑Center Primary‑Backup ES Cluster

Because the unified member data set holds more than a billion records, it is stored in Elasticsearch (ES). The cluster is deployed in two data centers (A and B): the primary cluster runs in A and the backup in B. Writes go to the primary and are replicated to the backup through a message queue (MQ). If the primary fails, traffic is switched to the backup, and once the primary recovers, the data written in the meantime is synced back to it.
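The switch between primary and backup can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the class EsClusterRouter and its methods are hypothetical, and the two Function arguments stand in for real ES clients of the clusters in data centers A and B.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

// Hypothetical router: queries go to the primary ES cluster (data center A)
// and fall back to the backup (data center B) once the primary is marked down.
class EsClusterRouter {
    private final Function<String, String> primary; // stand-in for a client of cluster A
    private final Function<String, String> backup;  // stand-in for a client of cluster B
    private final AtomicBoolean primaryHealthy = new AtomicBoolean(true);

    EsClusterRouter(Function<String, String> primary, Function<String, String> backup) {
        this.primary = primary;
        this.backup = backup;
    }

    // Called by an operator or a health check when cluster A is unavailable.
    void markPrimaryDown() {
        primaryHealthy.set(false);
    }

    // Assumption: called only after the backup's data has been synced back to A.
    void markPrimaryRecovered() {
        primaryHealthy.set(true);
    }

    boolean isPrimaryHealthy() {
        return primaryHealthy.get();
    }

    String query(String memberId) {
        if (primaryHealthy.get()) {
            try {
                return primary.apply(memberId);
            } catch (RuntimeException e) {
                // Simplification: a single failure flips the switch.
                // A real system would likely require several failed probes.
                primaryHealthy.set(false);
            }
        }
        return backup.apply(memberId);
    }
}
```

Traffic returns to the primary only through an explicit markPrimaryRecovered() call, which mirrors the requirement that data be resynced before switching back.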

2. ES Traffic‑Isolation Three‑Cluster Architecture

Requests are classified into two priority groups: order‑critical queries (high priority) and marketing‑driven high‑TPS queries (lower priority). A separate ES cluster handles the high‑TPS marketing traffic, isolating it from the primary cluster to protect the order flow.
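As a rough sketch of this isolation, a request's priority can decide which cluster it is sent to. The cluster names and the TrafficIsolation class below are hypothetical placeholders; only the two priority groups come from the description above.

```java
// Hypothetical cluster selection by request priority.
class TrafficIsolation {
    enum Priority { ORDER_CRITICAL, MARKETING }

    // Cluster names are illustrative assumptions, not the real deployment names.
    static String clusterFor(Priority priority, boolean primaryHealthy) {
        if (priority == Priority.MARKETING) {
            // High-TPS marketing traffic never touches the order-path clusters.
            return "es-marketing";
        }
        // Order-critical traffic uses the primary, or the backup during a failover.
        return primaryHealthy ? "es-primary-A" : "es-backup-B";
    }
}
```

Because marketing queries are pinned to their own cluster, a burst in that traffic can saturate only the marketing cluster, leaving the order flow unaffected.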

3. Deep ES Optimizations

Uneven shard distribution caused hot nodes; rebalancing reduced hotspots.

Thread‑pool size was set too high; adjusted to cpu_core * 3 / 2 + 1 to lower CPU usage.

Shard size was capped at 50 GB, because oversized shards slow queries down.

Removed duplicate text fields, keeping only keyword for member lookups.

Used filter instead of query to skip relevance scoring.

Moved result sorting to the member service JVM.

Added routing keys to target specific shards.

These changes dramatically reduced CPU usage and improved query latency.
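Three of the optimizations above lend themselves to a short illustration: the thread-pool formula, a filter-only query body, and a routed search request. The sketch below only builds values and strings; the index name, field name, and the EsQuerySketch class are assumptions made for the example, and the JSON builder assumes values that need no escaping.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative helpers; index and field names passed in are hypothetical.
class EsQuerySketch {
    // Thread-pool size from the text: cpu_core * 3 / 2 + 1 (integer division).
    static int searchThreads(int cpuCores) {
        return cpuCores * 3 / 2 + 1;
    }

    // A bool query with only a filter clause: matches are not scored,
    // so ES skips relevance computation and can cache the filter.
    // Assumption: field and value contain no characters that need JSON escaping.
    static String filterBody(String field, String value) {
        return "{\"query\":{\"bool\":{\"filter\":[{\"term\":{\""
                + field + "\":\"" + value + "\"}}]}}}";
    }

    // The routing parameter makes ES search only the shard that owns the key,
    // instead of fanning the query out to every shard.
    static String searchPath(String index, String routingKey) {
        return "/" + index + "/_search?routing="
                + URLEncoder.encode(routingKey, StandardCharsets.UTF_8);
    }
}
```

For example, searchThreads(16) gives 25 threads on a 16-core node. Routing helps only if documents were indexed with the same routing key that the query supplies.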

After the Redis cache described in the next section was added, its hit rate exceeded 90%, which further relieved the pressure on ES.

Member Redis Cache Solution

Initially the system did not cache because ES performance was sufficient and data consistency was critical. However, a sudden spike during a ticket blind‑box event prompted the introduction of a cache.

1. Redis Dual‑Center Multi‑Cluster Architecture

Two Redis clusters are deployed in data centers A and B. Writes are performed on both (dual‑write) and succeed only when both succeed. Reads are served locally to minimize latency. This ensures service continuity even if one data center fails.
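The dual-write rule can be sketched as below. The two Map instances stand in for the Redis clusters in data centers A and B, and the DualCenterCache class is hypothetical. Removing the key from both clusters after a failed write is an assumption added here so that the two sides do not diverge; the source only states that a write succeeds when both clusters accept it.

```java
import java.util.Map;

// Hypothetical dual-center cache; the maps stand in for two Redis clusters.
class DualCenterCache {
    private final Map<String, String> clusterA;
    private final Map<String, String> clusterB;

    DualCenterCache(Map<String, String> clusterA, Map<String, String> clusterB) {
        this.clusterA = clusterA;
        this.clusterB = clusterB;
    }

    // Returns true only if both clusters accepted the write.
    boolean write(String key, String value) {
        try {
            clusterA.put(key, value);
            clusterB.put(key, value);
            return true;
        } catch (RuntimeException e) {
            // Assumption: best-effort cleanup so neither side keeps a half-written entry.
            removeQuietly(clusterA, key);
            removeQuietly(clusterB, key);
            return false;
        }
    }

    // Reads stay inside the caller's own data center to avoid cross-site latency.
    String readLocal(boolean callerInDataCenterA, String key) {
        return (callerInDataCenterA ? clusterA : clusterB).get(key);
    }

    private static void removeQuietly(Map<String, String> cluster, String key) {
        try {
            cluster.remove(key);
        } catch (RuntimeException ignored) {
            // The cluster is unreachable; nothing more can be done here.
        }
    }
}
```

A caller that receives false from write() would treat the entry as uncached and read from the underlying store on the next request.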

High‑Availability Member Primary‑DB Solution

The member detail data originally lived in a single SQL Server instance that reached physical limits with over a billion rows.

1. MySQL Dual‑Center Partitioned Cluster

Data is sharded into more than 1,000 partitions (~1 million rows each). The cluster uses a 1‑master‑3‑slave topology, with the master in data center A and slaves in B, synchronized over a dedicated link with sub‑millisecond latency. Reads are routed locally, writes go to the master.
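To make the sharding concrete, here is a minimal sketch of mapping a member ID to a partition. The shard count of 1024, the modulo rule, and the member_NNNN table naming are all assumptions for illustration; the source only says there are more than 1,000 partitions of roughly one million rows each.

```java
// Hypothetical shard routing by member ID.
class MemberShardRouter {
    // Assumption: 1024 shards (the source says "more than 1,000").
    static final int SHARD_COUNT = 1024;

    // floorMod keeps the result in [0, SHARD_COUNT) even for negative IDs.
    static int shardOf(long memberId) {
        return (int) Math.floorMod(memberId, (long) SHARD_COUNT);
    }

    // Assumption: one table per shard, named with a zero-padded suffix.
    static String tableOf(long memberId) {
        return String.format("member_%04d", shardOf(memberId));
    }
}
```

With this scheme, tableOf(1025) resolves to member_0001. A fixed modulo is simple but makes later resharding expensive, which is one reason to choose the shard count generously up front.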

2. Smooth Migration from SQL Server to MySQL

The migration is performed without downtime using a three‑stage approach: full data sync, real‑time dual‑write, and gradual traffic gray‑release (A/B testing). Consistency checks compare results from both databases; any discrepancy is logged and resolved before increasing traffic.
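The gray-release and consistency-check stage might look roughly like the sketch below. The MigrationReader class is hypothetical, the two LongFunction arguments stand in for queries against SQL Server and MySQL, and bucketing by memberId modulo 100 is an assumed way to pick the gray traffic. During migration SQL Server is treated as the source of truth, so a mismatch is counted and the old value is returned.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongFunction;

// Hypothetical read path during the SQL Server -> MySQL migration.
class MigrationReader {
    private final LongFunction<String> sqlServer; // stand-in for the old database
    private final LongFunction<String> mysql;     // stand-in for the new database
    private final AtomicLong mismatches = new AtomicLong();
    private volatile int mysqlPercent;            // share of traffic in the gray bucket, 0..100

    MigrationReader(LongFunction<String> sqlServer, LongFunction<String> mysql, int mysqlPercent) {
        this.sqlServer = sqlServer;
        this.mysql = mysql;
        this.mysqlPercent = mysqlPercent;
    }

    // Raise the percentage only after recorded mismatches have been resolved.
    void setMysqlPercent(int percent) {
        this.mysqlPercent = percent;
    }

    long mismatchCount() {
        return mismatches.get();
    }

    String read(long memberId) {
        // Assumption: the gray bucket is chosen by memberId modulo 100.
        boolean inGrayBucket = Math.floorMod(memberId, 100L) < mysqlPercent;
        if (!inGrayBucket) {
            return sqlServer.apply(memberId);
        }
        String newValue = mysql.apply(memberId);
        String oldValue = sqlServer.apply(memberId);
        if (!Objects.equals(newValue, oldValue)) {
            // Record the discrepancy and serve the old database's answer.
            mismatches.incrementAndGet();
            return oldValue;
        }
        return newValue;
    }
}
```

Once the mismatch count stays at zero for a given percentage, the share can be raised step by step until all reads come from MySQL.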

3. MySQL & ES Primary‑Backup Cluster

To guard against failures of the data access layer (DAL) component, member data is also written to ES. If MySQL or the DAL fails, reads can be switched to ES, and the data is resynchronized once MySQL recovers.

Abnormal Member Relationship Governance

Complex bugs can cause cross‑account binding errors, leading to privacy breaches and financial loss. A detailed detection logic and code‑level safeguards were implemented to prevent such anomalies.

Future: Fine‑Grained Flow Control and Degradation

More precise flow‑control rules will target hot‑spot accounts, per‑caller thresholds, and global limits to protect the system from extreme traffic spikes.

1. Fine‑Grained Flow‑Control Strategies

Three layers of limits are planned: hot‑spot throttling for accounts that generate abusive request volumes, per‑caller limits that catch buggy loops in calling services, and a global ceiling that fails fast on any traffic beyond the system's 30k TPS capacity.
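One way to picture the three layers is a limiter that checks an account limit, a caller limit, and a global limit before admitting a request. The sketch below uses fixed one-second windows for simplicity; the LayeredRateLimiter class, its parameters, and the window scheme are assumptions, not a description of the planned implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical three-layer limiter using fixed one-second counting windows.
class LayeredRateLimiter {
    private final int perAccountLimit;
    private final int perCallerLimit;
    private final int globalLimit;
    private final Map<String, Integer> counts = new HashMap<>();
    private long currentSecond = -1;

    LayeredRateLimiter(int perAccountLimit, int perCallerLimit, int globalLimit) {
        this.perAccountLimit = perAccountLimit;
        this.perCallerLimit = perCallerLimit;
        this.globalLimit = globalLimit;
    }

    // Returns true if the request is admitted; rejected requests are not counted.
    synchronized boolean tryAcquire(String accountId, String callerId, long nowMillis) {
        long second = nowMillis / 1000;
        if (second != currentSecond) {
            counts.clear(); // a new window starts: forget the previous second
            currentSecond = second;
        }
        String accountKey = "account:" + accountId;
        String callerKey = "caller:" + callerId;
        String globalKey = "global";
        if (counts.getOrDefault(accountKey, 0) >= perAccountLimit
                || counts.getOrDefault(callerKey, 0) >= perCallerLimit
                || counts.getOrDefault(globalKey, 0) >= globalLimit) {
            return false; // fail fast instead of queueing
        }
        counts.merge(accountKey, 1, Integer::sum);
        counts.merge(callerKey, 1, Integer::sum);
        counts.merge(globalKey, 1, Integer::sum);
        return true;
    }
}
```

In this sketch the global limit would be set to the 30k TPS capacity mentioned above, with much smaller per-account and per-caller values. A fixed window allows short bursts at window boundaries; a token bucket or sliding window would smooth that out at the cost of more bookkeeping.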

2. Fine‑Grained Degradation Strategies

Calls to a downstream service are degraded based on its average response time or error rate: when either exceeds its threshold, the circuit breaker opens automatically.
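A minimal error-rate circuit breaker is sketched below to illustrate the idea. It covers only the error-rate trigger, not average response time, and it omits the half-open probing state that production breakers usually have. The ErrorRateBreaker class, its window of N calls, and the cool-down period are assumptions chosen for the example.

```java
// Hypothetical breaker: evaluates the error rate over every windowSize calls.
class ErrorRateBreaker {
    private final int windowSize;      // number of calls per evaluation window
    private final double maxErrorRate; // e.g. 0.5 means more than 50% errors opens the breaker
    private final long openMillis;     // how long to reject calls once open
    private int calls;
    private int errors;
    private long openedAt = -1;        // -1 means the breaker is closed

    ErrorRateBreaker(int windowSize, double maxErrorRate, long openMillis) {
        this.windowSize = windowSize;
        this.maxErrorRate = maxErrorRate;
        this.openMillis = openMillis;
    }

    // Returns false while the breaker is open; the caller should then degrade,
    // for example by returning a default value instead of calling downstream.
    synchronized boolean allowRequest(long nowMillis) {
        if (openedAt >= 0) {
            if (nowMillis - openedAt < openMillis) {
                return false;
            }
            openedAt = -1; // cool-down is over: close and start counting afresh
            calls = 0;
            errors = 0;
        }
        return true;
    }

    // Records the outcome of a downstream call and trips the breaker if needed.
    synchronized void record(boolean success, long nowMillis) {
        calls++;
        if (!success) {
            errors++;
        }
        if (calls >= windowSize) {
            if ((double) errors / calls > maxErrorRate) {
                openedAt = nowMillis;
            }
            calls = 0;
            errors = 0;
        }
    }
}
```

A response-time trigger could be added the same way by accumulating latencies in record() and comparing their average against a second threshold.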

Source: DBAPlus