High‑Availability Architecture for a Membership System: Elasticsearch Dual‑Center Cluster, Redis Caching, and MySQL Migration

This article details the design and implementation of a high‑performance, highly available membership system, covering Elasticsearch dual‑center master‑slave clusters, traffic‑isolated three‑cluster ES architecture, Redis cache strategies, MySQL dual‑center partitioning, seamless migration, abnormal member handling, and fine‑grained flow‑control and degradation policies.

Architecture Digest

1. Background

The membership system is a core service for all business lines; any failure blocks order placement across the company. Since the merger of Tongcheng and eLong, the system has had to support cross‑platform member queries (the app and WeChat mini‑programs), with peak traffic above 20k TPS.

2. Elasticsearch High‑Availability Solution

2.1 ES Dual‑Center Master‑Slave Cluster

Two data centers (A and B) host a primary ES cluster in A and a standby cluster in B. Data is replicated via MQ; in case of primary failure, the membership service switches reads/writes to the standby cluster with minimal downtime.
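
The switch from the primary to the standby cluster can be sketched as a thin wrapper around two query functions. This is only an illustration under stated assumptions: the class name `FailoverClient`, the consecutive-failure threshold, and the use of `ConnectionError` as the failure signal are hypothetical and not taken from the system described here. The sketch covers reads only; writes would need the same switch.

```python
from typing import Callable, Dict, Optional

# A query function maps a member ID to a member document (or None).
Query = Callable[[str], Optional[Dict]]


class FailoverClient:
    """Reads from the primary cluster (A) and switches to the standby (B)
    after a run of consecutive failures. The threshold is hypothetical."""

    def __init__(self, primary: Query, standby: Query, max_failures: int = 3):
        self.primary = primary
        self.standby = standby
        self.max_failures = max_failures
        self.failures = 0
        self.use_standby = False

    def get(self, member_id: str) -> Optional[Dict]:
        if self.use_standby:
            return self.standby(member_id)
        try:
            result = self.primary(member_id)
            self.failures = 0  # a success resets the failure streak
            return result
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.use_standby = True  # all later requests go to B
            return self.standby(member_id)  # serve this request from B
```

Serving the failed request from the standby right away keeps a single primary error from surfacing to callers, while the threshold avoids flipping all traffic on one transient fault.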

2.2 ES Traffic Isolation Three‑Cluster Architecture

Separate ES clusters handle critical order‑flow queries and high‑TPS marketing activities, preventing marketing spikes from affecting the main order flow.
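
One way to picture the isolation is a lookup from traffic class to cluster. The cluster names, the traffic-class labels, and the default policy below are hypothetical; the article does not say how callers are classified.

```python
from typing import Dict

# Hypothetical cluster names; the article does not name its clusters.
CLUSTER_BY_TRAFFIC: Dict[str, str] = {
    "order_flow": "es-order-primary",
    "marketing": "es-marketing",
}


def pick_cluster(traffic_class: str) -> str:
    """Return the cluster that should serve a traffic class.

    Unknown classes go to the marketing cluster, so unclassified load can
    never land on the order-flow cluster (an assumed policy).
    """
    return CLUSTER_BY_TRAFFIC.get(traffic_class, CLUSTER_BY_TRAFFIC["marketing"])
```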

2.3 ES Deep Optimization

Balanced shard distribution to avoid hot nodes.

Thread‑pool size limited to cpu_core * 3 / 2 + 1.

Shard memory kept below 50 GB.

Removed unnecessary text fields, using only keyword.

Used filter context instead of query context for lookups that do not need relevance scoring.

Moved result sorting to the membership service JVM.

Added routing keys to target specific shards.

These optimizations reduced CPU usage and improved query latency dramatically.
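
Three of the items above (the thread-pool cap, filter plus routing, and service-side sorting) can be sketched together. The function names and the field name in the usage note are hypothetical, and integer division in the thread-pool formula is an assumption. The request shape uses the standard Elasticsearch `bool`/`filter`/`term` query and the `routing` search parameter, but no client library is called here.

```python
from typing import Dict, List


def max_thread_pool_size(cpu_cores: int) -> int:
    # cpu_core * 3 / 2 + 1, assuming integer division
    return cpu_cores * 3 // 2 + 1


def build_member_search(field: str, value: str, routing_key: str) -> Dict:
    """Build search arguments for an exact-match lookup.

    The term clause sits in bool.filter, so no relevance score is computed,
    and `routing` limits the search to the shard that owns routing_key.
    """
    return {
        "routing": routing_key,
        "body": {"query": {"bool": {"filter": [{"term": {field: value}}]}}},
    }


def sort_in_service(hits: List[Dict], key: str) -> List[Dict]:
    # Sorting is done in the membership service instead of in ES.
    return sorted(hits, key=lambda hit: hit[key])
```

For example, `build_member_search("mobile", "13800000000", routing_key="13800000000")` yields a request that only touches one shard, provided documents were indexed with the same routing value.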

3. Membership Redis Cache Scheme

3.1 Cache Consistency Under ES Near‑Real‑Time Delay

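Because ES is near‑real‑time (≈1 s delay between a write and its visibility to searches), a race condition could leave stale data in Redis: after an update deletes the cached entry, a query arriving within that window still reads the old document from ES and writes it back to Redis. The solution sets a 2‑second distributed lock on the member when the update happens, before deleting the Redis entry. A query that finds the lock held still returns the ES result but does not write it to Redis, so stale data is never cached.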
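
The lock scheme can be sketched with an in-memory stand-in for Redis. Everything named here (`TtlStore`, `update_member`, `query_member`, the key prefixes) is hypothetical; a real deployment would use Redis keys with an expiry instead of this toy store, and the ES calls are passed in as plain functions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlStore:
    """Tiny in-memory stand-in for Redis: set with optional expiry, get, delete."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._data: Dict[str, Tuple[object, Optional[float]]] = {}

    def set(self, key: str, value: object, ttl: Optional[float] = None) -> None:
        expires = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires)

    def get(self, key: str):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and self._clock() >= expires:
            self._data.pop(key, None)  # entry has expired
            return None
        return value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


LOCK_TTL_SECONDS = 2.0  # longer than the ~1 s ES visibility delay


def update_member(store: TtlStore, es_write: Callable[[str, dict], None],
                  member_id: str, doc: dict) -> None:
    store.set(f"lock:{member_id}", 1, ttl=LOCK_TTL_SECONDS)  # lock first
    es_write(member_id, doc)
    store.delete(f"member:{member_id}")  # then evict the cached entry


def query_member(store: TtlStore, es_read: Callable[[str], dict],
                 member_id: str) -> dict:
    cached = store.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es_read(member_id)
    # While the lock is held, ES may still return the old document,
    # so the result is returned without being written to the cache.
    if store.get(f"lock:{member_id}") is None:
        store.set(f"member:{member_id}", doc)
    return doc
```

The lock lifetime only needs to outlast the ES visibility delay; once it expires, the next query repopulates the cache with the fresh document.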

After applying the cache, hit rates exceeded 90 %, greatly relieving ES pressure.

3.2 Redis Dual‑Center Multi‑Cluster Architecture

Both data centers host a Redis cluster; writes are duplicated to both, reads are served locally. This provides seamless failover if one center goes down.
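
A minimal sketch of dual-write with local reads, using dict-like objects as stand-ins for the two Redis clusters. The class name, the fallback read, and the `pending` retry list are assumptions added for illustration; the article does not describe how failed cross-center writes are handled.

```python
from typing import List, Optional, Tuple


class DualCenterCache:
    """Writes go to both centers; reads prefer the local one."""

    def __init__(self, local, remote):
        self.local = local    # cache in this data center (dict-like stand-in)
        self.remote = remote  # cache in the other data center
        self.pending: List[Tuple[str, str]] = []  # remote writes to retry later

    def set(self, key: str, value: str) -> None:
        self.local[key] = value
        try:
            self.remote[key] = value
        except ConnectionError:
            # Assumed policy: remember the write instead of failing the caller.
            self.pending.append((key, value))

    def get(self, key: str) -> Optional[str]:
        try:
            return self.local.get(key)
        except ConnectionError:
            return self.remote.get(key)  # local center is unreachable
```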

4. High‑Availability Membership Primary Database Scheme

4.1 MySQL Dual‑Center Partitioned Cluster

Member registration data was migrated from a saturated SQL Server database to a dual‑center MySQL partitioned cluster (over 1 000 shards, 1 M rows per shard). The master resides in data center A and the slaves in B, with replication delay under 1 ms.
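
At 1 M rows per shard, 1 000 shards hold on the order of 10^9 rows. The article does not say how a member is mapped to a shard; the sketch below assumes a stable hash of the member ID taken modulo the shard count, with a hypothetical count of 1 024.

```python
import zlib

SHARD_COUNT = 1024  # hypothetical; the text only says "over 1 000 shards"


def shard_for(member_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a member ID to a shard index with a stable hash (CRC32 here)."""
    return zlib.crc32(member_id.encode("utf-8")) % shard_count
```

A stable hash matters because Python's built-in `hash()` for strings is randomized per process, which would send the same member to different shards after a restart.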

Stress tests showed >20k TPS with ~10 ms average latency.

4.2 Seamless Migration Strategy

The migration combined full data sync, real‑time dual‑write, and incremental sync. Traffic was shifted gradually from SQL Server to MySQL using A/B testing, with automated consistency checks and retry logic.
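
The dual-write and gradual shift can be sketched with two dicts standing in for the old and new databases. The percentage bucketing, the class name `MigrationStore`, and the repair rule (the old store wins on a mismatch) are assumptions for illustration, not the article's stated mechanism.

```python
import zlib
from typing import Dict, Optional


def routed_to_mysql(member_id: str, percent: int) -> bool:
    """Stable bucketing: a member always lands in the same bucket, so raising
    `percent` only ever moves members from the old store to the new one."""
    return zlib.crc32(member_id.encode("utf-8")) % 100 < percent


class MigrationStore:
    def __init__(self, old: Dict[str, dict], new: Dict[str, dict], percent: int = 0):
        self.old = old          # stand-in for the old database
        self.new = new          # stand-in for MySQL
        self.percent = percent  # share of read traffic served by MySQL

    def write(self, member_id: str, row: dict) -> None:
        # Real-time dual-write: every change goes to both stores.
        self.old[member_id] = dict(row)
        self.new[member_id] = dict(row)

    def read(self, member_id: str) -> Optional[dict]:
        store = self.new if routed_to_mysql(member_id, self.percent) else self.old
        return store.get(member_id)

    def check_and_repair(self, member_id: str) -> bool:
        """Return True if both stores agree; otherwise copy old -> new."""
        row = self.old.get(member_id)
        if row == self.new.get(member_id):
            return True
        if row is None:
            self.new.pop(member_id, None)
        else:
            self.new[member_id] = dict(row)
        return False
```

Raising `percent` in small steps while the consistency check runs in the background gives a rollback path at every stage: setting it back to 0 returns all reads to the old store.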

4.3 MySQL‑ES Master‑Slave Scheme

In case of DAL component failure or MySQL outage, reads/writes can be switched to ES, with later synchronization back to MySQL.
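
A sketch of the write path under this scheme, assuming an operator-controlled `degraded` flag and an in-memory backlog that is replayed into MySQL on recovery. The class name, the flag, and the backlog are hypothetical; the article only states that traffic can be switched to ES and synchronized back later.

```python
from typing import Callable, List, Tuple

Write = Callable[[str, dict], None]


class MemberWriter:
    def __init__(self, mysql_write: Write, es_write: Write):
        self.mysql_write = mysql_write
        self.es_write = es_write
        self.degraded = False  # set to True while MySQL or the DAL is down
        self.backlog: List[Tuple[str, dict]] = []

    def write(self, member_id: str, row: dict) -> None:
        if self.degraded:
            self.es_write(member_id, row)
            self.backlog.append((member_id, row))  # remember for MySQL
        else:
            self.mysql_write(member_id, row)

    def recover(self) -> int:
        """Replay the backlog into MySQL in order, then resume normal writes."""
        replayed = 0
        while self.backlog:
            member_id, row = self.backlog[0]
            self.mysql_write(member_id, row)  # raises if MySQL is still down
            self.backlog.pop(0)  # drop the entry only after it was written
            replayed += 1
        self.degraded = False
        return replayed
```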

5. Abnormal Member Relationship Governance

Identified and fixed rare cases where cross‑account binding errors caused users to see or modify others' orders, using deep logic checks and code‑level safeguards.

6. Outlook: More Fine‑Grained Flow Control and Degradation

Plans include hotspot‑based throttling for abusive accounts, per‑caller flow‑control rules, global traffic caps, and multi‑level degradation based on response time, error rate, and exception count.
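
Since these are plans and not a finished design, the following is only a sketch of what per-caller limits, a global cap, and multi-level degradation might look like. The fixed one-second window, the caller names in the usage note, and every threshold are hypothetical.

```python
import time
from collections import defaultdict
from typing import Callable, Dict


class FlowControl:
    """Fixed one-second windows: a per-caller limit plus a global cap."""

    def __init__(self, per_caller: Dict[str, int], global_cap: int,
                 clock: Callable[[], float] = time.monotonic):
        self.per_caller = per_caller
        self.global_cap = global_cap
        self.clock = clock
        self.window = -1
        self.counts: Dict[str, int] = defaultdict(int)
        self.total = 0

    def allow(self, caller: str) -> bool:
        window = int(self.clock())
        if window != self.window:  # a new second has started: reset counters
            self.window, self.total = window, 0
            self.counts.clear()
        limit = self.per_caller.get(caller, 0)  # unknown callers get no quota
        if self.total >= self.global_cap or self.counts[caller] >= limit:
            return False
        self.counts[caller] += 1
        self.total += 1
        return True


def degradation_level(avg_rt_ms: float, error_rate: float, exceptions: int) -> int:
    """0 = normal, 1 = partial degradation, 2 = full degradation.
    All thresholds are hypothetical."""
    if avg_rt_ms > 1000 or error_rate > 0.5 or exceptions > 100:
        return 2
    if avg_rt_ms > 200 or error_rate > 0.1 or exceptions > 20:
        return 1
    return 0
```

For example, `FlowControl({"app": 2000, "wechat": 5000}, global_cap=6000)` would reject an "app" request once that caller has used its 2 000 requests in the current second, even if the global cap still has room.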
