High‑Availability Architecture for a Membership System: Elasticsearch Dual‑Center Cluster, Redis Caching, and MySQL Migration
This article details the design and implementation of a high‑performance, highly available membership system, covering Elasticsearch dual‑center master‑slave clusters, traffic‑isolated three‑cluster ES architecture, Redis cache strategies, MySQL dual‑center partitioning, seamless migration, abnormal member handling, and fine‑grained flow‑control and degradation policies.
1. Background
The membership system is a core service for all business lines; any failure blocks order placement across the company. After the merger of Tongcheng and eLong, the system must serve member queries across platforms (the apps and WeChat mini-programs), with peak traffic exceeding 20k TPS.
2. Elasticsearch High‑Availability Solution
2.1 ES Dual‑Center Master‑Slave Cluster
Two data centers (A and B) host a primary ES cluster in A and a standby cluster in B. Data is replicated via MQ; in case of primary failure, the membership service switches reads/writes to the standby cluster with minimal downtime.
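The switch described above can be sketched as a thin wrapper around the two clusters. This is a minimal illustration, not the system's actual mechanism: the class name, the `max_failures` threshold, and the use of plain callables in place of real ES clients are all assumptions, and the source does not say whether the switch is automatic or operator-driven.

```python
from typing import Any, Callable


class FailoverClient:
    """Sends queries to the primary cluster and falls over to the standby."""

    def __init__(self, primary: Callable[[str], Any], standby: Callable[[str], Any],
                 max_failures: int = 3) -> None:
        self.primary = primary
        self.standby = standby
        self.max_failures = max_failures  # hypothetical threshold
        self.failures = 0
        self.using_standby = False

    def query(self, member_id: str) -> Any:
        if self.using_standby:
            return self.standby(member_id)
        try:
            result = self.primary(member_id)
            self.failures = 0  # a success resets the failure streak
            return result
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Stay on the standby until someone switches back explicitly.
                self.using_standby = True
            return self.standby(member_id)  # serve this request from the standby
```

Individual failed requests are still answered from the standby, while the permanent switch only happens after several consecutive failures, so a single timeout does not move all traffic.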
2.2 ES Traffic Isolation Three‑Cluster Architecture
Separate ES clusters handle critical order‑flow queries and high‑TPS marketing activities, preventing marketing spikes from affecting the main order flow.
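A rough sketch of the isolation idea: each request is tagged with a traffic type, and the type alone decides which cluster it may touch. The endpoint URLs and type names are hypothetical, and sending unknown callers to the marketing cluster is an assumption chosen so that untagged traffic can never load the order-flow cluster.

```python
ORDER_FLOW = "order_flow"
MARKETING = "marketing"

# Hypothetical cluster endpoints, for illustration only.
CLUSTERS = {
    ORDER_FLOW: "http://es-order.example.internal:9200",
    MARKETING: "http://es-marketing.example.internal:9200",
}


def pick_cluster(traffic_type: str) -> str:
    """Return the ES endpoint for a request based on its traffic type."""
    # Unknown traffic types default to the marketing cluster (an assumption),
    # keeping the order-flow cluster reserved for tagged order traffic.
    return CLUSTERS.get(traffic_type, CLUSTERS[MARKETING])
```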
2.3 ES Deep Optimization
Balanced shard distribution to avoid hot nodes.
Thread‑pool size limited to cpu_core * 3 / 2 + 1.
Shard memory kept below 50 GB.
Removed unnecessary text fields, using only keyword.
Used filter instead of query for non‑scoring lookups.
Moved result sorting to the membership service JVM.
Added routing keys to target specific shards.
These optimizations reduced CPU usage and improved query latency dramatically.
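Three of the items above lend themselves to a short sketch: the thread-pool formula, a non-scoring lookup written as a filter, and sorting in the calling service. The function names and the `member_id` field are hypothetical; the query body follows the standard ES bool/filter/term structure. A routing key, where used, would be passed separately as the `routing` request parameter.

```python
from typing import Any, Dict, List


def search_pool_size(cpu_cores: int) -> int:
    """Thread-pool size from the rule above: cpu_core * 3 / 2 + 1, rounded down."""
    return cpu_cores * 3 // 2 + 1


def member_lookup(field: str, value: str) -> Dict[str, Any]:
    """Build a lookup whose term clause sits in bool.filter, so no score is computed."""
    return {"query": {"bool": {"filter": [{"term": {field: value}}]}}}


def sort_in_service(hits: List[Dict[str, Any]], key: str) -> List[Dict[str, Any]]:
    """Sort the returned hits in the calling service instead of inside ES."""
    return sorted(hits, key=lambda hit: hit[key])
```

For an 8-core node the formula gives 8 * 3 // 2 + 1 = 13 threads.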
3. Membership Redis Cache Scheme
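Because ES is near-real-time (a write becomes searchable after roughly 1 s), a member update can race with a concurrent query. The update writes ES and deletes the Redis entry; a query that misses the cache within that second reads the old document from ES and writes it back to Redis, leaving stale data cached. The solution is for the update path to take a 2-second distributed lock before deleting the Redis entry. A query that finds the lock held still returns the ES result but does not write it to Redis, so stale data cannot be cached while ES may still lag.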
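The lock-then-delete scheme can be sketched with an in-memory stand-in for Redis. Everything here is illustrative: `TinyCache`, the key names, and the function names are hypothetical, and a real deployment would use a Redis key with an expiry rather than a Python dict.

```python
import time
from typing import Callable, Dict, Optional, Tuple

LOCK_TTL_SECONDS = 2.0  # from the text: longer than the ~1 s ES delay


class TinyCache:
    """In-memory stand-in for Redis with per-key expiry (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def set(self, key: str, value: str, ttl: Optional[float] = None) -> None:
        expires = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires)

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and self._clock() >= expires:
            del self._data[key]  # expired entries behave as missing
            return None
        return value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def on_member_updated(cache: TinyCache, member_id: str) -> None:
    """Update path: take the 2 s lock first, then invalidate the cached entry."""
    cache.set(f"lock:{member_id}", "1", ttl=LOCK_TTL_SECONDS)
    cache.delete(f"member:{member_id}")


def read_member(cache: TinyCache, member_id: str,
                load_from_es: Callable[[str], str]) -> str:
    """Query path: on a miss, read ES but skip the cache write while the lock is held."""
    cached = cache.get(f"member:{member_id}")
    if cached is not None:
        return cached
    value = load_from_es(member_id)
    if cache.get(f"lock:{member_id}") is None:
        cache.set(f"member:{member_id}", value)
    return value
```

While the lock is alive, reads go through to ES on every call, which costs a few extra ES queries per updated member but guarantees that nothing read during the lag window ends up in the cache.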
After applying the cache, hit rates exceeded 90 %, greatly relieving ES pressure.
3.2 Redis Dual‑Center Multi‑Cluster Architecture
Both data centers host a Redis cluster; writes are duplicated to both, reads are served locally. This provides seamless failover if one center goes down.
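As a minimal sketch of "write to both, read locally", the class below fans each write out to every center and serves reads from the local one. The class name is hypothetical, plain dicts stand in for the Redis clusters, and returning the list of centers whose write failed is an assumption; the source does not describe how failed writes are handled.

```python
from typing import Dict, List, MutableMapping, Optional


class DualCenterCache:
    """Writes to every data center's cache and reads only from the local one."""

    def __init__(self, stores: Dict[str, MutableMapping[str, str]], local: str) -> None:
        self.stores = stores
        self.local = local

    def write(self, key: str, value: str) -> List[str]:
        """Write to all centers; return the names of centers whose write failed."""
        failed = []
        for name, store in self.stores.items():
            try:
                store[key] = value
            except ConnectionError:
                failed.append(name)  # caller may retry or repair later
        return failed

    def read(self, key: str) -> Optional[str]:
        """Serve reads from the local center only, avoiding cross-center latency."""
        return self.stores[self.local].get(key)
```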
4. High‑Availability Membership Primary Database Scheme
Member registration data was migrated from a saturated SQL Server instance to a dual-center MySQL partitioned cluster (over 1,000 shards, about 1 million rows per shard). The master resides in data center A and the slaves in B, with replication lag under 1 ms.
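To make the partitioning concrete, the sketch below maps a member ID to one of the shards with a stable hash. The shard count of 1024, the CRC32 hash, and the table-name pattern are all assumptions; the source gives only "over 1,000 shards" and does not name the sharding key or function.

```python
import zlib

SHARD_COUNT = 1024  # hypothetical; the text only says "over 1,000 shards"


def shard_for(member_id: str) -> int:
    """Map a member ID to a shard index with a hash that is stable across processes."""
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def table_for(member_id: str) -> str:
    """Hypothetical table-name pattern for the shard that holds this member."""
    return f"member_{shard_for(member_id):04d}"
```

At about 1 million rows per shard, 1,000 shards hold on the order of a billion member rows.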
Stress tests showed >20k TPS with ~10 ms average latency.
4.2 Seamless Migration Strategy
The migration combined a full data sync, real-time dual-write, and incremental sync. Traffic was then shifted gradually from SQL Server to MySQL using A/B testing, with automated consistency checks and retry logic.
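The three moving parts (dual-write with retry, gradual traffic shift, consistency check) might look roughly like this. All names are hypothetical, the percentage bucket via CRC32 is an assumed way to implement the A/B split, and plain callables and dicts stand in for the two databases.

```python
import zlib
from typing import Any, Callable, Dict, List


def use_mysql(member_id: str, rollout_percent: int) -> bool:
    """Stable bucket in [0, 100): the same member always lands on the same side."""
    bucket = zlib.crc32(member_id.encode("utf-8")) % 100
    return bucket < rollout_percent


def dual_write(record: Dict[str, Any],
               write_sqlserver: Callable[[Dict[str, Any]], None],
               write_mysql: Callable[[Dict[str, Any]], None],
               retry_queue: List[Dict[str, Any]]) -> None:
    """Write to the old store first, then the new one; queue failures for replay."""
    write_sqlserver(record)  # the old store remains authoritative during migration
    try:
        write_mysql(record)
    except ConnectionError:
        retry_queue.append(record)  # picked up later by the incremental sync


def find_mismatches(old_rows: Dict[str, Any], new_rows: Dict[str, Any]) -> List[str]:
    """Consistency check: keys whose value in the new store differs or is missing."""
    return sorted(k for k in old_rows if new_rows.get(k) != old_rows[k])
```

Raising `rollout_percent` from 0 to 100 in steps moves a growing, stable subset of members onto MySQL, and `find_mismatches` can be run between steps to decide whether to proceed.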
4.3 MySQL‑ES Master‑Slave Scheme
In case of DAL component failure or MySQL outage, reads/writes can be switched to ES, with later synchronization back to MySQL.
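One way to picture the switch and the later synchronization is a store that remembers which keys were written to ES only during the outage and replays them on recovery. This is a simplified sketch under assumptions: the class and attribute names are hypothetical, dicts stand in for MySQL and ES, and writing ES directly on every write simplifies whatever sync pipeline the real system uses.

```python
from typing import Dict, List, Optional


class FallbackStore:
    """Uses MySQL normally; switches reads/writes to ES while MySQL is unavailable."""

    def __init__(self, mysql: Dict[str, str], es: Dict[str, str]) -> None:
        self.mysql = mysql
        self.es = es
        self.mysql_available = True
        self.pending: List[str] = []  # keys written to ES only during the outage

    def write(self, key: str, value: str) -> None:
        self.es[key] = value
        if self.mysql_available:
            self.mysql[key] = value
        else:
            self.pending.append(key)  # remember what MySQL has missed

    def read(self, key: str) -> Optional[str]:
        source = self.mysql if self.mysql_available else self.es
        return source.get(key)

    def recover(self) -> None:
        """Replay outage-time writes into MySQL, then switch back to it."""
        for key in self.pending:
            self.mysql[key] = self.es[key]
        self.pending.clear()
        self.mysql_available = True
```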
5. Abnormal Member Relationship Governance
Identified and fixed rare cases where cross‑account binding errors caused users to see or modify others' orders, using deep logic checks and code‑level safeguards.
6. Outlook: More Fine‑Grained Flow Control and Degradation
Plans include hotspot‑based throttling for abusive accounts, per‑caller flow‑control rules, global traffic caps, and multi‑level degradation based on response time, error rate, and exception count.
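Since these are plans rather than a described implementation, the following is only one possible shape for them: a token bucket that could be instantiated per caller or per account, and a function that maps response time and error rate to a degradation level. The class, the thresholds, and the meaning of each level are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def degradation_level(avg_latency_ms: float, error_rate: float) -> int:
    """0 = normal, 1 = shed non-critical callers, 2 = critical order flow only.

    The thresholds are hypothetical placeholders, not values from the article.
    """
    if error_rate >= 0.5 or avg_latency_ms >= 1000:
        return 2
    if error_rate >= 0.1 or avg_latency_ms >= 200:
        return 1
    return 0
```

Keeping one bucket per caller gives per-caller rules, one shared bucket gives a global cap, and a bucket keyed by account ID would throttle a single abusive account without affecting others.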
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
