High‑Availability Architecture for a Membership System: Dual‑Center ES Cluster, Redis Caching, MySQL Migration, and Fine‑Grained Flow Control
This article presents a comprehensive engineering case study of a high‑traffic membership system, detailing the dual‑center Elasticsearch high‑availability design, traffic‑isolated three‑cluster ES architecture, Redis caching strategy, dual‑center MySQL partitioning and migration plan, abnormal member relationship governance, and future fine‑grained flow‑control and downgrade policies.
The membership system is a core service that links all business lines; any outage blocks order placement across the company, so it must guarantee high performance and high availability.
To achieve this, a dual‑center Elasticsearch (ES) primary‑backup cluster is deployed across two data centers (A and B). Reads and writes go to the primary cluster, while data is asynchronously synced to the backup via MQ. In case of primary failure, traffic is switched to the backup cluster within seconds.
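The primary‑backup flow above can be sketched as follows. This is a minimal illustration, not the system's actual code: the `Cluster` and `DualCenterStore` classes are hypothetical, in‑memory dictionaries stand in for the ES clusters, and a `deque` stands in for the MQ.

```python
from collections import deque


class Cluster:
    """Hypothetical in-memory stand-in for one ES cluster."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        self.docs[doc_id] = doc

    def get(self, doc_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.docs.get(doc_id)


class DualCenterStore:
    """Reads and writes hit the active cluster; the backup is fed asynchronously."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary
        self.sync_queue = deque()  # assumption: stands in for the MQ

    def write(self, doc_id, doc):
        self.active.index(doc_id, doc)
        if self.active is self.primary:
            # Queue the change so the backup can catch up later.
            self.sync_queue.append((doc_id, doc))

    def read(self, doc_id):
        return self.active.get(doc_id)

    def drain_sync_queue(self):
        """Consumer side: apply queued changes to the backup cluster."""
        while self.sync_queue:
            doc_id, doc = self.sync_queue[0]
            self.backup.index(doc_id, doc)
            self.sync_queue.popleft()  # remove only after a successful apply

    def fail_over(self):
        """Switch all traffic to the backup cluster."""
        self.active = self.backup
```

In this sketch, any write that is still queued at the moment of failover is missing from the backup until it is replayed, which is the usual trade‑off of asynchronous replication.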
After a traffic surge during a holiday promotion, a three‑cluster ES architecture was introduced: a dedicated high‑TPS cluster for marketing bursts, isolated from the primary ES cluster to protect the main order flow.
Further ES optimizations include balancing shard distribution across nodes, tuning thread‑pool sizes, limiting each shard to 50 GB, removing unnecessary text fields, using filter clauses (which skip relevance scoring) instead of scoring queries, moving sorting to the application layer, and adding routing keys so that a request reaches only the shards that hold the relevant data.
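Three of these optimizations (filter clauses, routing keys, and application‑layer sorting) can be illustrated with a short sketch. The field names `member_id`, `status`, and `points`, and both helper functions, are assumptions made for the example; only the `bool`/`filter` query structure and the `routing` request parameter are standard Elasticsearch features.

```python
def build_member_search(member_id, status):
    """Build a search body that uses filter context, plus request parameters.

    Filter clauses are not scored, so ES skips relevance computation.
    The routing value limits the search to the shard(s) for that key,
    assuming documents were indexed with the same routing value.
    """
    body = {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"member_id": member_id}},  # hypothetical field
                    {"term": {"status": status}},        # hypothetical field
                ]
            }
        }
    }
    params = {"routing": member_id}
    return body, params


def sort_hits(hits, field, descending=True):
    """Sort search hits in the application instead of asking ES to sort."""
    return sorted(hits, key=lambda hit: hit["_source"][field], reverse=descending)
```

The body and parameters would then be passed to whichever ES client the service uses; that call is left out here because the source does not name one.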
Because the original system did not use caching, a Redis cache layer was added. A dual‑center multi‑cluster Redis setup writes to both data centers and reads from the local one, achieving a cache hit rate above 90% and relieving pressure on ES.
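A minimal sketch of the write‑both, read‑local pattern is shown below. The `DualCenterCache` class is hypothetical, plain dictionaries stand in for the two Redis clusters, and `loader` stands in for the ES lookup. Filling both centers after a cache miss is an assumption of this sketch; the source only states that writes go to both centers.

```python
class DualCenterCache:
    """Cache-aside sketch: write to both centers, read from the local one."""

    def __init__(self, local, remote, loader):
        self.local = local    # dict standing in for the local-center Redis
        self.remote = remote  # dict standing in for the other center's Redis
        self.loader = loader  # called on a miss, e.g. a lookup against ES
        self.hits = 0
        self.misses = 0

    def put(self, key, value):
        # Dual write keeps both centers warm, so either one can serve reads.
        self.local[key] = value
        self.remote[key] = value

    def get(self, key):
        if key in self.local:
            self.hits += 1
            return self.local[key]
        self.misses += 1
        value = self.loader(key)
        if value is not None:
            self.put(key, value)  # assumption: a miss fills both centers
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```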
The relational data store was migrated from a single‑instance SQL Server to a dual‑center MySQL partitioned cluster (over 1,000 shards, 1 master + 3 slaves per center). A seamless migration strategy combines a full data sync, real‑time dual writes, and a gradual gray release of read traffic, with validation to ensure consistency.
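The three migration steps can be sketched roughly as follows. Everything here is illustrative: dictionaries stand in for SQL Server (`old`) and MySQL (`new`), and the CRC32 bucketing used to choose which members read from the new store is an assumption, since the source does not say how gray‑release traffic is selected.

```python
import zlib


def backfill(old, new):
    """Full sync: copy existing records without overwriting newer dual writes."""
    for member_id, record in old.items():
        new.setdefault(member_id, record)


def in_gray_release(member_id, percent):
    """Deterministically place a member in one of 100 buckets (assumed scheme)."""
    bucket = zlib.crc32(str(member_id).encode("utf-8")) % 100
    return bucket < percent


class MigratingStore:
    def __init__(self, old, new, percent=0):
        self.old = old          # dict standing in for SQL Server
        self.new = new          # dict standing in for the MySQL cluster
        self.percent = percent  # share of members whose reads use the new store
        self.mismatches = []

    def write(self, member_id, record):
        # Real-time dual write keeps both stores current during the migration.
        self.old[member_id] = record
        self.new[member_id] = record

    def read(self, member_id):
        old_value = self.old.get(member_id)
        if not in_gray_release(member_id, self.percent):
            return old_value
        new_value = self.new.get(member_id)
        if new_value != old_value:
            # Validation: record the difference and serve the trusted old value.
            self.mismatches.append(member_id)
            return old_value
        return new_value
```

Raising `percent` step by step while `mismatches` stays empty mirrors the gradual shift of traffic described above.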
To guard against DAL component failures, a fallback path writes data to ES as a secondary source; if MySQL or DAL fails, reads can be switched to ES and later synchronized back.
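A rough sketch of this fallback path is given below. The `MemberStore` class and its fields are hypothetical, with dictionaries standing in for MySQL and ES. It also assumes that writes keep flowing to ES during a MySQL outage and are replayed afterwards, which is one possible reading of "later synchronized back".

```python
class MemberStore:
    """Sketch: ES holds a secondary copy that can serve reads if MySQL fails."""

    def __init__(self):
        self.mysql = {}            # dict standing in for MySQL behind the DAL
        self.es = {}               # dict standing in for the ES copy
        self.mysql_available = True
        self.pending_sync = []     # ids written while MySQL was unavailable

    def write(self, member_id, record):
        self.es[member_id] = record
        if self.mysql_available:
            self.mysql[member_id] = record
        else:
            # Remember what MySQL missed so it can be replayed later.
            self.pending_sync.append(member_id)

    def read(self, member_id):
        source = self.mysql if self.mysql_available else self.es
        return source.get(member_id)

    def recover_mysql(self):
        """Replay writes that only reached ES, then switch reads back."""
        for member_id in self.pending_sync:
            self.mysql[member_id] = self.es[member_id]
        self.pending_sync.clear()
        self.mysql_available = True
```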
Abnormal member relationships caused by concurrency bugs were identified and mitigated by tightening the business logic and implementing precise governance.
Looking forward, the system will adopt finer‑grained flow‑control (hotspot limiting, per‑account throttling, global TPS caps) and downgrade strategies based on response latency and error rates, with comprehensive account‑level governance.
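Since these policies are still planned, the following is only one way they might look. It sketches per‑account throttling with a token bucket and a downgrade check based on average latency and error rate. All class names, function names, and thresholds are assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class AccountThrottle:
    """Keeps one bucket per account so a single caller cannot starve the others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, account_id):
        bucket = self.buckets.get(account_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[account_id] = bucket
        return bucket.allow()


def should_downgrade(latencies_ms, error_count, max_avg_ms=200.0, max_error_rate=0.05):
    """Decide whether to downgrade, given recent latencies and an error count."""
    if not latencies_ms:
        return False
    average = sum(latencies_ms) / len(latencies_ms)
    error_rate = error_count / len(latencies_ms)
    return average > max_avg_ms or error_rate > max_error_rate
```

A global TPS cap could reuse the same `TokenBucket` with a single shared instance placed in front of the per‑account check.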
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
