High‑Availability Architecture for a Large‑Scale Membership System

This article details the design and implementation of a high‑availability, high‑performance membership system that serves billions of users across multiple platforms, covering Elasticsearch dual‑center clusters, traffic‑isolated three‑cluster setups, Redis caching strategies, MySQL dual‑center partitioning, and advanced flow‑control and degradation mechanisms.

Architecture Digest

Oct 30, 2022

The membership system is a core service tightly coupled with the order flow of all business lines; any failure would block user orders across the entire company, so it must guarantee high performance and high availability.

After the merger of two companies, the system needed to unify member data across multiple apps and mini‑programs, leading to massive request volumes (over 20,000 TPS during peak periods). To achieve high availability, a dual‑center Elasticsearch master‑slave cluster was deployed across two data centers, with automatic failover via configuration switches and data synchronization via MQ.
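The failover idea can be sketched as follows. This is a minimal in-memory illustration, not the system's actual code: `FakeCluster`, `DualCenterRouter`, the `active` flag, and the list used in place of the MQ are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Stand-in for an Elasticsearch cluster client (hypothetical)."""
    name: str
    docs: dict = field(default_factory=dict)

    def index(self, member_id, doc):
        self.docs[member_id] = doc

    def get(self, member_id):
        return self.docs.get(member_id)


class DualCenterRouter:
    """Sends reads and writes to whichever cluster the config marks as active."""

    def __init__(self, primary, standby):
        self.clusters = {"primary": primary, "standby": standby}
        self.active = "primary"  # assumed to come from a config center in practice
        self.mq = []             # stand-in for the MQ that carries sync events

    def _other(self):
        return "standby" if self.active == "primary" else "primary"

    def write(self, member_id, doc):
        self.clusters[self.active].index(member_id, doc)
        self.mq.append((member_id, doc))  # publish for the other data center

    def replicate(self):
        """Consumer side: apply queued writes to the non-active cluster."""
        target = self.clusters[self._other()]
        while self.mq:
            member_id, doc = self.mq.pop(0)
            target.index(member_id, doc)

    def read(self, member_id):
        return self.clusters[self.active].get(member_id)

    def fail_over(self):
        """Flip the config switch so traffic moves to the other cluster."""
        self.active = self._other()
```

After `fail_over()`, reads are served by the cluster that previously only received replicated writes, so anything still sitting in the queue at switch time would be missing until it is replayed.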

To further isolate traffic, a three‑cluster Elasticsearch architecture was introduced: a primary cluster for critical order‑related queries, a secondary cluster for high‑TPS marketing activities, and a third cluster for other workloads, enabling fine‑grained isolation, circuit breaking, and rate limiting.
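One way to picture the isolation is a lookup from caller to workload class to cluster. The sketch below assumes callers identify themselves by name; the cluster names and caller names are invented for illustration.

```python
from enum import Enum


class Workload(Enum):
    ORDER = "order"          # critical order-related queries
    MARKETING = "marketing"  # high-TPS campaign traffic
    OTHER = "other"          # everything else


# Hypothetical cluster names, one per workload class.
CLUSTER_FOR = {
    Workload.ORDER: "es-primary",
    Workload.MARKETING: "es-marketing",
    Workload.OTHER: "es-general",
}

# Hypothetical caller registry; unknown callers fall into OTHER.
CALLER_WORKLOAD = {
    "order-service": Workload.ORDER,
    "coupon-campaign": Workload.MARKETING,
}


def pick_cluster(caller: str) -> str:
    """Return the cluster a caller's queries should be sent to."""
    workload = CALLER_WORKLOAD.get(caller, Workload.OTHER)
    return CLUSTER_FOR[workload]
```

Because each workload class has its own cluster, a flash-sale spike from a marketing caller cannot consume the resources that order queries depend on.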

Elasticsearch itself was tuned in depth: shards were distributed evenly across nodes, thread pools were sized to match the hardware, each shard was kept under 50 GB, unnecessary text fields were removed, filter clauses replaced scoring queries where relevance ranking was not needed, sorting was moved to the application layer, and routing keys were added so that a query touches only the relevant shard. Together these changes significantly reduced CPU usage and query latency.
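Three of these optimizations (filter context, routing, and application-side sorting) can be shown with a request built as plain data. The index name `members` and the fields `member_id`, `status`, and `created_at` are assumptions made for the example; the `bool`/`filter`/`term` structure and the `routing` search parameter are standard Elasticsearch query DSL.

```python
def build_member_search(member_id: str, status: str) -> dict:
    """Build a search request that uses filter context and a routing key."""
    body = {
        "query": {
            "bool": {
                # Clauses under "filter" are not scored and can be cached.
                "filter": [
                    {"term": {"member_id": member_id}},
                    {"term": {"status": status}},
                ]
            }
        },
        "size": 100,
        # No "sort" clause: ordering is done by the caller, see sort_in_app.
    }
    # Routing by member ID lets the search hit one shard instead of all of them,
    # provided documents were indexed with the same routing value.
    return {"index": "members", "routing": member_id, "body": body}


def sort_in_app(hits: list) -> list:
    """Sort hits newest first in the application instead of in Elasticsearch."""
    return sorted(hits, key=lambda hit: hit["created_at"], reverse=True)
```

Routing only helps if the same value is used at index time, which is why the member ID, a stable key, is the natural choice here.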

A Redis cache was placed in front of ES to relieve query pressure. It runs as a dual‑center, multi‑cluster setup: writes go to both data centers, reads are served from the local one, and the cache hit rate exceeds 90%. Because ES is near real time, a freshly written document is not immediately searchable, so a distributed lock was introduced to prevent the cache from becoming inconsistent during that window.
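The summary does not spell out how the lock is used, so the following is one plausible reading, labeled as an assumption: a writer sets a short-lived lock and deletes the cached entry, and any reader that sees the lock returns its ES result without writing it back to the cache. `FakeRedis`, the key names, and the 2-second TTL are all hypothetical.

```python
import time


class FakeRedis:
    """In-memory stand-in for Redis with only the calls this sketch needs."""

    def __init__(self):
        self.data = {}
        self.expiry = {}

    def get(self, key):
        deadline = self.expiry.get(key)
        if deadline is not None and time.monotonic() >= deadline:
            self.delete(key)  # lazily expire the key
        return self.data.get(key)

    def set(self, key, value, ttl=None):
        self.data[key] = value
        if ttl is None:
            self.expiry.pop(key, None)
        else:
            self.expiry[key] = time.monotonic() + ttl

    def delete(self, key):
        self.data.pop(key, None)
        self.expiry.pop(key, None)


LOCK_TTL_SECONDS = 2.0  # assumed to be longer than the ES refresh delay


def update_member(index_doc, cache, member_id, doc):
    """Write path: lock first, then write to ES, then invalidate the cache."""
    cache.set(f"lock:{member_id}", "1", ttl=LOCK_TTL_SECONDS)
    index_doc(member_id, doc)  # may not be searchable until the next refresh
    cache.delete(f"member:{member_id}")


def read_member(search_doc, cache, member_id):
    """Read path: skip the cache write-back while the lock is held."""
    cached = cache.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = search_doc(member_id)
    if cache.get(f"lock:{member_id}") is None:
        cache.set(f"member:{member_id}", doc)
    return doc
```

Once the lock expires, the next read repopulates the cache from ES, by which time the new document should be visible to searches.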

The relational member database was migrated from a single‑instance SQL Server to a dual‑center, partitioned MySQL cluster (over 1,000 shards, with one master and three slaves per center). The migration combined a full sync, an incremental sync, and real‑time dual writes with retry logic, followed by a gradual gray release of traffic using A/B testing and result verification.
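The dual-write step and the shard lookup might look like the sketch below. The shard count of 1024, the CRC32 hash, the retry limit, and every function name are assumptions; the article only states that there are over 1,000 shards and that dual writes are retried.

```python
import zlib

SHARD_COUNT = 1024  # assumed; the source only says "over 1,000 shards"


def shard_for(member_id: str) -> int:
    """Map a member ID to a shard with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


class DualWriter:
    """Writes to the legacy store first, then to the new store with retries."""

    def __init__(self, write_old, write_new, max_retries=3):
        self.write_old = write_old
        self.write_new = write_new
        self.max_retries = max_retries
        self.failed = []  # rows left for a later incremental sync to repair

    def write(self, member_id, row):
        self.write_old(member_id, row)  # legacy store stays the source of truth
        for _ in range(self.max_retries):
            try:
                self.write_new(shard_for(member_id), member_id, row)
                return True
            except ConnectionError:
                continue  # transient failure, try again
        self.failed.append((member_id, row))
        return False


def results_match(read_old, read_new, member_id) -> bool:
    """Result verification: both stores must return the same row."""
    return read_old(member_id) == read_new(member_id)
```

Running `results_match` on sampled traffic during the gray release gives a concrete signal for when it is safe to move more reads to the new cluster.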

To guard against component failures, a fallback path was built in which the data access layer (DAL) can switch reads and writes to Elasticsearch if MySQL becomes unavailable; the data written during the outage is synchronized back to MySQL afterwards.
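A rough sketch of such a switchable data access layer follows. It uses dictionaries in place of the real stores and assumes ES already holds a copy of the member data; `MemberDal`, `use_fallback`, and `pending_sync` are hypothetical names.

```python
class MemberDal:
    """Data access layer that can fall back from MySQL to Elasticsearch."""

    def __init__(self, mysql, es):
        self.mysql = mysql          # dict standing in for the MySQL cluster
        self.es = es                # dict standing in for Elasticsearch
        self.use_fallback = False   # assumed to be flipped by ops or a health check
        self.pending_sync = []      # writes MySQL has not seen yet

    def save(self, member_id, row):
        if self.use_fallback:
            self.es[member_id] = row
            self.pending_sync.append((member_id, row))
        else:
            # The normal-path sync from MySQL to ES is outside this sketch.
            self.mysql[member_id] = row

    def load(self, member_id):
        store = self.es if self.use_fallback else self.mysql
        return store.get(member_id)

    def recover(self):
        """Replay writes made during the outage, then switch back to MySQL."""
        while self.pending_sync:
            member_id, row = self.pending_sync.pop(0)
            self.mysql[member_id] = row
        self.use_fallback = False
```

Replaying in arrival order matters: if the same member was updated twice during the outage, the later write must win in MySQL as well.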

Abnormal member relationships were identified and mitigated through complex logic checks and code‑level safeguards, preventing cross‑account data leakage.

Future work focuses on more granular flow‑control (hotspot limiting, per‑caller rules, global throttling) and degradation strategies based on response time and error rates, ensuring the system remains resilient under extreme load.
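A degradation rule driven by response time and error rate could be sketched as a sliding window over recent calls. The window size and both thresholds below are illustrative values, and `Degrader` is a hypothetical name rather than anything described in the source.

```python
from collections import deque


class Degrader:
    """Decides whether to degrade a dependency based on recent call outcomes."""

    def __init__(self, window=20, max_error_rate=0.5, max_avg_ms=200.0):
        self.samples = deque(maxlen=window)  # (ok, elapsed_ms), oldest dropped first
        self.max_error_rate = max_error_rate
        self.max_avg_ms = max_avg_ms

    def record(self, ok: bool, elapsed_ms: float) -> None:
        self.samples.append((ok, elapsed_ms))

    def should_degrade(self) -> bool:
        # Wait for a full window so a single early failure does not trip the rule.
        if len(self.samples) < self.samples.maxlen:
            return False
        total = len(self.samples)
        errors = sum(1 for ok, _ in self.samples if not ok)
        avg_ms = sum(ms for _, ms in self.samples) / total
        return errors / total > self.max_error_rate or avg_ms > self.max_avg_ms
```

A caller would check `should_degrade()` before each request and, when it returns true, serve a cached or default response instead of calling the struggling dependency.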
