High‑Availability Architecture and Optimization Strategies for a Large‑Scale Membership System

This article details the design, high‑availability mechanisms, traffic isolation, performance optimizations, caching, database migration, and refined flow‑control and degradation strategies employed to keep a billion‑user membership system reliable and performant under extreme load.

Top Architect

Dec 13, 2024

The membership system is a core service behind order processing on every business line, so it must be both fast and highly available: any failure blocks user orders company‑wide.

ES High‑Availability Solution

A dual‑center primary‑backup Elasticsearch cluster is deployed across two data centers (A and B). The primary cluster handles reads/writes, while data is asynchronously replicated to the backup via MQ. In case of a primary outage, traffic is switched to the backup with minimal downtime.
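
The primary/backup arrangement above can be sketched roughly as follows. This is a minimal in-memory illustration, not the system's actual code: `InMemoryCluster` and `DualCenterStore` are hypothetical names, a `deque` stands in for the MQ, and the switch is triggered manually.

```python
from collections import deque


class InMemoryCluster:
    """Stand-in for one Elasticsearch cluster (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.docs[doc_id] = doc

    def get(self, doc_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.docs.get(doc_id)


class DualCenterStore:
    """Reads and writes hit the active cluster; the backup is fed through a queue."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary
        self.queue = deque()  # stands in for the MQ between data centers

    def write(self, doc_id, doc):
        self.active.index(doc_id, doc)
        if self.active is self.primary:
            # Replication is asynchronous: only enqueue here.
            self.queue.append((doc_id, doc))

    def replicate(self):
        """What an MQ consumer in the backup data center would do."""
        while self.queue:
            doc_id, doc = self.queue.popleft()
            self.backup.index(doc_id, doc)

    def read(self, doc_id):
        return self.active.get(doc_id)

    def fail_over(self):
        """Point all traffic at the backup cluster."""
        self.active = self.backup
```

Because replication is asynchronous, any message still in the queue at switch time is missing from the backup until it is consumed, which is the trade-off this kind of design accepts in exchange for low write latency.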

Traffic Isolation with Three‑Cluster ES Architecture

To protect the primary cluster from traffic spikes caused by marketing campaigns, a separate ES cluster handles high‑TPS marketing requests, isolating them from the critical order‑flow cluster.
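
One way to picture the isolation is a routing table keyed by traffic class. The class labels and cluster names below are assumptions made for illustration; the article does not say how requests are classified.

```python
# Hypothetical traffic classes mapped to hypothetical cluster names.
CLUSTER_BY_TRAFFIC = {
    "order": "es-order-primary",   # critical order-flow requests
    "marketing": "es-marketing",   # high-TPS campaign requests
}


def pick_cluster(traffic_class: str) -> str:
    """Return the cluster a request should be sent to.

    Unknown classes are rejected rather than silently routed to the
    order-flow cluster, so unclassified traffic cannot overload it.
    """
    try:
        return CLUSTER_BY_TRAFFIC[traffic_class]
    except KeyError:
        raise ValueError(f"unknown traffic class: {traffic_class!r}") from None
```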

Deep ES Optimizations

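Improvements include balancing shard distribution across nodes, tuning thread‑pool sizes, capping shard size at 50 GB, removing unnecessary text fields, using filters instead of scoring queries (filter clauses skip relevance scoring and their results can be cached), performing sorting in the application layer, and adding routing keys so that a request goes only to the shard holding the data instead of fanning out across all shards.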
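
Three of these optimizations (filter context, routing keys, application-side sorting) can be shown in a request built as plain data. The sketch below only constructs the request; the field names `member_id` and `updated_at` are assumptions, and the `routing` value would be passed as the search API's `routing` parameter by whichever client is in use.

```python
def build_member_lookup(member_id: str) -> dict:
    """Build a search request that stays in filter context and targets one shard."""
    return {
        # Same routing value at index and search time keeps the lookup on one shard.
        "routing": member_id,
        "body": {
            "query": {
                "bool": {
                    # Filter clauses do not compute relevance scores.
                    "filter": [{"term": {"member_id": member_id}}]
                }
            }
            # No "sort" clause: ordering is left to the application.
        },
    }


def sort_in_app(hits: list) -> list:
    """Sort results in the application layer, newest first."""
    return sorted(hits, key=lambda hit: hit["updated_at"], reverse=True)
```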

Member Data Caching with Redis

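A dual‑center Redis setup writes every cache update to both data centers and serves reads from the local one, providing high availability and reducing latency. Because Elasticsearch is only near real‑time, a read issued right after an update can still return the old document; a distributed lock keeps such stale results from being written back into the cache.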
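
The article does not spell out the locking protocol, so the following is only one plausible arrangement: an update takes a short-lived lock and invalidates the cache, and reads that arrive while the lock is held return data without repopulating the cache. Plain dictionaries stand in for the two Redis instances, and `MemberCache`, `lock_ttl`, and `load_from_es` are hypothetical names.

```python
import time


class MemberCache:
    def __init__(self, local, remote, lock_ttl=2.0, clock=time.monotonic):
        self.local = local      # dict standing in for Redis in this data center
        self.remote = remote    # dict standing in for Redis in the other one
        self.locks = {}         # member_id -> lock expiry time
        self.lock_ttl = lock_ttl
        self.clock = clock

    def _locked(self, member_id):
        return self.locks.get(member_id, 0.0) > self.clock()

    def on_update(self, member_id):
        """Call when a member record changes: lock, then invalidate both centers."""
        self.locks[member_id] = self.clock() + self.lock_ttl
        self.local.pop(member_id, None)
        self.remote.pop(member_id, None)

    def read(self, member_id, load_from_es):
        cached = self.local.get(member_id)   # reads are served locally
        if cached is not None:
            return cached
        value = load_from_es(member_id)
        # While the lock is held the value may predate the update,
        # so return it to the caller but do not cache it.
        if value is not None and not self._locked(member_id):
            self.local[member_id] = value    # write to both data centers
            self.remote[member_id] = value
        return value
```

The lock duration needs to exceed the Elasticsearch refresh delay for this to help; the 2-second default here is an illustrative value, not a figure from the article.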

MySQL Dual‑Center Partition Cluster

The legacy SQL Server store is migrated to a sharded MySQL cluster (1000+ shards, 1‑master‑3‑slave per data center) with sub‑millisecond replication, supporting seamless read/write routing via DBRoute.
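
How DBRoute picks a shard is not described here, so the sketch below simply assumes hash-based routing on a member ID, with writes going to a shard's master and reads to one of its slaves. The shard count of 1024 and the function names are illustrative assumptions.

```python
import zlib


def shard_for(member_id: str, shard_count: int = 1024) -> int:
    """Map a member ID to a shard index with a stable hash."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(member_id.encode("utf-8")) % shard_count


def route(member_id: str, is_write: bool, shard_count: int = 1024):
    """Return (shard index, node role) for a request."""
    role = "master" if is_write else "slave"
    return shard_for(member_id, shard_count), role
```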

Zero‑Downtime Migration Strategy

The migration uses full data sync, incremental sync, and real‑time dual‑write. During a trial phase, writes go to SQL Server while asynchronous writes to MySQL are retried on failure; once stable, traffic is gradually shifted to MySQL using A/B testing and result verification.
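
The dual-write and gradual cut-over steps can be sketched as below. This is a simplified, synchronous illustration under stated assumptions: a dictionary stands in for SQL Server, `target_write` is a hypothetical callable that writes to MySQL, and the rollout bucket is derived from a hash of the member ID. In the system described, the MySQL write is asynchronous.

```python
import zlib


class DualWriter:
    """Legacy store stays authoritative; the new store is written with retries."""

    def __init__(self, legacy: dict, target_write, max_retries: int = 3):
        self.legacy = legacy
        self.target_write = target_write
        self.max_retries = max_retries
        self.failed = []  # writes to reconcile later by incremental sync

    def write(self, key, value) -> bool:
        self.legacy[key] = value
        for _ in range(self.max_retries):
            try:
                self.target_write(key, value)
                return True
            except ConnectionError:
                continue  # transient failure: try again
        self.failed.append((key, value))
        return False


def use_mysql(member_id: str, rollout_percent: int) -> bool:
    """Send a stable slice of members to MySQL; raise the percentage over time."""
    return zlib.crc32(member_id.encode("utf-8")) % 100 < rollout_percent
```

Hashing the member ID keeps each member on the same side of the split between requests, which makes it easier to compare results from the two stores during verification.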

Fallback to ES as Primary Store

If the DAL component or MySQL fails, reads/writes can be switched to Elasticsearch, with later synchronization back to MySQL once the issue is resolved.

Abnormal Account Governance

Complex logic identifies and fixes accounts that mistakenly bind multiple user identities, preventing cross‑account data leakage and order manipulation.

Refined Flow‑Control and Degradation Strategies

Global, hotspot, and caller‑based rate‑limiting protect the system from overload; degradation triggers on high average response times or error ratios, automatically circuit‑breaking problematic paths.
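
As a rough sketch of two of these mechanisms, the code below pairs a per-caller fixed-window rate limiter with a breaker that opens when the recent error ratio or average response time crosses a threshold. Class names, thresholds, and the sampling scheme are assumptions; a production breaker would also need a recovery (half-open) state, which is omitted here.

```python
from collections import defaultdict, deque


class CallerRateLimiter:
    """Fixed-window request counter kept separately for each caller."""

    def __init__(self, limit_per_window: int, window_seconds: float = 1.0):
        self.limit = limit_per_window
        self.window = window_seconds
        self.counts = defaultdict(int)
        self.window_start = defaultdict(float)

    def allow(self, caller: str, now: float) -> bool:
        if now - self.window_start[caller] >= self.window:
            self.window_start[caller] = now   # start a new window
            self.counts[caller] = 0
        if self.counts[caller] >= self.limit:
            return False
        self.counts[caller] += 1
        return True


class DegradationBreaker:
    """Opens when recent calls are too slow on average or fail too often."""

    def __init__(self, max_error_ratio=0.5, max_avg_ms=200.0, sample_size=20):
        self.max_error_ratio = max_error_ratio
        self.max_avg_ms = max_avg_ms
        self.results = deque(maxlen=sample_size)  # (ok, latency_ms) pairs

    def record(self, ok: bool, latency_ms: float) -> None:
        self.results.append((ok, latency_ms))

    def is_open(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough samples to judge
        errors = sum(1 for ok, _ in self.results if not ok)
        avg_ms = sum(ms for _, ms in self.results) / len(self.results)
        return (errors / len(self.results) > self.max_error_ratio
                or avg_ms > self.max_avg_ms)
```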


