High‑Availability Architecture and Optimization Strategies for a Large‑Scale Membership System

This article describes the design, high‑availability solutions, traffic isolation, deep performance optimizations, caching strategies, dual‑center MySQL partitioning, seamless migration, and future fine‑grained flow‑control and degradation techniques employed to keep a billion‑user membership system stable and performant under extreme load.

Top Architect

Background – The membership system is a core service for all business lines; any outage blocks order placement across the company. After the merger of Tongcheng and eLong, the system must support cross‑platform member queries for multiple apps and mini‑programs, handling traffic spikes exceeding 20,000 TPS during holidays.

ES High‑Availability Solution – A dual‑center primary‑backup Elasticsearch cluster is deployed across two data centers (A and B). The primary cluster handles reads/writes, while changes are replicated to the backup via MQ. In case of node or data‑center failure, traffic is switched to the backup cluster within seconds, and later synchronized back.
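The failover behaviour described above can be sketched as a small router. The source does not say how the switch is triggered, so the consecutive-failure counter, the `max_failures` threshold, and the class and method names here are all assumptions for illustration; the two clusters are represented by plain callables.

```python
from typing import Callable, Dict

SearchFn = Callable[[Dict], Dict]


class FailoverRouter:
    """Sends searches to the primary ES cluster and fails over to the backup."""

    def __init__(self, primary: SearchFn, backup: SearchFn, max_failures: int = 3):
        self._primary = primary
        self._backup = backup
        self._max_failures = max_failures  # hypothetical threshold
        self._failures = 0
        self.using_backup = False

    def search(self, query: Dict) -> Dict:
        if self.using_backup:
            return self._backup(query)
        try:
            result = self._primary(query)
        except ConnectionError:
            # Count the failure; after enough in a row, stay on the backup.
            self._failures += 1
            if self._failures >= self._max_failures:
                self.using_backup = True
            # Serve this request from the backup in either case.
            return self._backup(query)
        self._failures = 0  # a success resets the counter
        return result

    def switch_back(self) -> None:
        """Call once the primary has recovered and data has been synced back."""
        self._failures = 0
        self.using_backup = False
```

In practice the switch could equally be a manual configuration change; the point of the sketch is only that callers go through one routing layer, so the active cluster can change without touching business code.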

ES Traffic Isolation Three‑Cluster Architecture – To protect the main ES cluster from massive marketing‑driven traffic, a separate ES cluster handles high‑TPS flash‑sale requests, ensuring the order‑critical path remains unaffected.
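A minimal sketch of the isolation idea, assuming each request carries a label identifying its caller. The cluster names and the set of high-TPS sources are hypothetical; the source only states that flash-sale traffic goes to its own cluster.

```python
MAIN_CLUSTER = "es-main"            # hypothetical name: order-critical path
MARKETING_CLUSTER = "es-marketing"  # hypothetical name: flash-sale traffic

# Callers assumed to generate marketing-driven spikes (illustrative values).
HIGH_TPS_SOURCES = {"flash_sale", "marketing_campaign"}


def pick_cluster(source: str) -> str:
    """Return the ES cluster that should serve a request from `source`."""
    if source in HIGH_TPS_SOURCES:
        return MARKETING_CLUSTER
    return MAIN_CLUSTER
```

Because the decision depends only on the caller, a surge from a marketing campaign can saturate its own cluster without affecting queries on the order path.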

Deep ES Optimizations – Several issues were addressed: uneven shard distribution, oversized thread pools, overly large shards (100 GB), redundant text/keyword fields, use of query where filter would suffice, and missing routing keys. After tuning, CPU usage and query latency dropped dramatically, as shown in the performance charts.
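Two of these fixes, filter instead of query and the use of a routing key, can be illustrated with a request built as a plain dictionary. The index name `member` and the field name `member_id` are assumptions; the `bool`/`filter`/`term` structure is standard Elasticsearch query DSL. How the dictionary is passed to a client depends on the client version.

```python
def build_member_search(member_id: str) -> dict:
    """Build search parameters for an exact-match member lookup."""
    return {
        "index": "member",     # hypothetical index name
        # Routing sends the search to the single shard holding this key
        # instead of fanning out to every shard.
        "routing": member_id,
        "query": {
            "bool": {
                # Filter context skips relevance scoring and is cacheable,
                # which suits exact-match lookups.
                "filter": [{"term": {"member_id": member_id}}]
            }
        },
    }
```

Routing at search time only helps if the documents were indexed with the same routing value, so the two sides have to agree on the key.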

Member Redis Cache Scheme – Caching was initially avoided because member data must be consistent in real time. A Redis cache was later introduced, with a 2‑second distributed lock guarding against stale entries caused by Elasticsearch's near‑real‑time indexing delay. Cache hit rates exceed 90 %, greatly reducing ES load.
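One plausible reading of the lock scheme, sketched with in-memory dictionaries standing in for Redis: an update takes a 2-second lock and deletes the cached entry, and while the lock is held, reads still return data from ES but do not write it back to the cache. The source does not spell out these steps, so the class, its method names, and the exact ordering are assumptions.

```python
import time
from typing import Callable, Dict


class MemberCache:
    LOCK_TTL = 2.0  # seconds, the lock duration named in the text

    def __init__(self, load_from_es: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic):
        self._load = load_from_es
        self._clock = clock
        self._cache: Dict[str, str] = {}    # stand-in for Redis values
        self._locks: Dict[str, float] = {}  # member_id -> lock expiry time

    def _locked(self, member_id: str) -> bool:
        expiry = self._locks.get(member_id)
        return expiry is not None and self._clock() < expiry

    def on_update(self, member_id: str) -> None:
        """Call when the member is updated in ES."""
        self._locks[member_id] = self._clock() + self.LOCK_TTL
        self._cache.pop(member_id, None)

    def get(self, member_id: str) -> str:
        if member_id in self._cache:
            return self._cache[member_id]
        value = self._load(member_id)
        # ES may still return the pre-update document, so do not cache it
        # while the update lock is held.
        if not self._locked(member_id):
            self._cache[member_id] = value
        return value
```

With a real Redis deployment, the lock would typically be taken with `SET key value NX PX 2000`, which sets the key only if it is absent and expires it after 2,000 ms.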

Redis Dual‑Center Multi‑Cluster Architecture – Two Redis clusters (one in each data center) are written synchronously; reads are served locally, providing high availability even if an entire data center fails.
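The write-both, read-local pattern can be sketched as follows. Plain dictionaries stand in for the two Redis clusters, and the class name, the data-center labels, and the choice to report failed writes rather than raise are assumptions, not details from the source.

```python
from typing import Dict, List, Optional


class DualCenterCache:
    """Writes to every data center's cluster; reads only from the local one."""

    def __init__(self, clusters: Dict[str, dict], local_dc: str):
        if local_dc not in clusters:
            raise ValueError(f"unknown data center: {local_dc}")
        self._clusters = clusters
        self._local = local_dc

    def set(self, key: str, value: str) -> List[str]:
        """Write to all clusters; return the data centers whose write failed."""
        failed = []
        for dc, cluster in self._clusters.items():
            try:
                cluster[key] = value
            except ConnectionError:
                failed.append(dc)  # caller decides whether to retry or alert
        return failed

    def get(self, key: str) -> Optional[str]:
        # Local reads avoid cross-data-center latency.
        return self._clusters[self._local].get(key)
```

Since each data center holds a full copy, application instances in the surviving data center keep reading from their local cluster if the other one is lost.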

High‑Availability Member Primary Database – The relational member data was migrated from a single SQL Server instance to a dual‑center, partitioned MySQL cluster (over 1,000 shards, in a 1‑master‑3‑slave setup). Real‑time dual‑write enables a seamless cut‑over, supported by automated gray‑scale traffic migration and consistency checks.
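A minimal sketch of how a member ID might be mapped to one of the shards. The source gives only "over 1,000 shards", so the count of 1,024, the CRC32 hash, and the table naming scheme are assumptions chosen for illustration.

```python
import zlib

SHARD_COUNT = 1024  # assumption: the text says only "over 1,000 shards"


def shard_for(member_id: str) -> int:
    """Map a member ID to a shard index in [0, SHARD_COUNT)."""
    # CRC32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def table_name(member_id: str) -> str:
    """Hypothetical naming scheme: member_0000 .. member_1023."""
    return f"member_{shard_for(member_id):04d}"
```

A stable hash matters here because every application instance, in both data centers, has to route the same member to the same shard.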

Migration and Gray‑Scale Strategy – A phased approach of full data sync, real‑time dual‑write, incremental sync, and A/B traffic gray‑scale was used to move from SQL Server to MySQL without downtime, with fallback mechanisms for any inconsistency.
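The dual-write, consistency-check, and gray-scale steps can be sketched with dictionaries standing in for the two databases. The percentage-based bucketing and all function names are assumptions; the source does not describe how traffic was split.

```python
import zlib
from typing import Dict, List


def dual_write(sqlserver: Dict[str, dict], mysql: Dict[str, dict],
               member_id: str, record: dict) -> None:
    """Write every change to both stores during the migration window."""
    sqlserver[member_id] = dict(record)
    mysql[member_id] = dict(record)


def find_mismatches(sqlserver: Dict[str, dict], mysql: Dict[str, dict]) -> List[str]:
    """Return member IDs whose records differ or exist in only one store."""
    ids = set(sqlserver) | set(mysql)
    return sorted(i for i in ids if sqlserver.get(i) != mysql.get(i))


def reads_from_mysql(member_id: str, rollout_percent: int) -> bool:
    """Gray-scale switch: route a stable slice of members to MySQL."""
    bucket = zlib.crc32(member_id.encode("utf-8")) % 100
    return bucket < rollout_percent
```

Raising `rollout_percent` step by step moves read traffic gradually, and setting it back to 0 returns all reads to SQL Server if the consistency check reports mismatches.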

Abnormal Member Relationship Governance – Complex logic was added to detect and block erroneous cross‑account bindings, preventing data leakage and financial loss.

Future Fine‑Grained Flow‑Control and Degradation – Plans include hotspot throttling, per‑client flow‑control, global rate limiting, and degradation based on average response time or error ratios to ensure the system remains resilient under extreme load.
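Degradation based on average response time or error ratio can be sketched as a rule over a sliding window of recent calls. Since the source describes this only as a plan, the window size, thresholds, and class name below are all illustrative assumptions.

```python
from collections import deque


class DegradeRule:
    """Decides whether to degrade based on recent latency and errors."""

    def __init__(self, window: int = 100, max_avg_rt_ms: float = 200.0,
                 max_error_ratio: float = 0.5, min_samples: int = 10):
        self._samples = deque(maxlen=window)  # (response_time_ms, succeeded)
        self._max_avg_rt_ms = max_avg_rt_ms
        self._max_error_ratio = max_error_ratio
        self._min_samples = min_samples

    def record(self, rt_ms: float, ok: bool) -> None:
        self._samples.append((rt_ms, ok))

    def should_degrade(self) -> bool:
        n = len(self._samples)
        if n < self._min_samples:
            return False  # too little data to judge
        avg_rt = sum(rt for rt, _ in self._samples) / n
        error_ratio = sum(1 for _, ok in self._samples if not ok) / n
        return avg_rt > self._max_avg_rt_ms or error_ratio > self._max_error_ratio
```

A caller would check `should_degrade()` before each request and, when it returns true, serve a fallback such as a cached or default response instead of hitting the backend.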
