
High‑Availability Architecture and Migration Strategies for a Large‑Scale Membership System

This article details the design and implementation of a highly available membership platform, covering Elasticsearch dual‑center primary‑backup clusters, traffic‑isolation architectures, deep ES optimizations, Redis caching and dual‑center clusters, MySQL partitioned clusters, seamless SQL Server‑to‑MySQL migration, abnormal member governance, and refined flow‑control and degradation strategies.

Top Architect

Background

The membership system is a core service for order processing across all business lines; any outage blocks user orders, so it must provide high performance and high availability. After the merger of Tongcheng and eLong, the system must handle billions of members and massive concurrent traffic, especially during peak periods.

Elasticsearch Dual‑Center Primary‑Backup Cluster

A dual‑center architecture deploys the ES primary cluster in data‑center A and a standby cluster in data‑center B. Writes go to the primary cluster and are synchronized to the standby via MQ. If the primary fails, traffic is switched to the standby with minimal downtime.
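
The write path and failover described above can be sketched as follows. This is a minimal in-memory illustration, not the system's actual code: `InMemoryCluster`, `DualCenterStore`, and the `deque` standing in for the MQ are all hypothetical, and the per-request fallback in `read` is an assumption (a production switch may well be a configuration change instead).

```python
from collections import deque


class InMemoryCluster:
    """Stand-in for an ES cluster; `healthy` simulates data-center availability."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.healthy = True

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        self.docs[doc_id] = doc

    def get(self, doc_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.docs.get(doc_id)


class DualCenterStore:
    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.queue = deque()  # stand-in for the MQ topic carrying sync events

    def write(self, doc_id, doc):
        # Write to the primary, then enqueue the change for the standby.
        self.primary.index(doc_id, doc)
        self.queue.append((doc_id, doc))

    def drain_queue(self):
        # Consumer side: replay queued writes onto the standby cluster.
        while self.queue:
            doc_id, doc = self.queue.popleft()
            self.standby.index(doc_id, doc)

    def read(self, doc_id):
        # Prefer the primary; fall back to the standby if it is down.
        try:
            return self.primary.get(doc_id)
        except ConnectionError:
            return self.standby.get(doc_id)


primary = InMemoryCluster("es-dc-a")
standby = InMemoryCluster("es-dc-b")
store = DualCenterStore(primary, standby)

store.write("m1", {"member_id": "m1", "level": 3})
store.drain_queue()              # the standby catches up via the queue
primary.healthy = False          # simulate a failure of data-center A
failover_result = store.read("m1")
```

Because the standby is fed asynchronously, anything still in the queue at failover time is not yet visible on the standby; the sketch makes that lag explicit through `drain_queue`.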

Traffic Isolation Three‑Cluster Architecture

To protect the main order flow from marketing spikes, a dedicated ES cluster handles high‑TPS marketing requests, isolating them from the primary ES cluster.
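
A rough sketch of the routing decision behind this isolation is shown below. The cluster names and caller tags are hypothetical; the article does not say how marketing requests are identified, so tagging by caller is an assumption.

```python
MAIN_CLUSTER = "es-main"            # hypothetical name: serves the order flow
MARKETING_CLUSTER = "es-marketing"  # hypothetical name: absorbs marketing spikes

# Assumption: each upstream caller identifies itself with a tag.
MARKETING_CALLERS = {"flash-sale", "coupon-campaign"}


def pick_cluster(caller: str) -> str:
    """Send marketing traffic to its own cluster; everything else goes to main."""
    if caller in MARKETING_CALLERS:
        return MARKETING_CLUSTER
    return MAIN_CLUSTER


order_target = pick_cluster("order-service")
promo_target = pick_cluster("flash-sale")
```

Defaulting unknown callers to the main cluster keeps the order flow working even when a caller tag is missing.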

ES Deep Optimization

Optimizations include reducing the thread‑pool size, limiting shard memory, using filter context instead of query context (filters skip relevance scoring and can be cached), removing unnecessary text fields, moving sorting to the application layer, and adding routing keys so that a search touches only the shards for that key. Together these dramatically lowered CPU usage and query latency.
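
Three of these optimizations (filter context, routing, and application-side sorting) can be illustrated with a plain request builder. The index name `members`, the fields `phone` and `registered_at`, and the choice of routing key are assumptions made for the example; the request is built as a dictionary rather than sent through a client.

```python
def build_member_search(phone: str, routing_key: str) -> dict:
    """Build a search request that uses filter context and a routing key."""
    return {
        "index": "members",      # hypothetical index name
        "routing": routing_key,  # limits the search to the shards for this key
        "body": {
            # `bool.filter` runs in filter context: no scoring, cacheable.
            "query": {"bool": {"filter": [{"term": {"phone": phone}}]}},
            "size": 50,
            # No "sort" clause: ordering is done by the application instead.
        },
    }


def sort_in_app(hits: list) -> list:
    """Sort hits newest-first on the application side (hypothetical field)."""
    return sorted(hits, key=lambda h: h["_source"]["registered_at"], reverse=True)


request = build_member_search("13800000000", routing_key="13800000000")
sample_hits = [
    {"_source": {"member_id": "m1", "registered_at": "2020-01-01"}},
    {"_source": {"member_id": "m2", "registered_at": "2023-05-01"}},
]
ordered_ids = [h["_source"]["member_id"] for h in sort_in_app(sample_hits)]
```

Routing only helps if documents were indexed with the same routing value that the search later supplies.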

Redis Cache Solution

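Because Elasticsearch is near‑real‑time (a write becomes searchable roughly 1 s later), a read that arrives just after an update can fetch the old document from ES and write it back to Redis. A distributed lock combined with delayed cache invalidation closes this window; the cache achieves a hit rate above 90 % and relieves pressure on ES.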
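
One possible reading of this scheme is sketched below with in-memory stand-ins. `FakeRedis`, `FakeES`, the key names, and the 2-second lock TTL are all assumptions; with a real Redis the lock would typically be taken with `SET key value NX EX seconds`.

```python
class FakeRedis:
    """In-memory stand-in for Redis with key expiry, driven by a manual clock."""

    def __init__(self):
        self.now = 0.0
        self.data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ex=None):
        self.data[key] = (value, None if ex is None else self.now + ex)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self.now >= expires_at:
            del self.data[key]
            return None
        return value

    def delete(self, key):
        self.data.pop(key, None)


class FakeES:
    """Stand-in for ES: indexed documents become visible only after refresh()."""

    def __init__(self):
        self.visible, self.pending = {}, {}

    def index(self, doc_id, doc):
        self.pending[doc_id] = doc

    def refresh(self):
        self.visible.update(self.pending)
        self.pending.clear()

    def get(self, doc_id):
        return self.visible.get(doc_id)


LOCK_TTL_SECONDS = 2  # assumption: comfortably longer than the ~1 s ES delay


def update_member(cache, es, member_id, doc):
    # Mark the member as "recently updated", drop the cached copy, then write.
    cache.set(f"lock:{member_id}", "1", ex=LOCK_TTL_SECONDS)
    cache.delete(f"member:{member_id}")
    es.index(member_id, doc)


def get_member(cache, es, member_id):
    cached = cache.get(f"member:{member_id}")
    if cached is not None:
        return cached
    doc = es.get(member_id)
    # Skip cache population while the lock is held: ES may still be stale.
    if doc is not None and cache.get(f"lock:{member_id}") is None:
        cache.set(f"member:{member_id}", doc)
    return doc


cache, es = FakeRedis(), FakeES()
es.index("m1", {"level": 1})
es.refresh()
get_member(cache, es, "m1")                    # warms the cache with level 1

update_member(cache, es, "m1", {"level": 2})   # not yet visible in ES
stale_read = get_member(cache, es, "m1")       # old document, but not cached
cached_in_window = cache.get("member:m1")

es.refresh()
cache.now = 2.5                                # the lock has expired
fresh_read = get_member(cache, es, "m1")
```

A reader inside the lock window may still see the old document once, but that stale value never lands in Redis, so it cannot outlive the window.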

Redis Dual‑Center Multi‑Cluster

Two Redis clusters are deployed, one in data‑center A and one in data‑center B. Each write goes to both clusters and is treated as successful only when both writes succeed, while reads are served from the local cluster, ensuring high availability even if one data‑center fails.
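
The dual-write rule can be sketched as below. `FakeRedisCluster` is a hypothetical stand-in, and the best-effort rollback on partial failure is an assumption; the article only states that a write counts as successful when both clusters accept it.

```python
class FakeRedisCluster:
    """In-memory stand-in for one Redis cluster; `up` simulates availability."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def set(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)

    def delete(self, key):
        if self.up:
            self.data.pop(key, None)


def dual_write(clusters, key, value) -> bool:
    """Return True only if every cluster accepted the write."""
    written = []
    for cluster in clusters:
        try:
            cluster.set(key, value)
            written.append(cluster)
        except ConnectionError:
            # Assumption: undo partial writes by deleting the key, which for a
            # cache simply forces the next read to reload from the source.
            for done in written:
                done.delete(key)
            return False
    return True


dc_a = FakeRedisCluster("redis-dc-a")
dc_b = FakeRedisCluster("redis-dc-b")

both_ok = dual_write([dc_a, dc_b], "member:m1", "v1")
local_read = dc_a.get("member:m1")   # reads stay inside the local data-center

dc_b.up = False
partial = dual_write([dc_a, dc_b], "member:m2", "v2")
```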

MySQL Dual‑Center Partition Cluster

The member database is sharded into over 1,000 partitions, each holding about a million rows, and deployed as a 1‑master‑3‑slave cluster across two data‑centers with sub‑millisecond replication.
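
A minimal sketch of how a member ID might be mapped to one of these partitions follows. The shard count of 1024, the CRC32 hash, and the `member_NNNN` table naming are assumptions; the article gives only "over 1,000" partitions and does not name the sharding function.

```python
import zlib

SHARD_COUNT = 1024  # assumption: the article says only "over 1,000"


def shard_for(member_id: str) -> int:
    """Map a member ID to a stable shard number in [0, SHARD_COUNT)."""
    return zlib.crc32(member_id.encode("utf-8")) % SHARD_COUNT


def table_for(member_id: str) -> str:
    """Hypothetical naming scheme: one table per shard, e.g. member_0042."""
    return f"member_{shard_for(member_id):04d}"


shard = shard_for("m-10001")
table = table_for("m-10001")
```

A stable hash such as CRC32 is used here instead of Python's built-in `hash`, which is randomized per process for strings and would route the same member to different shards across restarts.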

Smooth Migration from SQL Server to MySQL

The migration follows a "full sync → incremental sync → real‑time gray‑switch" approach, using real‑time dual writes, retry mechanisms, and A/B traffic routing to ensure data consistency and zero‑downtime cut‑over.
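
The dual-write and gray-switch steps can be sketched as follows. `FlakyStore`, `MigrationRouter`, the retry count, and the percentage-based read routing are hypothetical; they illustrate the idea rather than the team's actual tooling.

```python
import zlib


class FlakyStore:
    """In-memory stand-in for a database; fails the first N writes."""

    def __init__(self, failures_before_success=0):
        self.rows = {}
        self.remaining_failures = failures_before_success

    def put(self, key, row):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated write failure")
        self.rows[key] = row

    def get(self, key):
        return self.rows.get(key)


class MigrationRouter:
    def __init__(self, old_db, new_db, new_read_percent=0, max_retries=3):
        self.old_db = old_db                      # SQL Server stand-in
        self.new_db = new_db                      # MySQL stand-in
        self.new_read_percent = new_read_percent  # gray-switch dial, 0..100
        self.max_retries = max_retries
        self.failed = []                          # IDs left for later repair

    def write(self, member_id, row) -> bool:
        # The old database stays authoritative while the migration runs.
        self.old_db.put(member_id, row)
        for _ in range(self.max_retries):
            try:
                self.new_db.put(member_id, row)
                return True
            except ConnectionError:
                continue
        self.failed.append(member_id)
        return False

    def read(self, member_id):
        # Stable bucketing: a given member always lands on the same side.
        bucket = zlib.crc32(member_id.encode("utf-8")) % 100
        db = self.new_db if bucket < self.new_read_percent else self.old_db
        return db.get(member_id)


old_db, new_db = FlakyStore(), FlakyStore(failures_before_success=2)
router = MigrationRouter(old_db, new_db, new_read_percent=0)

write_ok = router.write("m1", {"name": "alice"})  # succeeds on the third try
read_before = router.read("m1")                   # served by the old database

router.new_read_percent = 100                     # full cut-over
read_after = router.read("m1")                    # served by the new database
```

Raising `new_read_percent` in small steps and comparing results from both sides is one way to gain confidence before the final switch.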

Abnormal Member Relationship Governance

Complex bugs that caused cross‑account data leakage were identified and fixed at the code‑logic layer, preventing scenarios where a user could see or modify another user's orders.

Refined Flow‑Control and Degradation Strategies

Three levels of flow control (hotspot, per‑caller, global) and degradation based on response time, error count, and error ratio are applied to protect the system under extreme load, ensuring the core order flow remains stable.
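
Two of these mechanisms, per-caller and global limits plus error-ratio degradation, are sketched below. The class names, fixed-window counting, and thresholds are assumptions; hotspot limiting and the response-time and error-count triggers are left out for brevity.

```python
from collections import defaultdict


class FlowController:
    """Fixed-window request counters for per-caller and global limits."""

    def __init__(self, per_caller_limit, global_limit):
        self.per_caller_limit = per_caller_limit
        self.global_limit = global_limit
        self.caller_counts = defaultdict(int)
        self.global_count = 0

    def allow(self, caller: str) -> bool:
        if self.global_count >= self.global_limit:
            return False  # the whole system is at capacity
        if self.caller_counts[caller] >= self.per_caller_limit:
            return False  # this caller has used up its share
        self.caller_counts[caller] += 1
        self.global_count += 1
        return True

    def reset_window(self):
        # Called by a timer at the start of each window (e.g. every second).
        self.caller_counts.clear()
        self.global_count = 0


class ErrorRatioBreaker:
    """Degrade once the error ratio passes a threshold over enough requests."""

    def __init__(self, min_requests, max_error_ratio):
        self.min_requests = min_requests
        self.max_error_ratio = max_error_ratio
        self.total = 0
        self.errors = 0

    def record(self, ok: bool):
        self.total += 1
        if not ok:
            self.errors += 1

    def is_open(self) -> bool:
        if self.total < self.min_requests:
            return False  # too few samples to judge
        return self.errors / self.total > self.max_error_ratio


flow = FlowController(per_caller_limit=2, global_limit=3)
decisions = [flow.allow(c) for c in ["a", "a", "a", "b", "c"]]

breaker = ErrorRatioBreaker(min_requests=5, max_error_ratio=0.5)
for ok in [True, True, False, False, False, False]:
    breaker.record(ok)
degraded = breaker.is_open()
```

In the sample run, the third request from caller `a` is rejected by the per-caller limit and the request from `c` by the global limit, while the breaker opens because 4 of 6 calls failed.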
