How We Built 99.99% High Availability for a Billion‑User Membership System
This article details the end‑to‑end high‑availability architecture—including dual‑center Elasticsearch clusters, Redis caching with distributed locks, and a dual‑center MySQL partitioned setup—that enables a membership platform serving billions of users to sustain massive traffic while ensuring data consistency and rapid recovery.
Background
The membership system is a core service tightly linked to the order flow of all business lines; any outage prevents users from placing orders, so it must deliver high performance and high availability.
After the merger of two companies, the system must support multiple platforms (apps, mini‑programs) and handle cross‑marketing scenarios, leading to rapidly increasing request volume, with peak TPS exceeding 20,000.
Elasticsearch High‑Availability Solution
1. Dual‑center primary‑backup cluster architecture
Two data centers (A and B) host a primary ES cluster in A and a backup cluster in B. Writes go to the primary; data is synchronized to the backup via MQ. If the primary fails, traffic is switched to the backup instantly, and after recovery the data is resynced.
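The write path, MQ-based sync, and failover switch can be sketched with in-memory stand-ins for the two clusters and the message queue (all names here are illustrative, not the production components):

```python
from collections import deque

# In-memory stand-ins for the primary/backup ES clusters and the MQ.
primary, backup = {}, {}
sync_queue = deque()       # plays the role of the MQ between data centers
primary_healthy = True     # flipped by monitoring when center A fails

def write_member(member_id, doc):
    """Write to the primary cluster and enqueue the change for the backup."""
    primary[member_id] = doc
    sync_queue.append((member_id, doc))

def run_sync_worker():
    """Drain the MQ and apply changes to the backup cluster (async in reality)."""
    while sync_queue:
        member_id, doc = sync_queue.popleft()
        backup[member_id] = doc

def read_member(member_id):
    """Serve reads from the primary; switch to the backup when it fails."""
    cluster = primary if primary_healthy else backup
    return cluster.get(member_id)
```

The same queue is what makes post-recovery resync possible: unconsumed messages replay against whichever cluster fell behind.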
2. Traffic isolation three‑cluster architecture
Requests are classified into two priority groups: user‑order‑critical and marketing‑driven. A dedicated ES cluster handles high‑TPS marketing traffic, isolating it from the primary cluster that serves order‑critical requests.
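The gateway-side routing decision reduces to a small lookup; cluster names and the traffic-class labels below are assumptions for illustration:

```python
# Hypothetical cluster names keyed by the traffic class set at the gateway.
CLUSTERS = {
    "order":     "es-primary",    # user-order-critical queries
    "marketing": "es-marketing",  # high-TPS marketing/cross-sell queries
}

def pick_cluster(traffic_class: str) -> str:
    # Unknown or unclassified callers default to the isolated marketing
    # cluster, so they can never overload the order-critical primary.
    return CLUSTERS.get(traffic_class, "es-marketing")
```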
3. Deep ES optimizations
Balanced shard distribution across nodes to avoid hotspot nodes.
Search thread-pool size capped at "CPU cores * 3 / 2 + 1" to prevent CPU spikes.
Each shard capped at around 50 GB of data.
Removed unnecessary text fields, keeping only keyword fields for member queries.
Used filter context instead of query context to skip relevance scoring.
Moved result sorting out of ES into the member service's JVM.
Added routing keys so each query hits only the shard that owns the data.
These tweaks dramatically reduced CPU usage and improved query latency.
Interface response times also dropped significantly.
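Two of the tweaks above can be made concrete. This sketch shows the thread-pool sizing formula and the shape of a routed, filter-context query (the `member_id` field name is an assumption; the dict mirrors ES query DSL):

```python
def search_thread_pool_size(cpu_cores: int) -> int:
    # The "CPU cores * 3 / 2 + 1" cap from the tuning list (integer math).
    return cpu_cores * 3 // 2 + 1

def member_query(member_id: str, shard_route: str) -> dict:
    # Filter context skips relevance scoring entirely, and the routing
    # value pins the query to the single shard that owns this member.
    return {
        "query": {"bool": {"filter": [{"term": {"member_id": member_id}}]}},
        "routing": shard_route,
    }
```

For example, a 16-core node would get a 25-thread search pool under this formula.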
Member Redis Cache Scheme
Initially the system avoided caching due to real‑time consistency concerns, but a high‑traffic ticket‑blind‑box event prompted the introduction of a Redis cache.
1. Solving ES‑induced cache inconsistency
Because ES refreshes are near-real-time (roughly a 1-second delay), a read racing an update could write stale data back to Redis. The fix takes a distributed lock with a 2-second TTL while updating ES and deletes the Redis entry; a read that finds the lock held knows an ES update is still settling and serves the ES result without writing it back.
Further analysis surfaced a second race between the cache delete and a concurrent write-back; making the two operations mutually exclusive under the same lock closed that gap as well.
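The lock-protected read/update paths can be sketched with in-memory dicts standing in for Redis, ES, and the lock store (the 2-second TTL matches the scheme above; everything else is illustrative):

```python
import time

cache, es, locks = {}, {}, {}   # stand-ins for Redis, ES, and the lock store

LOCK_TTL = 2.0  # seconds; covers ES's ~1 s near-real-time refresh delay

def _lock_held(key):
    expires = locks.get(key)
    return expires is not None and expires > time.monotonic()

def update_member(key, doc):
    """Update path: take the lock, write ES, drop the cache entry."""
    locks[key] = time.monotonic() + LOCK_TTL
    es[key] = doc
    cache.pop(key, None)

def read_member(key):
    """Read path: cache first; on a miss read ES, and only write the
    result back to Redis when no ES update is in flight (lock free)."""
    if key in cache:
        return cache[key]
    doc = es.get(key)
    if doc is not None and not _lock_held(key):
        cache[key] = doc   # safe to write back: ES has settled
    return doc
```

A production version would use an atomic Redis `SET NX PX` for the lock rather than a plain dict.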
After deployment, cache hit rate exceeded 90%, greatly relieving ES pressure.
2. Redis dual‑center multi‑cluster architecture
Two data centers each host a Redis cluster. Writes are performed to both clusters (dual‑write); reads are served locally to minimize latency. If one center fails, the other continues providing full service.
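A minimal sketch of the dual-write/local-read pattern, with dicts standing in for the two Redis clusters (which center is "local" is an assumption of the deployment):

```python
# In-memory stand-ins for the Redis clusters in the two data centers.
redis_a, redis_b = {}, {}
LOCAL = "A"   # which data center this service instance runs in

def cache_put(key, value):
    # Dual-write: both centers get every write so either can serve alone.
    for cluster in (redis_a, redis_b):
        cluster[key] = value

def cache_get(key):
    # Read locally for latency; fall through to the peer on a local miss
    # (or, in production, when the local cluster is down).
    local, peer = (redis_a, redis_b) if LOCAL == "A" else (redis_b, redis_a)
    try:
        return local[key]
    except KeyError:
        return peer.get(key)
```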
High‑Availability Member Primary DB Scheme
The original SQL Server reached physical limits after storing over a billion records, prompting migration to MySQL.
1. MySQL dual‑center partitioned cluster
Over 1,000 shards distribute the data, each holding roughly one million rows. The cluster uses a 1‑master‑3‑slave setup, with the master in data center A and slaves in B, synchronized via a dedicated link with sub‑millisecond latency. Reads are routed locally, writes go to the master.
Stress tests showed >20k TPS with average latency under 10 ms.
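Shard routing for a layout like this usually hashes the member id to a stable shard number; the exact shard count and naming below are assumptions (the article only says "over 1,000"):

```python
import zlib

SHARD_COUNT = 1024   # ">1,000 shards"; the exact count is an assumption

def shard_for(member_id: str) -> str:
    # A stable hash of the member id picks the physical shard, so a
    # billion rows settle at roughly a million per shard.
    n = zlib.crc32(member_id.encode()) % SHARD_COUNT
    return f"member_{n:04d}"
```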
2. Smooth migration strategy
The migration follows a "full sync → incremental sync → real‑time gray‑switch" approach. Dual‑write ensures new writes go to both databases; failures trigger retries and logging. Gray‑switch gradually shifts read traffic from SQL Server to MySQL, verifying result consistency at each step.
Full data sync occurs during low‑traffic windows, followed by dual‑write, incremental sync, and finally 100% traffic on MySQL.
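The dual-write and gray-switch read steps above can be sketched as follows; the gray percentage, store interfaces, and fallback-to-legacy choice on mismatch are illustrative assumptions:

```python
import logging
import random

GRAY_PERCENT = 10   # share of reads served by MySQL; raised step by step

def dual_write(key, value, sqlserver, mysql):
    """Write both stores; a MySQL failure is logged for retry, never fatal."""
    sqlserver[key] = value
    try:
        mysql[key] = value
    except Exception:
        logging.exception("mysql write failed, queued for retry: %s", key)

def gray_read(key, sqlserver, mysql):
    """Route a slice of reads to MySQL and verify against SQL Server."""
    if random.randrange(100) < GRAY_PERCENT:
        new, old = mysql.get(key), sqlserver.get(key)
        if new != old:
            logging.error("inconsistent row during gray switch: %s", key)
            return old          # trust the legacy store until verified
        return new
    return sqlserver.get(key)
```

Raising `GRAY_PERCENT` to 100 once the mismatch log stays empty is the final cut-over step.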
3. MySQL‑ES primary‑backup fallback
To guard against DAL component failures, data is also written to Elasticsearch. If MySQL or DAL fails, reads can be switched to ES, and once MySQL recovers, data is resynced and traffic switched back.
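The fallback read path is essentially a try/except around the DAL; the reader callables here are placeholders for the real MySQL and ES accessors:

```python
def read_with_fallback(member_id, mysql_read, es_read):
    """Serve from MySQL via the DAL; on failure, degrade to the ES copy."""
    try:
        return mysql_read(member_id)
    except Exception:
        # MySQL or the DAL is unavailable: the ES replica keeps reads alive.
        return es_read(member_id)
```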
Abnormal Member Relationship Governance
Complex concurrency bugs caused cross‑account bindings, leading to privacy breaches and financial losses. A thorough audit and deep code‑level safeguards were implemented to detect and prevent such anomalies.
Outlook: Finer‑Grained Flow Control and Degradation
1. Refined flow‑control strategies
Hotspot limiting for abusive accounts, per‑caller flow rules to curb buggy code, and global thresholds to protect the system under extreme traffic spikes.
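Per-caller limiting of this kind can be sketched as a fixed one-second window counter; the caller names and ceilings are illustrative, not the planned production rules:

```python
import time
from collections import defaultdict

# Per-caller QPS ceilings; "*" is the global default (numbers illustrative).
LIMITS = {"marketing-svc": 500, "order-svc": 2000, "*": 100}
_windows = defaultdict(lambda: [0.0, 0])   # caller -> [window_start, count]

def allow(caller: str, now: float = None) -> bool:
    """Fixed one-second window counter per caller: a hotspot or buggy
    caller hits its own ceiling without touching anyone else's quota."""
    now = time.monotonic() if now is None else now
    window = _windows[caller]
    if now - window[0] >= 1.0:
        window[0], window[1] = now, 0
    window[1] += 1
    return window[1] <= LIMITS.get(caller, LIMITS["*"])
```

A sliding window or token bucket smooths bursts better, but the isolation idea (each caller owns its own counter) is the same.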
2. Refined degradation strategies
Automatic degradation based on average response time and error rates of dependent services, with circuit‑breaker behavior when thresholds are exceeded.
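A minimal circuit breaker over those two signals might look like this; the thresholds and window size are illustrative, not the production values:

```python
class CircuitBreaker:
    """Trips when the rolling error rate or average response time of a
    dependency exceeds its threshold; callers then degrade immediately."""

    def __init__(self, max_error_rate=0.5, max_avg_rt_ms=200.0, window=100):
        self.max_error_rate, self.max_avg_rt_ms = max_error_rate, max_avg_rt_ms
        self.window, self.samples = window, []   # (ok: bool, rt_ms: float)

    def record(self, ok, rt_ms):
        # Keep only the most recent `window` calls to the dependency.
        self.samples.append((ok, rt_ms))
        self.samples = self.samples[-self.window:]

    def is_open(self):
        if not self.samples:
            return False
        errors = sum(1 for ok, _ in self.samples if not ok)
        avg_rt = sum(rt for _, rt in self.samples) / len(self.samples)
        return (errors / len(self.samples) > self.max_error_rate
                or avg_rt > self.max_avg_rt_ms)
```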
Future work includes comprehensive auditing of all member‑access accounts to ensure accurate flow‑control rule application.