High Availability Architecture for Membership Services
The membership service now employs a multi‑region, multi‑data‑center high‑availability architecture that integrates DNS‑based load balancing, automatic failover for MySQL, Redis, and RocketMQ, cross‑region network backup, unified monitoring, and an operations platform, ensuring seamless traffic switching and fault recovery without user impact.
Data-center network failures are common at internet companies. When one occurs, the whole IT team is pulled into traffic switching and handling customer complaints. To prevent such incidents, the company decided to build high availability (HA) into its services. The membership department works with the network and infrastructure teams on HA and has already implemented cross-region backup.
HA Solution Overview
Network layer: multi‑region, multi‑exit architecture with mutual backup and switching capabilities. The CDN is divided into four regions (North, Central, South, Overseas), each supporting major ISPs to ensure balanced traffic.
Application layer: independent deployment across multiple data centers, with at least two data centers backing each other up, in support of a "two sites, three centers" strategy.
Storage layer: multiple instances (MySQL/Redis) that can be switched directionally; faulty instances are automatically taken offline.
Messaging layer: standardizes on RocketMQ (RMQ), which supports HA and automatic failover, replacing ActiveMQ (AMQ).
Monitoring layer: monitors DNS, applications, databases, and other instances, providing alerts and automated data repair and switching tools.
Network Architecture Upgrade
The system network team developed a domain name and load‑balancer operation platform that monitors faults and switches to backup resources when anomalies occur. Membership services now have three independent outbound IPs that can be isolated and automatically switched at the network layer. DNS regions are split into North, Central, and South, each with dedicated outbound IPs to balance traffic. Automated operations enable traffic switching during fault drills.
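The region-to-exit switching described above can be sketched as follows. This is a minimal illustration, not the platform's actual logic: the `RegionExits` type, the `resolve` function, the backup ordering, and the documentation-range IP addresses are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RegionExits:
    """Outbound IPs for one DNS region (hypothetical structure)."""
    primary: str
    backups: list[str] = field(default_factory=list)


def resolve(region: str, exits: dict[str, RegionExits], healthy: set[str]) -> str:
    """Return the IP a region's DNS record should point at.

    Prefers the region's own exit and falls back to its backups in order.
    """
    entry = exits[region]
    for ip in [entry.primary, *entry.backups]:
        if ip in healthy:
            return ip
    raise RuntimeError(f"no healthy exit for region {region!r}")


# Illustrative addresses from the 192.0.2.0/24 documentation range.
EXITS = {
    "north": RegionExits("192.0.2.10", ["192.0.2.20", "192.0.2.30"]),
    "central": RegionExits("192.0.2.20", ["192.0.2.10", "192.0.2.30"]),
    "south": RegionExits("192.0.2.30", ["192.0.2.20", "192.0.2.10"]),
}
```

With all three exits healthy, each region resolves to its own IP; if the North exit is marked unhealthy, `resolve("north", ...)` returns the first healthy backup instead.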
IDC Internal Backup
Data centers are interconnected via dedicated lines. Applications are preferably deployed in the same data center as their upstream/downstream services to avoid consuming dedicated bandwidth. DNS periodically checks health; upon detecting a failure, it automatically switches to a backup data center. East‑west traffic within the internal network is large, so each service implements Nginx‑level rate limiting to protect stability. If traffic exceeds a data center’s capacity, rate limiting and alerts are triggered.
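To make the rate-limiting idea concrete, here is a small token-bucket sketch. The article's limiting is done in Nginx and its configuration is not shown in the source; the token bucket below is a stand-in technique chosen for illustration, and the `TokenBucket` class and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts of up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # bucket capacity
        self.clock = clock          # injectable clock, useful for tests
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if the request may pass, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A rejected request is the point where a real deployment would return an error to the caller and raise an alert, as the paragraph above describes.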
Application Service Upgrade
Single‑point workers are refactored to run in multiple IDC clusters, using the open‑source xxl‑job (vip‑job) framework for asynchronous tasks and scheduling. Core services (e.g., video playback) are deployed across as many IDC locations as possible, keeping traffic within the same data center to ensure quality. DNS configurations are optimized per region and ISP to provide the best user experience.
Database Upgrade
The database architecture follows a DNS+HA model. An HA-Master/HA-Agent monitoring and switching system was built on the Raft protocol. The agents run heartbeat checks against each database instance; when an instance crashes, the system triggers master-slave failover and removes the failed instance from DNS, removing the need for manual intervention and the data-loss risk that comes with it.
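The heartbeat-then-failover flow can be sketched roughly as below. This is an in-memory illustration under stated assumptions: the `Cluster` type, the `on_heartbeat` function, the instance names, and the threshold of three missed heartbeats are all hypothetical, and the Raft coordination between HA-Master nodes is not modeled at all.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MAX_MISSES = 3  # assumed threshold; the source does not state one


@dataclass
class Cluster:
    master: str
    replicas: list[str]
    dns_records: set[str]  # instances currently published in DNS
    misses: dict[str, int] = field(default_factory=dict)


def on_heartbeat(cluster: Cluster, instance: str, alive: bool) -> str | None:
    """Record one heartbeat result.

    Returns the name of the newly promoted master if a failover happened,
    otherwise None.
    """
    if alive:
        cluster.misses[instance] = 0
        return None
    cluster.misses[instance] = cluster.misses.get(instance, 0) + 1
    if cluster.misses[instance] < MAX_MISSES:
        return None
    # Instance is considered down: take it out of DNS first.
    cluster.dns_records.discard(instance)
    if instance == cluster.master and cluster.replicas:
        # Promote the first replica (a real system would pick the most up-to-date one).
        cluster.master = cluster.replicas.pop(0)
        return cluster.master
    if instance in cluster.replicas:
        cluster.replicas.remove(instance)
    return None
```

Requiring several consecutive misses before acting is a common way to avoid failing over on a single dropped packet; whether the actual system does this is not stated in the source.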
Message Middleware and Redis HA
Membership services use RocketMQ provided by the service cloud, with cross-data-center backup enabled. ActiveMQ and older RocketMQ versions that lack HA are replaced with upgraded versions that support mutual backup. Redis uses Sentinel for automatic master-slave failover.
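As a rough illustration of how Sentinel-style failover decisions work, the sketch below models the two agreement checks involved: enough Sentinels must report the master as down (the configured quorum), and a majority of all Sentinels must authorize the failover. This is a simplified model, not Sentinel's implementation; the function names and the Sentinel identifiers are hypothetical, and the deployment's real quorum setting is not given in the source.

```python
def is_objectively_down(reports: dict[str, bool], quorum: int) -> bool:
    """True when at least `quorum` Sentinels report the master as down.

    `reports` maps a Sentinel name to its own (subjective) view:
    True means that Sentinel believes the master is down.
    """
    return sum(1 for down in reports.values() if down) >= quorum


def failover_authorized(votes: int, total_sentinels: int) -> bool:
    """True when a strict majority of all Sentinels voted for the failover."""
    return votes >= total_sentinels // 2 + 1
```

For example, with three Sentinels and a quorum of 2, two "down" reports are enough to mark the master as down, and two votes are enough to authorize the failover.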
Operations Platform
An operations platform was developed to integrate DNS and virtual machine resource information, establishing monitoring metrics and tools that support business switching between data centers and daily operations.
Future Outlook
The HA solution for membership services will continue to be refined as network and computing infrastructure evolve. Future focus includes improving resource utilization, optimizing membership services, and ensuring fault recovery without user impact.
