How Tencent Achieved Zero‑Impact Disaster Recovery for Hundreds of Millions of Users
This article details Tencent's multi‑region disaster‑recovery architecture and rapid, user‑transparent scheduling techniques that enable seamless service continuity for QQ and Qzone across hundreds of millions of daily users, illustrated through real‑world drills and performance metrics.
Introduction
Tencent's Social Network Business Group operates over 100,000 servers distributed across multiple IDC locations nationwide, supporting massive user traffic. Complex multi‑IDC management can lead to sudden issues such as network outages, latency spikes, power failures, or fiber cuts, which, if not detected and resolved promptly, may degrade service quality or cause widespread login failures.
To ensure users experience zero impact during such incidents, a robust multi‑region disaster‑recovery architecture combined with fast, network‑wide scheduling capabilities is required.
This article examines the QQ Mobile and Qzone platforms, describing how they maintain uninterrupted service quality through large‑scale, time‑bound scheduling exercises.
Three‑region active‑active disaster recovery
Lightning‑fast scheduling capability
300 million‑user practical drills
Three‑Region Active‑Active Disaster Recovery
Providing high‑quality, controllable service to hundreds of millions of users demands that all development and operations focus on safeguarding user experience, especially during unforeseen events.
During the 2015 Tianjin explosion, Tencent's data center in Tianjin was within one kilometer of the blast site. By repeatedly scheduling and flexibly controlling traffic, the entire Tianjin user base was seamlessly shifted to Shenzhen without users noticing.
The current disaster‑recovery architecture distributes services across three regions (three‑active), employing set‑based deployment, balanced link distribution, and capacity planning to reduce risk.
QQ and Qzone have evolved from single‑site to dual‑site and finally to three‑site deployments, improving service quality and allowing users to connect to the nearest region.
QQ and Qzone user data are evenly distributed across three regions (1:1:1).
Normal load on a single site stays below 66%, so in a zero‑impact scenario a failed site's users can be split across the other two sites without pushing either past full capacity.
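To make the 66% rule concrete, here is a minimal sketch (not Tencent's actual tooling; site names and utilization figures are invented for illustration) that checks whether the two remaining sites can absorb an evacuated site's load:

```python
# Hypothetical capacity check for the three-region (1:1:1) deployment.
# Site names and utilization figures are made up for the example.

def can_evacuate(site_loads, failed_site):
    """Return True if the remaining sites can absorb the failed site's load.

    site_loads maps site -> current utilization as a fraction of capacity,
    assuming all sites have equal capacity and an even user split.
    """
    remaining = [s for s in site_loads if s != failed_site]
    shed = site_loads[failed_site] / len(remaining)  # split load evenly
    return all(site_loads[s] + shed < 1.0 for s in remaining)

loads = {"Shenzhen": 0.60, "Shanghai": 0.62, "Tianjin": 0.65}
print(can_evacuate(loads, "Tianjin"))  # True: 0.60+0.325 and 0.62+0.325 stay < 1.0
```

With every site held below 66%, the worst case (two survivors each absorbing half of a 66% load) tops out just under 99% utilization, which is what makes the threshold safe.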
For brevity, “dual‑platform” refers to the combined QQ + Qzone system unless otherwise specified.
Lightning‑Fast Scheduling Capability
Scheduling begins at the traffic entry point, with both platforms following the same approach.
1. Mobile QQ Access Layer
The front‑end sustains 259 million concurrent online users, connecting to hundreds of backend modules. The access layer consists of thousands of machines across dozens of IDC locations in three major cities, processing over 2 billion business packets per minute, 24/7.
Mobile QQ clients do not connect directly to the SSO service; instead, they pass through Tencent Gateway (TGW), an internally developed multi‑network unified access system offering high reliability, scalability, performance, and strong anti‑attack capabilities.
The QQ user login flow is illustrated below:
Qzone traffic primarily originates from Mobile QQ, enabling coordinated scheduling across both platforms.
2. Scheduling Mechanism
Scheduling intervenes at the user access point. A high‑level flow diagram is shown below:
Two main scheduling directions are employed:
Speed‑based scheduling:
Calculate optimal network paths for the entire network.
Real‑time intervention to route users onto optimal paths.
Fine‑grained control, e.g., scheduling by gateway IP.
Redirect scheduling:
Disable new client connections to the affected VIP (virtual IP).
Redirect users already logged in through the old VIP to the new VIP endpoint.
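To illustrate the speed‑based direction, here is a hypothetical sketch of choosing the best access region per gateway IP from latency probes; the region names, RTT samples, and the lowest‑median criterion are assumptions for the example, not Tencent's actual path‑selection algorithm:

```python
# Illustrative speed-based scheduling: for each client gateway IP, choose the
# access region with the lowest median measured round-trip time.

from statistics import median

def best_region(latency_samples):
    """latency_samples maps region -> list of RTT samples (ms) for one gateway IP."""
    return min(latency_samples, key=lambda region: median(latency_samples[region]))

samples = {
    "Shenzhen": [38, 41, 39],
    "Shanghai": [55, 52, 58],
    "Tianjin": [70, 68, 72],
}
print(best_region(samples))  # Shenzhen
```

Using the median rather than the mean keeps a single outlier probe from flipping the routing decision.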
When the backend is not a limiting factor, the system can migrate tens of millions of online users within ten minutes, with users remaining unaware of the transition.
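A toy sketch of how such a migration might be paced so redirects hit the backend at a stable rate (the batching scheme and parameters are assumptions, not the production mechanism):

```python
# Toy batching plan for a paced migration: split the user base into batches so
# that redirects arrive at roughly rate_per_min users per minute.

def plan_batches(total_users, rate_per_min, batch_interval_s=6):
    """Return per-batch user counts, one batch pushed every batch_interval_s."""
    batch = max(1, int(rate_per_min * batch_interval_s / 60))
    batches = []
    remaining = total_users
    while remaining > 0:
        n = min(batch, remaining)
        batches.append(n)
        remaining -= n
    return batches

# At 1,000,000 users/min, 10 million users fit in 100 six-second batches,
# i.e. ten minutes -- the order of magnitude quoted above.
print(len(plan_batches(10_000_000, 1_000_000)))  # 100
```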
Scheduling scenarios:
Normal distribution uses network‑quality‑based speed scheduling.
Emergency events employ rapid redirect scheduling.
Cross‑carrier scheduling (e.g., moving China Telecom users to China Unicom) is avoided unless absolutely necessary.
Scheduling operations:
Complete configuration within minutes and push calculations in real time.
Fully automated estimation of capacity changes across the three regions.
300 Million‑User Practical Drills
Two representative scenarios illustrate typical challenges faced by operations teams:
Scenario 1: During a severe storm, network‑exit latency spikes in a city, prompting a rapid migration of millions of users to another city while the scheduling system experiences a temporary failure, requiring manual intervention and flexible fallback.
Scenario 2: A traffic surge in City A overloads the region; an attempt to shift users to City B reveals a bottleneck in City B's link, risking overload if the migration proceeds unchecked.
These cases highlight the necessity of validated, real‑world capabilities rather than theoretical tools.
Why Conduct Live‑Network Drills?
Disaster‑recovery capacity and capacity planning are essential operational competencies; continuous drills turn theoretical capability into a dependable tool for critical moments.
Drill Planning
QQ’s massive scale (DAU ≈ 830 million) and complex service graph mean that a single node failure can affect a large user base. A detailed, closed‑loop drill process is required to mitigate risk.
The drill lifecycle consists of three phases:
Pre‑drill planning and preparation.
Drill execution and real‑time monitoring.
Post‑drill evaluation, quality assessment, and issue tracking.
Drill Objectives
Data collected from drills validates business quality and capacity, quantifies scheduling performance, and assesses platform readiness.
Business quality & capacity validation:
Confirm that three‑region capacity meets expectations.
Ensure load remains controllable when adding tens of millions of users.
Verify backend resilience under concentrated login bursts.
Check that flexible control behaves as designed.
Scheduling quantification:
Measure users migrated per minute during cross‑region scheduling.
Determine time required to move 10 million users.
Assess time to empty a city’s user base.
Validate stable, balanced scheduling rates.
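The quantification metrics above can be derived from per‑minute migration counts; the following is an illustrative calculation with a made‑up drill log, not actual drill data:

```python
# Illustrative computation of the scheduling-quantification metrics from a
# hypothetical log of users migrated per minute.

def scheduling_metrics(migrated_per_min, target=10_000_000):
    """Average migration rate and projected time to move `target` users."""
    rate = sum(migrated_per_min) / len(migrated_per_min)
    return {"avg_rate_per_min": rate, "minutes_for_target": target / rate}

drill_log = [900_000, 1_100_000, 1_000_000, 950_000, 1_050_000]
m = scheduling_metrics(drill_log)
print(m["minutes_for_target"])  # 10.0
```

Tracking the spread of per‑minute counts around the average is also how rate stability, one of the drill quality criteria, can be judged.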
Platform assessment:
Evaluate real‑time capacity, regional capacity, scheduling platform, and quality monitoring capabilities.
Identify platform shortcomings and use capacity metrics to gauge scheduling effectiveness.
Drill Outcomes
Monthly and quarterly drills are conducted, including peak‑traffic scenarios. Throughout all drills, users experience zero impact and no complaints, marking a first in the dual‑platform’s history. Drill scale progressed from 20 million to 40 million users, eventually emptying an entire city.
Complaint volume, monitored via the “Uranus” public‑opinion system, remained unchanged, confirming user‑impact‑free operation. Over nine drills, substantial data and standardized scheduling procedures were established, enabling a single operator to execute future drills.
Regularization:
Monthly drills handling 40–80 million users.
Joint network‑operations drills simulating city‑wide outages.
Drill Quality Evaluation
Quality is assessed along two dimensions:
Scheduling quality: efficiency, rate stability, and volume compliance.
Business quality: user complaints, backend service health, alarm generation, and load growth.
Closed‑Loop Issue Tracking
The goal of drills is to uncover problems, not to showcase capabilities. Every identified issue must be fully resolved to continuously improve architecture and operational competence.
Conclusion
In massive‑user, complex internet environments, achieving precise and rapid user scheduling is challenging. The nine real‑world drills have revealed optimization opportunities for the scheduling platform, architecture, and rate control. Success depends on a tightly integrated, closed‑loop process rather than a single powerful scheduler.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.