How Tencent Achieved Zero‑Impact Disaster Recovery for Hundreds of Millions of Users
This article details Tencent's multi‑region disaster‑recovery architecture and rapid, user‑transparent scheduling techniques that enable seamless service continuity for QQ and Qzone across hundreds of millions of daily users, illustrated through real‑world drills and performance metrics.
Introduction
Tencent's Social Network Business Group operates over 100,000 servers distributed across multiple IDC locations nationwide, supporting massive user traffic. Complex multi‑IDC management can lead to sudden issues such as network outages, latency spikes, power failures, or fiber cuts, which, if not detected and resolved promptly, may degrade service quality or cause widespread login failures.
To ensure users experience zero impact during such incidents, a robust multi‑region disaster‑recovery architecture combined with fast, network‑wide scheduling capabilities is required.
This article examines the QQ Mobile and Qzone platforms, describing how they maintain uninterrupted service quality through large‑scale, time‑bound scheduling exercises.
Three‑region active‑active disaster recovery
Lightning‑fast scheduling capability
300 million‑user practical drills
Three‑Region Active‑Active Disaster Recovery
Providing high‑quality, controllable service to hundreds of millions of users demands that all development and operations focus on safeguarding user experience, especially during unforeseen events.
During the 2015 Tianjin explosion, Tencent's data center in Tianjin was within one kilometer of the blast site. By repeatedly scheduling and flexibly controlling traffic, the entire Tianjin user base was seamlessly shifted to Shenzhen without users noticing.
The current disaster‑recovery architecture distributes services across three regions (three‑active), employing set‑based deployment, balanced link distribution, and capacity planning to reduce risk.
QQ and Qzone have evolved from single‑site to dual‑site and finally to three‑site deployments, improving service quality and allowing users to connect to the nearest region.
QQ and Qzone user data are evenly distributed across three regions (1:1:1).
Normal load on a single site stays below 66%, so in a zero‑impact scenario a failed site's users can be split across the other two sites without pushing either past full capacity.
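To make the 66% rule concrete, here is a minimal sketch (not Tencent's actual tooling; site names and utilization figures are invented for illustration) that checks whether the two remaining sites can absorb an evacuated site's load:

```python
# Hypothetical capacity check for the three-region (1:1:1) deployment.
# Site names and utilization figures are made up for the example.

def can_evacuate(site_loads, failed_site):
    """Return True if the remaining sites can absorb the failed site's load.

    site_loads maps site -> current utilization as a fraction of capacity,
    assuming all sites have equal capacity and an even user split.
    """
    remaining = [s for s in site_loads if s != failed_site]
    shed = site_loads[failed_site] / len(remaining)  # split load evenly
    return all(site_loads[s] + shed < 1.0 for s in remaining)

loads = {"Shenzhen": 0.60, "Shanghai": 0.62, "Tianjin": 0.65}
print(can_evacuate(loads, "Tianjin"))  # True: 0.60+0.325 and 0.62+0.325 stay < 1.0
```

With every site held below 66%, the worst case (two survivors each absorbing half of a 66% load) tops out just under 99% utilization, which is what makes the threshold safe.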
For brevity, “dual‑platform” refers to the combined QQ + Qzone system unless otherwise specified.
Lightning‑Fast Scheduling Capability
Scheduling begins at the traffic entry point, with both platforms following the same approach.
1. Mobile QQ Access Layer
The front‑end sustains 259 million concurrent online users, connecting to hundreds of backend modules. The access layer consists of thousands of machines across dozens of IDC locations in three major cities, processing over 2 billion business packets per minute, 24/7.
Mobile QQ clients do not connect directly to the SSO service; instead, they pass through Tencent Gateway (TGW), an internally developed multi‑network unified access system offering high reliability, scalability, performance, and strong anti‑attack capabilities.
The QQ user login flow is illustrated below:
Qzone traffic primarily originates from Mobile QQ, enabling coordinated scheduling across both platforms.
2. Scheduling Mechanism
Scheduling intervenes at the user access point. A high‑level flow diagram is shown below:
Two main scheduling directions are employed:
Speed‑based scheduling:
Calculate optimal network paths for the entire network.
Real‑time intervention to route users onto optimal paths.
Fine‑grained control, e.g., scheduling by gateway IP.
Redirect scheduling:
Disable new client connections to the affected VIP (virtual IP).
Redirect users already logged in through the old VIP to the new VIP endpoint.
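To illustrate the speed‑based direction, here is a hypothetical sketch of choosing the best access region per gateway IP from latency probes; the region names, RTT samples, and the lowest‑median criterion are assumptions for the example, not Tencent's actual path‑selection algorithm:

```python
# Illustrative speed-based scheduling: for each client gateway IP, choose the
# access region with the lowest median measured round-trip time.

from statistics import median

def best_region(latency_samples):
    """latency_samples maps region -> list of RTT samples (ms) for one gateway IP."""
    return min(latency_samples, key=lambda region: median(latency_samples[region]))

samples = {
    "Shenzhen": [38, 41, 39],
    "Shanghai": [55, 52, 58],
    "Tianjin": [70, 68, 72],
}
print(best_region(samples))  # Shenzhen
```

Using the median rather than the mean keeps a single outlier probe from flipping the routing decision.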
When the backend is not a limiting factor, the system can migrate tens of millions of online users within ten minutes, with users remaining unaware of the transition.
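A toy sketch of how such a migration might be paced so redirects hit the backend at a stable rate (the batching scheme and parameters are assumptions, not the production mechanism):

```python
# Toy batching plan for a paced migration: split the user base into batches so
# that redirects arrive at roughly rate_per_min users per minute.

def plan_batches(total_users, rate_per_min, batch_interval_s=6):
    """Return per-batch user counts, one batch pushed every batch_interval_s."""
    batch = max(1, int(rate_per_min * batch_interval_s / 60))
    batches = []
    remaining = total_users
    while remaining > 0:
        n = min(batch, remaining)
        batches.append(n)
        remaining -= n
    return batches

# At 1,000,000 users/min, 10 million users fit in 100 six-second batches,
# i.e. ten minutes -- the order of magnitude quoted above.
print(len(plan_batches(10_000_000, 1_000_000)))  # 100
```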
Scheduling scenarios:
Normal distribution uses network‑quality‑based speed scheduling.
Emergency events employ rapid redirect scheduling.
Cross‑carrier scheduling (e.g., moving China Telecom users to China Unicom) is avoided unless absolutely necessary.
Scheduling operations:
Complete configuration within minutes and push calculations in real time.
Fully automated estimation of capacity changes across the three regions.
300 Million‑User Practical Drills
Two representative scenarios illustrate typical challenges faced by operations teams:
Scenario 1: During a severe storm, network‑exit latency spikes in a city, prompting a rapid migration of millions of users to another city while the scheduling system experiences a temporary failure, requiring manual intervention and flexible fallback.
Scenario 2: A traffic surge in City A overloads the region; an attempt to shift users to City B reveals a bottleneck in City B's link, risking overload if the migration proceeds unchecked.
These cases highlight the necessity of validated, real‑world capabilities rather than theoretical tools.
Why Conduct Live‑Network Drills?
Disaster‑recovery capacity and capacity planning are essential operational competencies; continuous drills turn theoretical capability into a dependable tool for critical moments.
Drill Planning
QQ’s massive scale (DAU ≈ 830 million) and complex service graph mean that a single node failure can affect a large user base. A detailed, closed‑loop drill process is required to mitigate risk.
The drill lifecycle consists of three phases:
Pre‑drill planning and preparation.
Drill execution and real‑time monitoring.
Post‑drill evaluation, quality assessment, and issue tracking.
Drill Objectives
Data collected from drills validates business quality and capacity, quantifies scheduling performance, and assesses platform readiness.
Business quality & capacity validation:
Confirm that three‑region capacity meets expectations.
Ensure load remains controllable when adding tens of millions of users.
Verify backend resilience under concentrated login bursts.
Check that flexible control behaves as designed.
Scheduling quantification:
Measure users migrated per minute during cross‑region scheduling.
Determine time required to move 10 million users.
Assess time to empty a city’s user base.
Validate stable, balanced scheduling rates.
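The quantification metrics above can be derived from per‑minute migration counts; the following is an illustrative calculation with a made‑up drill log, not actual drill data:

```python
# Illustrative computation of the scheduling-quantification metrics from a
# hypothetical log of users migrated per minute.

def scheduling_metrics(migrated_per_min, target=10_000_000):
    """Average migration rate and projected time to move `target` users."""
    rate = sum(migrated_per_min) / len(migrated_per_min)
    return {"avg_rate_per_min": rate, "minutes_for_target": target / rate}

drill_log = [900_000, 1_100_000, 1_000_000, 950_000, 1_050_000]
m = scheduling_metrics(drill_log)
print(m["minutes_for_target"])  # 10.0
```

Tracking the spread of per‑minute counts around the average is also how rate stability, one of the drill quality criteria, can be judged.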
Platform assessment:
Evaluate real‑time capacity, regional capacity, scheduling platform, and quality monitoring capabilities.
Identify platform shortcomings and use capacity metrics to gauge scheduling effectiveness.
Drill Outcomes
Monthly and quarterly drills are conducted, including peak‑traffic scenarios. Throughout all drills, users experience zero impact and no complaints, marking a first in the dual‑platform’s history. Drill scale progressed from 20 million to 40 million users, eventually emptying an entire city.
Complaint volume, monitored via the “Uranus” public‑opinion system, remained unchanged, confirming user‑impact‑free operation. Over nine drills, substantial data and standardized scheduling procedures were established, enabling a single operator to execute future drills.
Regularization:
Monthly drills handling 40–80 million users.
Joint network‑operations drills simulating city‑wide outages.
Drill Quality Evaluation
Quality is assessed along two dimensions:
Scheduling quality: efficiency, rate stability, and volume compliance.
Business quality: user complaints, backend service health, alarm generation, and load growth.
Closed‑Loop Issue Tracking
The goal of drills is to uncover problems, not to showcase capabilities. Every identified issue must be fully resolved to continuously improve architecture and operational competence.
Conclusion
In massive‑user, complex internet environments, achieving precise and rapid user scheduling is challenging. The nine real‑world drills have revealed optimization opportunities for the scheduling platform, architecture, and rate control. Success depends on a tightly integrated, closed‑loop process rather than a single powerful scheduler.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.