How Tencent Migrated 200M QQ Users After a Tianjin Explosion
When a massive container explosion threatened Tencent's Tianjin data center, the operations team executed a 24-hour, nationwide user migration that moved more than 200 million QQ users to Shenzhen and Shanghai without service interruption, a showcase of disaster-recovery capability at unprecedented scale.
Full‑Scale Scheduling After the Tianjin Explosion
On August 12, 2015, at around 23:30, a series of container explosions occurred at a cargo terminal in Tianjin's Binhai New Area, just 1.5 km from Tencent's Tianjin data center, the largest in Asia with over 200,000 servers across 80,000 m².
The blast caused cooling system failures, water pipe bursts, and severe flooding, putting the data center at risk of immediate shutdown.
QQ, Qzone, Photo, Music, and other core social services were hosted in this center, serving over 200 million online users, with more than 30% of QQ traffic dependent on the Tianjin site.
Thanks to Tencent's nationwide cloud infrastructure, the team carried out a 24-hour emergency large-scale scheduling operation, seamlessly migrating users to Shenzhen and Shanghai with no perceptible service disruption while maintaining "four-nines" availability.
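For context, "four nines" means 99.99% availability, which leaves a downtime budget of roughly 53 minutes per year; a quick check of the arithmetic:

```python
# Downtime budget implied by "four nines" (99.99%) availability.
availability = 0.9999
minutes_per_year = 365.25 * 24 * 60  # ~525,960 minutes

budget = (1 - availability) * minutes_per_year
print(f"allowed downtime: {budget:.1f} minutes/year")  # ~52.6
```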
1. Initiation
The three‑site deployment (Shenzhen, Tianjin, Shanghai) provides sufficient redundancy to handle a single‑site disaster.
Immediately after the incident, the social operations team activated the major fault handling process, formed an emergency response team, and prepared to shift Tianjin users back to Shenzhen and Shanghai:
The emergency team assigned primary and backup leaders for each business line, coordinating across access, logic, and data layers.
A duty engineer coordinated communication within the team.
A senior incident manager kept directors, operations, QA and other groups informed in real time.
2. Scheduling
On August 13 the team began migrating users in batches of ten million back to Shenzhen.
During the peak hour (22:00), Shenzhen modules reached 80% of capacity; the team brought spare server resources online to expand capacity while gradually shifting load, keeping the load watermark within safe limits.
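A minimal sketch of what such watermark-driven batch scheduling could look like; the `Site` model, the per-site user figures, and the 80% ceiling here are illustrative assumptions, not Tencent's actual tooling:

```python
from dataclasses import dataclass

BATCH_SIZE = 10_000_000   # users moved per scheduling round
WATERMARK = 0.80          # load ceiling observed at the 22:00 peak

@dataclass
class Site:
    name: str
    online_users: int
    capacity: int  # maximum concurrent users the site can carry

    def load(self) -> float:
        return self.online_users / self.capacity

    def expand(self, extra: int) -> None:
        """Bring spare servers online to raise the capacity ceiling."""
        self.capacity += extra

def migrate(source: Site, target: Site) -> None:
    """Drain `source` in fixed batches, expanding `target` whenever
    the next batch would push it past the safe watermark."""
    while source.online_users > 0:
        batch = min(BATCH_SIZE, source.online_users)
        while (target.online_users + batch) / target.capacity > WATERMARK:
            target.expand(extra=BATCH_SIZE)
        source.online_users -= batch
        target.online_users += batch

# Illustrative numbers only; the article gives no per-site breakdown.
tianjin = Site("Tianjin", online_users=60_000_000, capacity=100_000_000)
shenzhen = Site("Shenzhen", online_users=120_000_000, capacity=160_000_000)
migrate(tianjin, shenzhen)
print(f"{shenzhen.name} load after migration: {shenzhen.load():.0%}")
```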
For modules that could not be expanded, service flexibility was applied by disabling non‑critical features such as loading contact remarks or retrieving roaming messages.
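This "service flexibility" is essentially graceful degradation; a minimal sketch, assuming a hypothetical flag table keyed by load level (the feature names come from the examples above, the thresholds are invented):

```python
# Graceful degradation: shed non-critical features as load climbs.
# Thresholds and flag names are illustrative, not Tencent's real config.
DEGRADATION_LEVELS = [
    (0.80, {"contact_remarks"}),                      # skip loading friend remarks
    (0.90, {"contact_remarks", "roaming_messages"}),  # also skip roaming history
]

def enabled_features(all_features: set, load: float) -> set:
    """Return the feature set to serve at the given load level;
    higher load levels disable a superset of features."""
    disabled = set()
    for threshold, features in DEGRADATION_LEVELS:
        if load >= threshold:
            disabled = features
    return all_features - disabled

features = {"messaging", "contact_remarks", "roaming_messages"}
print(enabled_features(features, load=0.85))
# -> {'messaging', 'roaming_messages'}: remarks off, core messaging intact
```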
These strategies ensured smooth user experience during the traffic surge.
Below is the mobile QQ user curve for the Tianjin region:
The chart shows the user count dropping to zero from the night of the 13th to the early morning of the 14th, then rising again as roughly 60% of Tianjin users were migrated back in the afternoon.
The detailed migration steps were:
At 01:30 on the 14th, all Tianjin users were fully migrated out; the data center’s online user count reached zero.
By the morning of the 14th, the Tianjin operations team reported that the site could operate stably, allowing 10 million Shenzhen users to be shifted back to Tianjin, relieving network pressure and restoring full functionality.
At noon on the 14th, the Tianjin site’s alerts were cleared and the center resumed normal operation with about 40 million users.
On the 20th, the explosion site was secured, the damaged infrastructure was repaired, and traffic returned to pre-incident levels.
Key Challenges of the Migration
Ensuring users perceive no data loss or added latency when switched between regions, which requires strict state and message consistency across the three data centers.
Guaranteeing that user status and messages are not lost during the transition.
Achieving full data consistency across multiple locations, handling reliability, real‑time synchronization, high‑volume distribution, and network variability.
Ensuring remote service capacity (servers, modules, IDC) can handle the influx, and defining fallback strategies when capacity is insufficient.
Rapidly scaling down capacity after peak periods.
Minimizing migration time; Tencent achieved switchovers on the order of seconds.
Providing fully automated, one‑click scheduling so a single engineer can control the entire operation.
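Taken together, these requirements describe a pipeline one engineer can trigger end to end. A minimal sketch of what such a one-click flow might orchestrate; every function here is a stand-in stage invented for illustration, not Tencent's actual system:

```python
BATCH_SIZE = 10_000_000

def check_capacity(targets: list, users: int) -> None:
    # Verify servers, modules, and IDC headroom; a real pipeline would
    # fall back to degradation strategies if this check fails.
    print(f"checking {targets} can absorb {users:,} users")

def redirect_batch(source: str, targets: list, batch: int) -> None:
    # Re-point the access layer so sign-ins land on the target sites.
    print(f"re-routing {batch:,} users from {source} to {targets}")

def verify_consistency(source: str, targets: list) -> None:
    # Confirm user status and messages survived the switch.
    print(f"verifying state/message sync between {source} and {targets}")

def one_click_migrate(source: str, targets: list, users: int) -> None:
    """Run the whole migration as a single automated pipeline."""
    check_capacity(targets, users)
    while users > 0:
        batch = min(BATCH_SIZE, users)
        redirect_batch(source, targets, batch)
        verify_consistency(source, targets)
        users -= batch
    print(f"scheduling post-peak scale-down of {targets}")

one_click_migrate("Tianjin", ["Shenzhen", "Shanghai"], users=60_000_000)
```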
About SET (Standardized Service Container)
SET is a standardized service module cluster that abstracts complex server interconnections into independent business deployment units.
Each SET aggregates one or more service modules and exposes two capacity metrics, peak concurrent users (PCU) and storage capacity (GB), along with its underlying hardware count and network bandwidth.
SETs are stateless, enabling horizontal scaling by adding more SET instances without affecting existing services.
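A minimal model of the SET abstraction as described above; the field names and the example figures are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class ServiceSet:
    """A standardized deployment unit exposing the two capacity
    metrics described above, plus its underlying footprint."""
    modules: list          # service modules aggregated in this SET
    pcu: int               # peak concurrent users the SET can carry
    storage_gb: int        # storage capacity
    servers: int           # underlying hardware count
    bandwidth_gbps: int    # network bandwidth

def sets_needed(target_pcu: int, unit: ServiceSet) -> int:
    """SETs scale horizontally: meet extra demand by adding whole
    units, leaving existing SETs untouched."""
    return -(-target_pcu // unit.pcu)  # ceiling division

qq_set = ServiceSet(modules=["access", "logic"], pcu=5_000_000,
                    storage_gb=2_000, servers=400, bandwidth_gbps=40)
print(sets_needed(60_000_000, qq_set))  # 12 SETs for 60M concurrent users
```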
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on the transformation of operations and hope to accompany you through your operations career, growing together.