
How QQ Built Multi‑Region Resilience with Set‑Based Deployment and Smart Scheduling

This article explains how QQ’s operations team designed a multi‑region, set‑based deployment architecture, tackled data synchronization, employed sharding strategies, and implemented flexible scheduling policies to ensure high availability and rapid disaster recovery for hundreds of millions of users.


1. QQ Multi‑Region Deployment Overview

To reduce risk after the Tianjin port explosion, QQ migrated 75 million users from a nearby IDC to data centers in Shanghai and Shenzhen, creating a three‑region deployment (Shanghai, Shenzhen, Tianjin) in which any two regions can absorb the full user load if the third fails.

The service runs 24×7, so changes require careful management: users are gradually shifted away before any major update to minimize impact. Because China’s network quality varies widely across carriers and regions, access points are distributed nationwide; this multi‑region deployment effort began in 2011.

Users are automatically balanced among the three regions based on real‑time network‑quality measurements, ensuring that any two regions can carry the entire QQ user base. Regular capacity stress tests push each region to roughly half of its maximum capacity.
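The balancing logic described above can be sketched as follows. This is an illustrative model, not QQ's actual scheduler: the region names, the 0.5 load threshold (any two regions must be able to absorb the full user base), and the measurement inputs are assumptions for the example.

```python
# Sketch of quality-based region balancing. REGIONS, the load threshold,
# and the RTT inputs are illustrative assumptions, not QQ internals.
REGIONS = ["shanghai", "shenzhen", "tianjin"]

def pick_region(measured_rtt_ms: dict, load: dict) -> str:
    """Choose the lowest-latency region that still has headroom.

    A region is skipped once its load ratio crosses 0.5, so that the
    remaining regions can always absorb the full user base.
    """
    candidates = [r for r in REGIONS if load.get(r, 0.0) < 0.5]
    if not candidates:  # degraded case: fall back to all regions
        candidates = REGIONS
    return min(candidates, key=lambda r: measured_rtt_ms.get(r, float("inf")))
```

At each login the client (or a scheduling service) would feed fresh latency probes into a function like this, so traffic drains away from an overloaded or slow region automatically.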

2. Set‑Based Deployment and Stateless Service Scheduling

Set‑based deployment breaks a large system into many small, self‑contained sets that contain complete logic and data, allowing independent scaling, shrinking, and near‑site scheduling.

QQ’s architecture consists of three layers: an access layer that connects users and routes messages, a logic layer that processes group communication, and a data layer that stores user profiles and other assets.

By assigning each user to a specific set based on network quality or policy, all of that user’s logic and data accesses stay within the same set, reducing cross‑region traffic. Only a small amount of data, such as profile and nickname updates, needs to be synchronized globally.
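Pinning a user to one set can be done with a stable hash, so the same user always lands in the same set and all downstream calls stay local to it. A minimal sketch, assuming invented set names and a simple hash-mod assignment (the real policy may also weigh network quality):

```python
import hashlib

# Illustrative set layout; names are invented for the example.
SETS_PER_REGION = {
    "shanghai": ["sh-set-1", "sh-set-2"],
    "shenzhen": ["sz-set-1", "sz-set-2"],
}

def assign_set(uin: int, region: str) -> str:
    """Pin a user to one self-contained set inside their region, so the
    access, logic, and data layers for that user all live in one set."""
    sets = SETS_PER_REGION[region]
    digest = hashlib.md5(str(uin).encode()).hexdigest()
    return sets[int(digest, 16) % len(sets)]
```

Because the mapping is deterministic, no lookup table is needed on the hot path; only a policy override (e.g. for migrations) would have to be stored.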

3. Data Layer Multi‑Site Deployment and Scheduling

QQ’s core data splits into relationship data (profiles, friend lists), which is read‑heavy and write‑light, and status data (online presence), which is both read‑heavy and write‑heavy; the latter generates massive synchronization traffic.

Data is stored in memory for low latency, backed by disk‑based binlog for durability. Servers with 128 GB RAM achieve around 30 k operations per second; the bottleneck is network bandwidth rather than storage.

Two sharding approaches are used:

Consistent hashing distributes data across nodes but complicates node‑specific troubleshooting and data migration.

Range (segment) partitioning assigns user ID ranges to specific shards, simplifying fault isolation and enabling hot‑expansion.
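The range approach is why fault isolation is simpler: a misbehaving shard maps directly to a known, contiguous block of user IDs. A minimal sketch of segment lookup, with invented boundaries and shard names:

```python
import bisect

# Illustrative segment table: each shard owns a contiguous UIN range.
# Boundaries and shard names are assumptions for the example.
SEGMENT_STARTS = [0, 1_000_000, 2_000_000, 3_000_000]
SHARDS = ["shard-a", "shard-b", "shard-c", "shard-d"]

def shard_for(uin: int) -> str:
    """Find the shard whose segment contains this user ID."""
    idx = bisect.bisect_right(SEGMENT_STARTS, uin) - 1
    return SHARDS[idx]
```

Contrast this with consistent hashing, where the users on a given node form no meaningful range, so tracing an incident back to an affected user population is harder.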

During expansion, configuration lines define which shards serve which data segments. New shards are added as replicas, synchronized, and then gradually promoted, with the ability to roll back if issues arise.
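A hypothetical configuration illustrating this flow (segment boundaries, shard names, and the line syntax are all invented for the example):

```
# before expansion: shard-b owns the whole 1M-3M segment
segment 1000000-2999999 -> shard-b

# during expansion: shard-e is attached as a replica of shard-b and
# synchronized; once caught up, half the segment is promoted to it
segment 1000000-1999999 -> shard-b
segment 2000000-2999999 -> shard-e   # roll back by restoring the old line
```

Because the promotion is just a configuration change, rolling back means restoring the previous mapping while the old shard still holds the data.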

4. Scheduling Strategies

Three scheduling modes handle user routing during incidents:

Mask scheduling: removes problematic IDC IPs from the IP list returned at the user’s next login.

Weak‑disable scheduling: pushes a fresh IP list and triggers a prompt re‑login.

Strong‑disable scheduling: forcibly logs users out to trigger an immediate re‑login.
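The three modes above differ only in how aggressively the client is pushed off the bad IPs. A sketch of the escalation, assuming invented action names and a simple IP-list protocol (not QQ's real client protocol):

```python
from enum import Enum

class ScheduleMode(Enum):
    MASK = "mask"               # bad IPs dropped at the next natural login
    WEAK_DISABLE = "weak"       # fresh IP list pushed, prompt re-login
    STRONG_DISABLE = "strong"   # client kicked, immediate re-login

def build_ip_list(all_ips: list, bad_ips: set) -> list:
    """Mask scheduling core: filter problematic IDC IPs out of the list."""
    return [ip for ip in all_ips if ip not in bad_ips]

def handle_incident(mode: ScheduleMode, all_ips: list, bad_ips: set) -> dict:
    """Map an incident-response mode to the action pushed to clients.
    Action names are illustrative assumptions."""
    ips = build_ip_list(all_ips, bad_ips)
    return {
        ScheduleMode.MASK: {"action": "none", "next_login_ips": ips},
        ScheduleMode.WEAK_DISABLE: {"action": "relogin", "ips": ips},
        ScheduleMode.STRONG_DISABLE: {"action": "kick", "ips": ips},
    }[mode]
```

Masking is the gentlest option (no user-visible disruption), while strong disable trades a brief reconnect for the fastest possible drain of a failing IDC.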

These mechanisms allow rapid response to events such as the Tianjin explosion or other disasters, ensuring continuous service with minimal user impact.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you through your operations career.
