How QQ Achieves Massive Multi‑Region Scheduling and Resilient Operations
This article details Tencent's QQ large‑scale scheduling architecture: multi‑site distribution, rapid dispatch mechanisms, cross‑region data synchronization, the ZhiYun automated operations platform, flexible services, overload protection, rapid scaling, and comprehensive monitoring. Together, these capabilities enable resilient, high‑performance social services.
Technical Architecture Behind Large‑Scale Scheduling
1. Multi‑site Distribution and Disaster Recovery
QQ and its services are deployed using a standardized SET model, where QQ numbers are partitioned into shards by a unit‑based consistent hashing algorithm. Core modules are decoupled into more than 100 components grouped into Access, Messaging, Status, and Sync centers, allowing flexible composition based on physical distribution and online capacity.
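The uin‑to‑shard mapping described above can be sketched with a toy consistent‑hash ring. The shard names, replica count, and hash function here are illustrative stand‑ins, not QQ's actual layout:

```python
import bisect
import hashlib

class UinShardRing:
    """Toy consistent-hash ring mapping QQ numbers (uins) to SET shards.
    Virtual replicas smooth out the key distribution across shards."""

    def __init__(self, shards, replicas=100):
        self.ring = []  # sorted list of (hash_point, shard) pairs
        for shard in shards:
            for i in range(replicas):
                point = self._hash(f"{shard}#{i}")
                bisect.insort(self.ring, (point, shard))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, uin):
        """Walk clockwise to the first ring point at or after the uin's hash."""
        h = self._hash(str(uin))
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = UinShardRing(["set-shenzhen-01", "set-tianjin-01", "set-shanghai-01"])
print(ring.shard_for(10001))
```

Because only the keys owned by a removed shard move, a failed SET's users can be redistributed without reshuffling the whole uin space.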
Based on this foundation, QQ core services are deployed across three regions (Shenzhen, Tianjin, Shanghai). When a single region fails, services can be quickly shifted to the other two regions.
Figure: QQ three‑site distribution diagram
QQ Space, Music, Album and other services also use a three‑layer SET deployment (PC SET, Mobile SET, Data SET). Each SET incorporates the three‑site disaster‑recovery concept, distributing users via GSLB and mobile connectivity so that each region can independently provide core social services.
Figure: QQ Space three‑site distribution diagram
The cross‑region SET access flow is:
End user resolves domain name to reach the access layer.
Request is sent to the access layer.
Access layer looks up the logical layer via internal name service.
Access layer contacts the logical layer.
Logical layer locates the data layer via internal name service and writes data.
Local data‑layer SET synchronizes to remote data‑layer SETs.
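The six steps above can be traced with a minimal in‑memory simulation. All class and key names here are hypothetical; the point is only the order of lookups and the local‑write‑then‑replicate pattern:

```python
class NameService:
    """In-memory stand-in for the internal name service (L5/CMLB)."""
    def __init__(self, table):
        self.table = table
    def lookup(self, name):
        return self.table[name]

class DataSet:
    """A data-layer SET: applies writes locally, then syncs remote SETs."""
    def __init__(self, region):
        self.region, self.store, self.peers = region, {}, []
    def write(self, key, value):
        self.store[key] = value          # step 5: local write
        for peer in self.peers:          # step 6: cross-region sync
            peer.store[key] = value

def access_flow(key, value, ns):
    access = ns.lookup("access-layer")   # steps 1-2: DNS, reach access layer
    logic = ns.lookup("logic-layer")     # steps 3-4: access -> logic layer
    data = ns.lookup("data-layer")       # step 5: logic layer finds data layer
    data.write(key, value)
    return access, logic

shenzhen, tianjin = DataSet("shenzhen"), DataSet("tianjin")
shenzhen.peers = [tianjin]
ns = NameService({"access-layer": "acc-sz",
                  "logic-layer": "logic-sz",
                  "data-layer": shenzhen})
access_flow("uin:10001", "profile", ns)
print(tianjin.store["uin:10001"])  # the write reached the remote SET
```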
Figure: Mobile QQ one‑click dispatch form
2. Rapid Dispatch Capability
Dispatch operates both externally and internally. External dispatch includes GSLB DNS‑based routing, QQ IP dispatch, and WNS (Wireless Network Service). Internal dispatch relies on the L5 and CMLB name services.
2.1 GSLB Global DNS Service
GSLB resolves user IP to geographic location and returns region‑specific IPs, enabling PC‑QQ scheduling based on user location.
2.2 QQ IP Dispatch
QQ can dispatch down to rack‑level precision. Real‑time calculations determine the optimal access point (city, IDC, network module, server IP). Within seconds, users are redirected to the selected data center.
2.3 WNS (Wireless Network Service)
WNS provides a high‑connectivity, highly reliable network channel for apps. It continuously optimizes dispatch algorithms using massive operational data, allowing mobile users to be routed to the nearest optimal IP and to trigger reconnections when needed.
2.4 L5 / CMLB
L5 (Level‑5 Load Balancer) and CMLB are internally developed components that combine name service, load balancing, fault tolerance, and overload protection. They assign weights to machines based on success rate and latency, then distribute traffic using efficient quota algorithms.
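The weighting idea can be illustrated with a toy scoring function: favor backends with high success rate and low latency, then pick proportionally to weight. The scoring formula and field names are assumptions for illustration; the real L5/CMLB quota algorithms are internal:

```python
import random

def weight(success_rate, latency_ms, latency_ref=50.0):
    """Toy score: quadratic reward for success rate, inverse of latency.
    Not the real L5/CMLB formula -- it only shows the shape of the idea."""
    return success_rate ** 2 * latency_ref / max(latency_ms, 1.0)

def pick(backends):
    """Weighted random choice; quota algorithms amortize this per window."""
    weights = [weight(b["ok"], b["lat"]) for b in backends]
    return random.choices(backends, weights=weights, k=1)[0]

backends = [
    {"ip": "10.0.0.1", "ok": 0.999, "lat": 20},
    {"ip": "10.0.0.2", "ok": 0.95,  "lat": 80},
    {"ip": "10.0.0.3", "ok": 0.50,  "lat": 300},  # degraded: little traffic
]
print(pick(backends)["ip"])
```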
2.5 One‑Click Dispatch
Engineers can trigger full‑network and internal dispatch with a single click on the internal automation platform “ZhiYun”, achieving smooth, user‑transparent migration of millions of users within 30 minutes.
3. Multi‑Region Data Synchronization
Beyond compute distribution, QQ faces the challenge of synchronizing storage across regions. Three mechanisms are used:
QQ status synchronization.
QQ DB master‑slave replication.
Synchronization center for QQ Space.
3.1 QQ Status Synchronization
All regions store full data, but user login is routed to the nearest IDC. Status information (online/offline, login devices, etc.) must be fully synchronized across regions within seconds. The system uses sync agents, de‑duplication, multi‑level SEQ/timestamp mechanisms, and TCP streams with loss tolerance and multi‑level flow control to guarantee consistency and reliability.
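The SEQ‑based de‑duplication can be sketched per region: each sync agent tracks the highest sequence number applied per uin and drops anything stale or duplicated. Field names here are illustrative; the real system layers multiple SEQ/timestamp levels and flow control on top:

```python
class StatusSyncAgent:
    """Sketch of a per-region status sync agent with SEQ-based dedup."""

    def __init__(self, region):
        self.region = region
        self.last_seq = {}   # uin -> highest SEQ applied so far
        self.status = {}     # uin -> latest status

    def apply(self, event):
        """Apply a status event unless it is a duplicate or out of date."""
        uin, seq = event["uin"], event["seq"]
        if seq <= self.last_seq.get(uin, -1):
            return False     # duplicate or stale: drop silently
        self.last_seq[uin] = seq
        self.status[uin] = event["status"]
        return True

agent = StatusSyncAgent("tianjin")
agent.apply({"uin": 10001, "seq": 1, "status": "online"})
agent.apply({"uin": 10001, "seq": 1, "status": "online"})   # dup ignored
agent.apply({"uin": 10001, "seq": 3, "status": "offline"})  # gaps tolerated
print(agent.status[10001])  # offline
```

Because replays are idempotent, the TCP stream can retransmit aggressively without risking double‑applied status changes.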
Figure: QQ status synchronization architecture
3.2 QQ DB Master‑Slave Replication
QQ DB uses the internally developed Grocery distributed KV store, employing a MySQL‑like master‑slave model with one‑master‑multiple‑slave configuration. Masters handle reads/writes; slaves provide read‑only access. Masters and slaves can be placed in the same IDC or across different regions, with sequence‑based logs ensuring consistency.
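The sequence‑log replication model can be sketched as follows. `GroceryNode` and its methods are hypothetical names for illustration; the point is that slaves replay the master's numbered log to converge:

```python
class GroceryNode:
    """Toy one-master-multiple-slaves KV with a sequence-numbered log."""

    def __init__(self):
        self.store = {}
        self.log = []   # entries: (seq, key, value), seq strictly increasing

    def write(self, key, value):
        """Master-only write path: log first, then apply."""
        seq = len(self.log)
        self.log.append((seq, key, value))
        self.store[key] = value
        return seq

    def catch_up(self, master, from_seq=0):
        """Slave pulls and replays log entries it has not applied yet."""
        for seq, key, value in master.log[from_seq:]:
            self.store[key] = value
        return len(master.log)  # next from_seq for the following round

master, slave = GroceryNode(), GroceryNode()
master.write("uin:1", "A")
master.write("uin:2", "B")
applied_up_to = slave.catch_up(master)
print(slave.store, applied_up_to)
```

Because replay is ordered by sequence number, the same scheme works whether the slave sits in the same IDC or in a remote region.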
3.3 Synchronization Center for QQ Space
QQ Space relies on the CKV distributed KV store. A message‑queue‑based synchronization center receives writes from applications and distributes them to regional sync readers, achieving sub‑second multi‑region data sync.
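The fan‑out pattern can be sketched as a single queue with per‑region read offsets: applications publish once, and each regional sync reader drains independently. Names and structure here are a simplified assumption, not the CKV sync center's actual design:

```python
from collections import deque

class SyncCenter:
    """Sketch of a message-queue sync center with per-region readers."""

    def __init__(self, regions):
        self.queue = deque()                     # ordered write events
        self.offsets = {r: 0 for r in regions}   # per-region read position
        self.stores = {r: {} for r in regions}   # per-region KV replicas

    def publish(self, key, value):
        """Applications write once; the center owns distribution."""
        self.queue.append((key, value))

    def drain(self, region):
        """A regional sync reader applies everything past its offset."""
        for key, value in list(self.queue)[self.offsets[region]:]:
            self.stores[region][key] = value
        self.offsets[region] = len(self.queue)

center = SyncCenter(["shenzhen", "tianjin", "shanghai"])
center.publish("feed:1", "new post")
for region in ("shenzhen", "tianjin", "shanghai"):
    center.drain(region)
print(center.stores["shanghai"]["feed:1"])
```

Decoupling publishers from regional readers lets a slow region lag without blocking writes, which is what makes sub‑second sync sustainable at scale.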
Operational Capabilities
1. ZhiYun Automated Operations Platform
ZhiYun automates configuration‑centric operations for thousands of business modules, enabling zero‑touch scaling and migration across the entire social network. Engineers define configurations and workflows; ZhiYun executes changes automatically, reducing a multi‑step manual process to a single click completed in about ten minutes.
2. Flexible Services and Overload Protection
Non‑core services can be disabled via configuration switches during capacity spikes or regional failures, preserving core functionality. Overload protection is built into components like L5, which uses request quality metrics to set access thresholds, throttles excess traffic, and drops timed‑out requests to prevent cascade failures.
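The two protections described, quality‑driven admission thresholds and dropping timed‑out requests, can be sketched together. The quota constants and back‑off factors are invented for illustration; only the mechanism mirrors the text:

```python
import time

class OverloadGuard:
    """Sketch of quality-based throttling: shrink the admit quota when
    success rate drops, and drop requests that already timed out."""

    def __init__(self, quota=100, timeout=0.5):
        self.quota, self.timeout = quota, timeout

    def adjust(self, success_rate):
        """Called per stats window with the measured success rate."""
        if success_rate < 0.9:
            self.quota = max(int(self.quota * 0.7), 1)   # back off hard
        else:
            self.quota = min(self.quota + 5, 100)        # recover slowly
        return self.quota

    def admit(self, enqueued_at, in_flight):
        """Reject requests that waited past their deadline or exceed quota."""
        if time.monotonic() - enqueued_at > self.timeout:
            return False   # caller already gave up: don't cascade the work
        return in_flight < self.quota

guard = OverloadGuard()
guard.adjust(0.5)
print(guard.quota)  # 70 -- quota shrinks after one bad window
```

Dropping already‑timed‑out work is the key anti‑cascade step: serving it would burn capacity on responses no caller is still waiting for.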
3. Rapid Scaling
Using the standardized SET deployment and ZhiYun, a SET comprising hundreds of servers can be fully provisioned, tested, and launched within ten minutes, covering OS installation, package deployment, and application rollout.
4. Three‑Dimensional Monitoring
The monitoring platform (e.g., “Tianwang”) provides real‑time capacity views across PC and mobile, enabling immediate awareness of system state before and after a dispatch. Business‑level “DLP” (death‑line) alerts trigger root‑cause analysis and fault localization within ten minutes.
5. Technical Assurance
Continuous load testing, pre‑drills, and joint business‑operation rehearsals uncover system bottlenecks and improve response speed. Detailed incident response plans involve a major‑incident manager, on‑call engineers, and QA, ensuring coordinated emergency actions and rapid communication across teams.
Efficient Ops
This public account is maintained by Xiaotianguo and friends.