How Tencent Scaled Mobile QQ: From 500K to 180M Users with Global Smart Scheduling
This article details how Tencent's Mobile QQ grew from half a million concurrent online users to more than 180 million. It covers three threads: optimizing mobile network access, evolving the backend from a single data center to a three-center architecture, and deploying a real-time global smart-scheduling system that cuts packet loss and login latency for domestic and overseas users alike.
1. Business Overview
Mobile QQ began in 2003, but real growth started after 2008 when the service reached 5 million online users; by 2009‑2010 it surpassed 10 million, and by 2013 it exceeded 100 million, achieving a 200‑fold increase within a few years thanks to 3G/4G networks and smartphone adoption.
2. Mobile Network Access Fault Cases
2.1 Chongqing Unicom 2G/3G Fault (Dec 2014)
Monitoring alarms showed packet loss for Chongqing Unicom users rising from the normal 1‑2% to 3‑4%, with some routes reaching 10%. Investigation traced the problem to a faulty gateway IP and a carrier‑side network cut‑over, and it was resolved after two adjustments.
2.2 Hong Kong CSL & New World Telecom Port Restrictions
In early 2011, login‑success rates for users of Hong Kong CSL and New World Telecom dropped below 70% because CSL blocked traffic on port 80 and New World blocked ports 80 and 8080 in certain regions. The issue was detected by analyzing client logs and mitigated by configuring port‑specific routing and global scheduling.
3. Backend Architecture and Deployment Optimization
3.1 2G Era (2004‑2010)
Two gateway types existed: CMNET (socket‑based, unrestricted) and CMWAP (WAP‑only). Applications accessed the backend through carrier‑approved whitelists, and mobile users had to cross networks to reach Telecom data centers, causing login latencies of 5‑7 seconds.
3.2 3G Era (2011‑2013)
Carrier socket restrictions were relaxed, allowing deployment in Mobile and Unicom data centers. This eliminated cross‑network packet loss (from 20‑40% down to ~1%) and reduced latency from ~100 ms to a few tens of milliseconds, saving billions of yuan in settlement fees.
3.3 4G Era (2014‑2017)
When online users exceeded 100 million, a three‑center architecture (Beijing, Shanghai, Shenzhen) with a 1:1:1 distribution was built to improve availability and reduce latency. The “Kepler” project (2015) completed this migration, and during the Tianjin port explosion the system seamlessly shifted 70 million users to other centers within an hour.
4. Global Smart Scheduling
4.1 Network Condition Statistics and Real‑time Intervention
By collecting billions of connection‑quality metrics, the system can automatically reroute traffic within five minutes when packet loss exceeds a threshold, with control granularity down to individual IPs and ports.
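The reroute decision can be sketched as a simple threshold check over per-route loss measurements. This is an illustrative minimal sketch, not Tencent's implementation; the route names and the 5% threshold are assumptions:

```python
LOSS_THRESHOLD = 0.05  # assumed: reroute when packet loss exceeds 5%

def choose_route(routes):
    """Pick the healthiest route given measured packet-loss ratios.

    routes: dict mapping route name -> packet-loss ratio (0.0-1.0).
    Prefers routes under the threshold; if every route is degraded,
    falls back to the least-bad one rather than dropping traffic.
    """
    healthy = {r: loss for r, loss in routes.items() if loss <= LOSS_THRESHOLD}
    pool = healthy or routes  # fall back if everything is over threshold
    return min(pool, key=pool.get)

# Example shaped like the Hainan case below: a 26%-loss local route
# versus a 3%-loss cross-region route.
print(choose_route({"local": 0.26, "cross_region": 0.03}))
```

In production the inputs would be the aggregated client-side quality reports described above, refreshed on the five-minute intervention cycle.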
4.2 Intelligent Scheduling Backend
The scheduling engine builds a dispatch database from massive user‑side feedback, then returns the optimal or secondary IP list for a given gateway IP, dramatically improving response speed.
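The build-then-lookup flow can be sketched as follows. The feedback tuple shape, function names, and fallback IP are assumptions for illustration only:

```python
from collections import defaultdict
from statistics import mean

def build_dispatch_db(feedback):
    """Build a dispatch table from client-side quality reports.

    feedback: iterable of (gateway_ip, server_ip, latency_ms) samples.
    Returns {gateway_ip: [server ips sorted best-first by mean latency]}.
    """
    samples = defaultdict(lambda: defaultdict(list))
    for gw, ip, latency in feedback:
        samples[gw][ip].append(latency)
    return {gw: sorted(ips, key=lambda ip: mean(ips[ip]))
            for gw, ips in samples.items()}

def dispatch(db, gateway_ip, fallback=("198.51.100.1",)):
    """Return the optimal-first IP list for a gateway IP (optimal + secondary)."""
    return db.get(gateway_ip, list(fallback))[:2]
```

A lookup keyed directly by the user's gateway IP avoids any per-request computation, which is what makes the response fast.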
4.3 Daily Packet‑Loss Intervention Results
In provinces with high loss (e.g., Hainan at 26%), automatic cross‑region scheduling reduced loss to about 3%.
4.4 Login Latency Comparison
Users affected by intelligent scheduling experienced an average login time of 1.9 seconds versus 8.6 seconds without intervention.
4.5 Overseas User Acceleration Points
Accelerators were deployed in Canada, US West Coast, Ireland, UK, Singapore, Japan, Brazil, and Australia to serve the 1‑2% overseas QQ user base, with the client automatically selecting the best endpoint based on real‑time quality.
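Client-side selection reduces to probing each accelerator and keeping the lowest round-trip time. A minimal sketch, in which the probe callable and the RTT figures are placeholders (a real client would issue a lightweight echo request per endpoint):

```python
def best_endpoint(endpoints, probe):
    """Return the accelerator with the lowest probed RTT.

    endpoints: iterable of endpoint identifiers.
    probe: callable mapping an endpoint to a measured RTT in ms.
    """
    return min(endpoints, key=probe)

# Hypothetical probe results for three of the acceleration points:
rtts = {"singapore": 45.0, "japan": 90.0, "us_west": 180.0}
print(best_endpoint(rtts, rtts.get))
```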
5. Mobile‑Side Network Performance Optimizations
5.1 Signaling Channel Pre‑activation
By sending heartbeat packets while the user is typing, the base station pre‑allocates signaling channels, reducing perceived message‑send latency from ~600 ms to ~400 ms (a ~33% improvement).
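In client terms, the trick is to piggyback a tiny keepalive on the typing event so the radio channel is already allocated when the user presses send. The connection interface below is an assumption, stubbed out so the flow is visible:

```python
class Connection:
    """Stand-in for the client's long-lived socket to the backend."""
    def __init__(self):
        self.sent = []

    def send(self, payload):
        self.sent.append(payload)

def on_typing_started(conn):
    # A minimal heartbeat: just enough traffic to make the base station
    # pre-allocate the signaling channel before the real message goes out.
    conn.send(b"\x00")

def on_send_pressed(conn, message):
    # By now the channel is active, so the message avoids the channel-setup
    # delay that would otherwise dominate perceived send latency.
    conn.send(message)
```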
5.2 IP Direct‑Connect (IP‑Express)
Domain resolution is replaced by an IP‑express service that bypasses DNS hijacking and selects the fastest IP for each user, currently used by over 30 services.
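The core idea is HTTPDNS-style resolution: fetch the IP list from a resolution service addressed by IP, so no local DNS resolver sits in the path to hijack the answer. A minimal sketch; the cache shape, function names, and example domain are assumptions:

```python
def resolve(domain, cache, httpdns_fetch):
    """Return a server IP for `domain` without touching local DNS.

    cache: dict mapping domain -> list of candidate IPs, pre-ranked
           best-first by measured quality.
    httpdns_fetch: callable that queries the resolution service by IP
           (so the request itself cannot be hijacked by a local resolver).
    """
    ips = cache.get(domain)
    if not ips:
        ips = httpdns_fetch(domain)
        cache[domain] = ips
    return ips[0]  # fastest IP first
```

Caching matters here: after the first fetch, connection setup skips name resolution entirely.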
5.3 Server‑Side Logical Aggregation for High Latency
When network latency is high, a server‑side proxy aggregates multiple client requests into a single backend call, reducing a 10‑second operation to 1‑2 seconds.
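Back-of-envelope arithmetic shows why this works; the RTT, processing time, and sub-request count below are assumptions chosen to match the article's 10 s → 1‑2 s observation:

```python
RTT_MS = 900      # assumed mobile round-trip time on a poor link
BACKEND_MS = 100  # assumed backend processing per sub-request
N = 10            # assumed sub-requests in one user-visible operation

# Client drives each sub-request itself: N full round trips.
sequential = N * (RTT_MS + BACKEND_MS)

# Server-side proxy aggregates: one round trip, then the proxy fans
# out to backends over the low-latency data-center network.
aggregated = RTT_MS + N * BACKEND_MS

print(sequential, aggregated)  # 10000 1900
```

The expensive leg (the mobile RTT) is paid once instead of N times, which is the whole benefit of moving the fan-out logic server-side.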
6. Summary
Backend Architecture & Deployment
- Eliminate cross‑network access to reduce packet loss and latency.
- Deploy in near‑carrier data centers for faster physical access and better disaster recovery.
- Scale from a single center to dual centers and finally a three‑center topology.

Global Smart Scheduling
- Leverage massive user‑side quality data for real‑time, network‑condition‑driven routing.
- For the small overseas user base (1‑2% of users), add acceleration points to improve access quality.

Mobile‑Side Network Optimizations
- Pre‑activate signaling channels to speed up message sending.
- Use IP‑express to avoid DNS hijacking and achieve optimal routing.
- Move latency‑sensitive logic to the server to reduce round trips.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles focused on operations transformation.