How Tencent Automated Operations for a Billion‑Red‑Packet Event
This article details Tencent's operation automation for the 2016 Chinese New Year QQ red‑packet activity, describing the massive traffic challenge, the architectural design, the shift from manual to CMDB‑driven one‑click scaling, load‑testing, flexible protection strategies, and on‑site monitoring that enabled rapid, reliable handling of billions of red‑packet transactions.
Preface
This article is based on a talk given at GOPS 2017 Shenzhen. It examines the operations technology behind the 2016 Chinese New Year QQ red‑packet activity, in which roughly 260 million online users generated 72.9 billion red‑packet clicks.
1. Activity Background
Operations teams face three kinds of major challenge: large‑scale events, major changes, and major incidents. The QQ red‑packet business comes in several product forms (instant, luck‑based, AR, and Qzone red packets), and the event required capacity planning on a very large scale.
Challenges
Two months before the event, product metrics projected a peak of 8 million QPS, requiring an estimated 20 000 virtual machines and 3 000 database servers. Supply chain delays further increased scaling pressure.
2. Activity Plan
2.1 Calendar
Preparation spanned two months: product strategy and activity design in November, resource procurement in December, and deployment before the New Year. Business scaling and stress testing occurred in mid‑January, with live operations starting on Chinese New Year's Eve.
2.2 Activity Breakdown
Complex service chains demanded thorough capacity assessment, including cross‑IDC network capacity, dedicated lines, and disaster‑recovery planning.
3. Scaling
3.1 Architecture of the "Shua Yi Shua" (刷一刷, "Swipe") Red Packet
The system consists of three core parts: lottery logic, messaging, and payment. Its key modules are:
Unified SSO access layer
Lottery main logic, routed with L5 consistent hashing
NoSQL storage (Grocery) for records
Gift delivery service
Public account notifications
Payment system with bank interfaces
CDN for resource delivery
The architecture separates stateless (access and logic) and stateful (data persistence) layers, both requiring scaling.
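The consistent-hash routing mentioned for the lottery logic can be illustrated with a small hash ring. This is a minimal sketch of the general technique, not Tencent's L5 implementation: the `HashRing` class, the node names, and the choice of 100 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (md5 is used only for spread)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` points for this node on the ring."""
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # keep the existing owner on the (very unlikely) collision
            bisect.insort(self._points, point)
            self._owner[point] = node

    def lookup(self, key):
        """Return the node owning the first ring point clockwise from the key."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["lottery-1", "lottery-2", "lottery-3"])
print(ring.lookup("user-10001"))
```

The property that matters for scaling is that adding a server only takes keys away from existing servers and hands them to the new one; keys never shuffle between the old servers.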
3.2 Stateless Service Auto‑Scaling
3.2.1 Traditional Scaling Process
Manual script‑based deployment involved OS installation, service deployment, configuration distribution, code release, permission management, testing, and monitoring, consuming about half an hour per device and extensive human effort.
3.2.2 Full Automatic Scaling
Using the ZiYun automation platform and a CMDB‑driven workflow, a module can be scaled out with a single one‑click "move to cloud" operation, and a hundred‑device expansion completes in roughly five minutes. The workflow runs attribute retrieval, package deployment, alarm silencing, self‑checks, gray release, and health‑report generation in sequence.
During the New Year period, more than 700 automatic scaling workflows were executed, expanding over 200 modules by an average of more than 100 devices each.
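The workflow steps listed above can be sketched as an ordered pipeline that stops at the first failure. This is only an illustration of the idea: every function name, the context dictionary, and the package name are hypothetical and do not reflect the ZiYun platform's actual interfaces.

```python
from typing import Callable, Dict, List, Tuple

# Each step takes a shared context dict and returns True on success.
Step = Tuple[str, Callable[[Dict], bool]]


def run_workflow(steps: List[Step], context: Dict) -> List[Tuple[str, bool]]:
    """Run steps in order and stop at the first failure."""
    report = []
    for name, action in steps:
        ok = action(context)
        report.append((name, ok))
        if not ok:
            break  # later steps (e.g. gray release) never run after a failure
    return report


def fetch_attributes(ctx: Dict) -> bool:
    # Hypothetical: a real workflow would query the CMDB for the module here.
    ctx["attributes"] = {"module": ctx["module"], "packages": ["lottery-logic"]}
    return True


def deploy_packages(ctx: Dict) -> bool:
    ctx["deployed"] = list(ctx["attributes"]["packages"])
    return True


def silence_alarms(ctx: Dict) -> bool:
    ctx["alarms_silenced"] = True
    return True


def self_check(ctx: Dict) -> bool:
    return bool(ctx.get("deployed"))


def gray_release(ctx: Dict) -> bool:
    ctx["traffic_percent"] = 1  # hypothetical: start with a small traffic share
    return True


def health_report(ctx: Dict) -> bool:
    ctx["report"] = f"{ctx['module']}: {len(ctx['devices'])} devices ready"
    return True


SCALE_OUT_STEPS: List[Step] = [
    ("attribute retrieval", fetch_attributes),
    ("package deployment", deploy_packages),
    ("alarm silencing", silence_alarms),
    ("self-check", self_check),
    ("gray release", gray_release),
    ("health report", health_report),
]

context = {"module": "lottery", "devices": ["10.0.0.1", "10.0.0.2"]}
for step_name, ok in run_workflow(SCALE_OUT_STEPS, context):
    print(step_name, "ok" if ok else "FAILED")
```

Keeping the steps as data rather than a hand-written script is what lets the same workflow be replayed for hundreds of modules without per-module editing.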
3.3 Stateful Layer Automatic Scaling
Within the stateful layer, access machines are scaled the same way as stateless services, while storage machines are scaled by bucket‑level migration. Records are hashed into buckets, and each bucket (≈1 GB) is moved to its target storage node while the routing table is updated in real time, so live traffic sees near‑zero impact.
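A toy model of bucket-level migration is sketched below. It assumes a fixed bucket count, CRC32 as the hash, and an in-memory routing table; the `Cluster` class and node names are invented for the example, and the sketch ignores writes that arrive while a bucket is being copied, which a production system has to handle.

```python
import zlib

NUM_BUCKETS = 16  # assumption: a small fixed bucket count for illustration


def bucket_of(key: str) -> int:
    """Hash a record key into one of the fixed buckets."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


class Cluster:
    """Hypothetical storage cluster: node -> bucket -> {key: value}."""

    def __init__(self, nodes):
        self.stores = {node: {} for node in nodes}
        # Routing table: which node currently serves each bucket.
        self.routing = {b: nodes[b % len(nodes)] for b in range(NUM_BUCKETS)}

    def put(self, key, value):
        b = bucket_of(key)
        self.stores[self.routing[b]].setdefault(b, {})[key] = value

    def get(self, key):
        b = bucket_of(key)
        return self.stores[self.routing[b]].get(b, {}).get(key)

    def add_node(self, node):
        self.stores.setdefault(node, {})

    def migrate_bucket(self, bucket, target):
        """Copy one bucket to the target, flip routing, then drop the old copy."""
        source = self.routing[bucket]
        if source == target:
            return
        self.stores[target][bucket] = dict(self.stores[source].get(bucket, {}))
        self.routing[bucket] = target  # readers switch to the new node here
        self.stores[source].pop(bucket, None)


cluster = Cluster(["store-a", "store-b"])
cluster.put("user-42", {"red_packets": 3})
cluster.add_node("store-c")
cluster.migrate_bucket(bucket_of("user-42"), "store-c")
print(cluster.get("user-42"))
```

Because keys map to buckets and only buckets map to nodes, adding capacity never rehashes individual records; it only changes a handful of routing entries.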
4. Load Testing and Drills
4.1 Capacity Evaluation
Capacity is assessed at two levels: IDC resources (power, cabinets, links) and server metrics (CPU, network, disk I/O, NIC traffic and packet rate). Business‑level QPS estimates then guide the scaling decisions.
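The step from a QPS estimate to a server count is simple arithmetic, sketched below. The 8 million QPS peak comes from the projection quoted earlier; the per-server throughput of 500 QPS and the 80% utilization ceiling are illustrative assumptions, not figures from the talk.

```python
def servers_needed(peak_qps: int, per_server_qps: int, max_utilization_pct: int) -> int:
    """Servers required so that no server exceeds the utilization ceiling at peak.

    Integer arithmetic is used so the ceiling division is exact.
    """
    if per_server_qps <= 0 or not 0 < max_utilization_pct <= 100:
        raise ValueError("per_server_qps must be positive and utilization in (0, 100]")
    usable_qps_x100 = per_server_qps * max_utilization_pct  # usable QPS per server, times 100
    return (peak_qps * 100 + usable_qps_x100 - 1) // usable_qps_x100


# Assumed inputs: 500 QPS per server, kept below 80% utilization.
print(servers_needed(8_000_000, 500, 80))
```

Leaving headroom below 100% utilization matters because the same servers must also absorb traffic shifted from another IDC during a drill or a failure.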
4.2 Pressure Testing
Multiple rounds of stress testing verify that the expanded capacity can sustain peak load. Live traffic is redirected to a single IDC so that latency, CPU utilization, and error rates can be observed under real pressure.
4.3 Drills
Simulated traffic shifts (e.g., moving 10 million Shenzhen users to Tianjin) test the system’s resilience under peak conditions.
5. Operations Strategy
5.1 Peak‑Shifting Deployment
Servers are dual‑purpose, running both red‑packet and Qzone services. Because the two workloads peak at different times, resources are reallocated to Qzone services once the event ends, maximizing utilization.
5.2 Flexible Protection
Layer‑specific overload safeguards include random request throttling based on user hash, throttling cash distribution, and selective message suppression. Policies are deployed instantly via a one‑click interface.
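Hash-based request throttling can be sketched in a few lines. The example assumes CRC32 over the user ID and a percentage knob that an operator would adjust; the function name and the knob are hypothetical, and the source does not describe the actual policy format.

```python
import zlib


def admit(user_id: str, admit_percent: int) -> bool:
    """Admit a request if the user's hash slot falls under the current threshold.

    The same user always lands in the same slot, so raising or lowering
    admit_percent adds or removes whole users rather than random requests.
    """
    if not 0 <= admit_percent <= 100:
        raise ValueError("admit_percent must be between 0 and 100")
    slot = zlib.crc32(user_id.encode("utf-8")) % 100
    return slot < admit_percent


# Under overload, an operator might drop the threshold from 100 to 30.
users = [f"user{i}" for i in range(10)]
print([u for u in users if admit(u, 30)])
```

Throttling by user hash rather than per request keeps the experience consistent: an admitted user keeps working normally, and a throttled user can be shown a friendly fallback instead of intermittent errors.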
6. On‑Site Operations
6.1 Monitoring
Operators monitor real‑time dashboards, expand overheated modules, and address hot‑key issues by quickly provisioning additional resources.
6.2 Hot‑Key Handling
Hot keys are mitigated by record sharding, length reduction, upgrading NICs to 10 Gbps, bucket redistribution, and front‑end caching.
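One way to read "record sharding" is to split a single hot record into several sub-keys so that writes spread across storage nodes and reads aggregate the parts. The sketch below shows that idea for a counter; the `ShardedCounter` class, the `#index` key suffix, and the in-memory dictionary standing in for the storage layer are all assumptions for illustration.

```python
import zlib
from collections import defaultdict


class ShardedCounter:
    """Hypothetical hot-key mitigation: one logical counter, several sub-keys."""

    def __init__(self, store, key: str, shards: int = 8):
        self.store = store
        self.key = key
        self.shards = shards

    def _sub_key(self, writer_id: str) -> str:
        # Each writer hashes to one sub-key, spreading load across shards.
        index = zlib.crc32(writer_id.encode("utf-8")) % self.shards
        return f"{self.key}#{index}"

    def incr(self, writer_id: str, amount: int = 1) -> None:
        self.store[self._sub_key(writer_id)] += amount

    def total(self) -> int:
        # Reads pay the cost of visiting every sub-key.
        return sum(self.store.get(f"{self.key}#{i}", 0) for i in range(self.shards))


store = defaultdict(int)  # stands in for the real key-value storage
counter = ShardedCounter(store, "red_packet_clicks")
for i in range(1000):
    counter.incr(f"user{i}")
print(counter.total())
```

The trade-off is that reads become more expensive, so this fits write-heavy hot keys; for read-heavy hot keys, the front-end caching mentioned above is usually the better fit.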
7. Review
The entire scaling effort for the red‑packet activity was completed in two days, freeing operations from manual scaling tasks and allowing deeper focus on business quality, cost, and speed.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
