How Tencent Automated Operations for a Billion‑Red‑Packet Event

This article describes how Tencent automated operations for the 2016 Chinese New Year QQ red‑packet activity. It covers the traffic challenge, the architecture, the shift from manual to CMDB‑driven one‑click scaling, load testing, flexible protection strategies, and the on‑site monitoring that together handled billions of red‑packet transactions quickly and reliably.

Efficient Ops

Preface

This article is based on a talk given at GOPS 2017 Shenzhen. It examines the operational technology behind the 2016 Chinese New Year QQ red‑packet activity, during which 260 million concurrent users generated 72.9 billion red‑packet clicks.

1. Activity Background

Operations teams face three major challenges: large‑scale events, big changes, and major incidents. The QQ red‑packet business spans several product forms (instant, luck‑based, AR, and Qzone "space" red packets), all of which required large‑scale capacity planning.

Challenges

Two months before the event, product metrics projected a peak of 8 million QPS, requiring an estimated 20,000 virtual machines and 3,000 database servers. Supply chain delays further increased scaling pressure.
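As a rough sanity check on these figures, dividing the projected peak by the planned fleet gives the average load each virtual machine would carry. The sketch below uses only the two numbers quoted above. The even spread is a simplifying assumption added here, since real modules differ widely in per‑request cost and a single request usually touches several modules.

```python
# Figures quoted in the text above.
PEAK_QPS = 8_000_000
PLANNED_VMS = 20_000


def average_qps_per_vm(peak_qps: int, vm_count: int) -> float:
    """Average load per VM if traffic were spread evenly (a simplification)."""
    return peak_qps / vm_count


print(average_qps_per_vm(PEAK_QPS, PLANNED_VMS))  # 400.0
```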

2. Activity Plan

2.1 Calendar

Preparation spanned two months: product strategy and activity design in November, resource procurement in December, and deployment before the New Year. Business scaling and stress testing occurred in mid‑January, with live operations starting on Chinese New Year's Eve.

2.2 Activity Breakdown

Complex service chains demanded thorough capacity assessment, including cross‑IDC network capacity, dedicated lines, and disaster‑recovery planning.

3. Scaling

3.1 Architecture of the "刷一刷" (Swipe) Red Packet

The system consists of three core components: lottery logic, messaging, and payment. Its key modules are:

Unified SSO access layer

Lottery main logic, routed by consistent hashing through L5

NoSQL storage (Grocery) for records

Gift delivery service

Public account messaging for notifications

Payment system with bank interfaces

CDN for resource delivery

The architecture separates stateless (access and logic) and stateful (data persistence) layers, both requiring scaling.
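The hash‑consistent routing of lottery requests can be illustrated with a generic consistent‑hash ring. The article does not describe L5's internals, so this is only a sketch of the general technique. The node names, the virtual‑node count, and the use of MD5 are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring; not L5's actual implementation."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def route(self, user_id: str) -> str:
        """Pick the first ring point clockwise from the user's hash."""
        idx = bisect.bisect(self._points, self._hash(user_id)) % len(self._points)
        return self._ring[idx][1]


# Hypothetical lottery-logic nodes.
ring = HashRing(["lottery-1", "lottery-2", "lottery-3"])
print(ring.route("user-42") == ring.route("user-42"))  # True: routing is stable
```

The property that matters for scaling is that adding a node only moves the keys that now belong to the new node. Every other user keeps hitting the same lottery instance.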

3.2 Stateless Service Auto‑Scaling

3.2.1 Traditional Scaling Process

Manual script‑based deployment involved OS installation, service deployment, configuration distribution, code release, permission management, testing, and monitoring, consuming about half an hour per device and extensive human effort.

3.2.2 Full Automatic Scaling

Using the ZiYun automation platform and a CMDB‑driven workflow, modules can be scaled with a one‑click “cloud‑on” operation, completing a hundred‑device expansion in roughly five minutes. The process includes attribute retrieval, package deployment, alarm silencing, self‑check, gray release, and health‑report generation.

During the New Year period, more than 700 automatic scaling workflows were executed, expanding more than 200 modules, with an average of over 100 devices per module.
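The one‑click workflow can be pictured as an ordered list of steps run against a module's hosts. The sketch below only mirrors the step order given above. The step names, the `ScaleJob` type, and the placeholder step bodies are hypothetical and are not the platform's real API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ScaleJob:
    module: str
    hosts: List[str]
    log: List[str] = field(default_factory=list)


def step(name: str) -> Callable[[ScaleJob], None]:
    """Build a placeholder step that only records its name.

    A real step would call the automation platform and raise on failure.
    """
    def run(job: ScaleJob) -> None:
        job.log.append(name)
    return run


# Order follows the process listed in the text; names are hypothetical.
WORKFLOW = [step(n) for n in (
    "fetch_cmdb_attributes",
    "deploy_packages",
    "silence_alarms",
    "self_check",
    "gray_release",
    "health_report",
)]


def scale_out(module: str, hosts: List[str]) -> ScaleJob:
    job = ScaleJob(module, list(hosts))
    for run in WORKFLOW:
        run(job)  # an exception here stops the workflow before later steps
    return job


job = scale_out("lottery-logic", ["10.0.0.1", "10.0.0.2"])
print(job.log)
```

Keeping the steps as data means the same workflow can be reused for every module, with the CMDB record supplying the per‑module attributes.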

3.3 Stateful Layer Automatic Scaling

Within the stateful layer, access machines scale the same way as stateless services, while storage machines scale through bucket‑level migration. Records are hashed into buckets of about 1 GB each. A bucket is moved to its target storage node and the routing table is updated in real time, so live traffic sees almost no impact.
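A toy model of bucket‑level migration is sketched below: keys hash to buckets, a routing table maps each bucket to a node, and migration copies the bucket before flipping the route. The class, the bucket count, and the CRC32 hash are assumptions for illustration. The article does not describe the storage system's real data structures.

```python
import zlib

NUM_BUCKETS = 8  # tiny for illustration; real buckets hold about 1 GB each


def bucket_of(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_BUCKETS


class Cluster:
    def __init__(self, nodes):
        self.data = {n: {} for n in nodes}  # node -> {bucket -> {key: value}}
        self.routing = {b: nodes[b % len(nodes)] for b in range(NUM_BUCKETS)}

    def put(self, key, value):
        b = bucket_of(key)
        self.data[self.routing[b]].setdefault(b, {})[key] = value

    def get(self, key):
        b = bucket_of(key)
        return self.data[self.routing[b]].get(b, {}).get(key)

    def migrate(self, bucket: int, target: str) -> None:
        source = self.routing[bucket]
        if source == target:
            return
        # 1. Copy the bucket to the target node.
        copied = dict(self.data[source].get(bucket, {}))
        self.data.setdefault(target, {})[bucket] = copied
        # 2. Flip the route so new requests go to the target.
        self.routing[bucket] = target
        # 3. Drop the source copy once nothing routes to it.
        self.data[source].pop(bucket, None)


cluster = Cluster(["store-1", "store-2"])
cluster.put("user-42", "won 8.88")
cluster.migrate(bucket_of("user-42"), "store-3")
print(cluster.get("user-42"))  # still readable after the move
```

This single‑threaded sketch ignores writes that arrive during the copy. A production system would need to handle them, for example by replaying or briefly forwarding them.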

4. Load Testing and Drills

4.1 Capacity Evaluation

Capacity is assessed across IDC resources (power, cabinets, links) and server metrics (CPU, network, disk I/O, NIC traffic and packet rate). Business‑level QPS estimates guide scaling decisions.
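Two small calculations capture the spirit of this assessment: converting a QPS estimate into a machine count with some headroom, and finding which server metric is closest to its limit. All thresholds and sample numbers below are hypothetical, and the 70% target utilization is an assumption rather than a figure from the talk.

```python
import math


def machines_needed(peak_qps: float, per_machine_qps: float,
                    target_utilization: float = 0.7) -> int:
    """Machines required so that each runs at or below the target utilization."""
    return math.ceil(peak_qps / (per_machine_qps * target_utilization))


def bottleneck(usage: dict, limits: dict) -> str:
    """Return the metric whose usage is closest to (or furthest past) its limit."""
    return max(usage, key=lambda metric: usage[metric] / limits[metric])


# Hypothetical per-server readings and limits.
usage = {"cpu": 0.45, "nic_mbps": 800, "disk_iops": 1200}
limits = {"cpu": 1.0, "nic_mbps": 1000, "disk_iops": 5000}

print(machines_needed(100_000, 500, target_utilization=0.5))  # 400
print(bottleneck(usage, limits))  # nic_mbps
```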

4.2 Pressure Testing

Multiple stress tests validate whether expanded capacity can sustain peak loads. Real‑time traffic is redirected to a single IDC to observe latency, CPU utilization, and error rates.
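One way to read the results of such a test is to step up the share of traffic sent to the IDC under test and stop at the first step where latency, CPU, or error rate crosses a limit. The limits and observations below are invented for illustration and are not measurements from the event.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    p99_ms: float
    cpu: float         # 0..1
    error_rate: float  # 0..1


# Hypothetical abort thresholds.
LIMITS = Sample(p99_ms=200.0, cpu=0.8, error_rate=0.001)


def within_limits(s: Sample, limits: Sample = LIMITS) -> bool:
    return (s.p99_ms <= limits.p99_ms
            and s.cpu <= limits.cpu
            and s.error_rate <= limits.error_rate)


def max_safe_share(observations) -> float:
    """Highest traffic share that passed before the first failing step.

    `observations` is a list of (traffic_share, Sample) in increasing share order.
    """
    safe = 0.0
    for share, sample in observations:
        if not within_limits(sample):
            break
        safe = share
    return safe


observations = [
    (0.25, Sample(p99_ms=80, cpu=0.35, error_rate=0.0)),
    (0.50, Sample(p99_ms=120, cpu=0.60, error_rate=0.0002)),
    (0.75, Sample(p99_ms=260, cpu=0.85, error_rate=0.004)),
]
print(max_safe_share(observations))  # 0.5
```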

4.3 Drills

Simulated traffic shifts (e.g., moving 10 million Shenzhen users to Tianjin) test the system’s resilience under peak conditions.

5. Operations Strategy

5.1 Peak‑Shifting Deployment

Servers are dual‑purpose, running both red‑packet and Qzone ("space") services. After the event, resources are reallocated to the Qzone services, maximizing utilization.

5.2 Flexible Protection

Layer‑specific overload safeguards include random request throttling based on user hash, throttling cash distribution, and selective message suppression. Policies are deployed instantly via a one‑click interface.
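Throttling by user hash can be sketched in a few lines: hash the user ID into 100 slots and admit only the slots below the current percentage. The function name and the choice of CRC32 are assumptions. The article states only that throttling is random and based on a user hash.

```python
import zlib


def admit(user_id: str, admit_percent: int) -> bool:
    """Admit a stable subset of users.

    The same user always gets the same decision at a given percentage,
    and raising the percentage never rejects a previously admitted user.
    """
    return zlib.crc32(user_id.encode()) % 100 < admit_percent


# Under overload, an operator might lower the percentage from 100 to 30.
for uid in ("user-1", "user-2", "user-3"):
    print(uid, admit(uid, 30))
```

Hashing rather than drawing a random number per request keeps the experience consistent for each user: a rejected user is not admitted on one retry and rejected on the next.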

6. On‑Site Operations

6.1 Monitoring

Operators monitor real‑time dashboards, expand overheated modules, and address hot‑key issues by quickly provisioning additional resources.

6.2 Hot‑Key Handling

Hot keys are mitigated by sharding records, shortening records, upgrading NICs to 10 Gbps, redistributing buckets, and caching at the front end.
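Of these mitigations, front‑end caching is the easiest to show in code. The sketch below is a minimal time‑to‑live cache placed in front of a backend lookup, so repeated reads of a hot key within the TTL never reach storage. The class name, the one‑second TTL, and the loader callback are assumptions for illustration.

```python
import time


class TTLCache:
    """Tiny read-through cache with per-entry expiry."""

    def __init__(self, loader, ttl_seconds: float = 1.0, clock=time.monotonic):
        self._loader = loader  # called on a miss to fetch from the backend
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}     # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]    # fresh hit: the backend is not touched
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


backend_reads = []


def load_from_storage(key):
    backend_reads.append(key)  # stands in for a storage-node read
    return f"value-of-{key}"


cache = TTLCache(load_from_storage, ttl_seconds=1.0)
for _ in range(1000):
    cache.get("hot-key")
print(len(backend_reads))  # far fewer than 1000 backend reads
```

The trade‑off is staleness: a cached record can lag the stored one by up to the TTL, which is usually acceptable for read‑mostly data such as activity configuration.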

7. Review

The entire scaling effort for the red‑packet activity was completed in two days, freeing operations from manual scaling tasks and allowing deeper focus on business quality, cost, and speed.
