How We Scaled a New Year Gala Shake‑to‑Win System to 10 Million Requests per Second
This article revisits the technical design and evolution of the 2015 Chinese New Year Gala "shake" activity, detailing how the backend architecture was progressively refined, from a simple prototype to a production-grade system able to sustain a peak of 10 million requests per second, through resource pre-download, access-layer integration, load balancing, and robust fail-over mechanisms.
This is a technical retrospective of the 2015 Chinese New Year Gala "shake" activity, revisited after two years to share architectural lessons for similar large‑scale events.
Spring Festival Gala Shake Activity
The activity reused the existing shake entry point but added new interface pages such as celebrity greetings, family photos, friend cards, rest pages, and special "server hang" pages. Users could also win red packets (seed packets that spawn split packets when shared with friends).
V0.1 Prototype System
The prototype implemented the core requirements with a simple architecture.
Processing flow:
Client shakes phone, sending a request to the access service, which forwards it to the shake service.
The shake service consults the current program flow and returns a result (celebrity greeting, red packet, etc.).
If a red packet is selected, the client downloads sponsor assets from CDN and displays the packet.
When the user opens the packet, the request goes through the red‑packet system, payment system, and finally the Tenpay system.
Users can share packets via the messaging system.
A security subsystem protects the business logic.
The data flow includes resource, information, business, and financial streams, with the article focusing on resource and information flows.
Challenges
Massive user requests: estimated peak of 10 million requests per second.
Uncertain factors during the live show (changing program order, duration, etc.).
Deep customization: a heavily customized system that would run live for only a few hours, leaving no room for a gradual rollout.
National attention: hundreds of millions of viewers expect flawless operation.
Lack of historical experience for such scale.
Optimization Targets
Bandwidth: peak demand of 3 Tb/s for multimedia resources.
Access quality: ensuring stable external network access for an estimated 350 million concurrent users.
Massive request handling: supporting two 10 million‑per‑second pipelines (external and internal).
V0.5 Test Version
The goal was to address the prototype’s shortcomings.
Resource Pre‑download
Static resources are pushed to CDN and pre‑downloaded to clients days in advance, eliminating real‑time bandwidth pressure.
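As a rough illustration of how this pre-download stage avoids a thundering herd on the CDN, the sketch below spreads fetches randomly across a pre-download window; it is our own assumption, not the production client code, and the Resource type, URL, and window length are hypothetical.

```go
// Minimal sketch of staggered resource pre-download, assuming the client
// receives a manifest of CDN URLs days before the show. Names and the
// window length are illustrative, not the real client implementation.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Resource is one static asset pushed to the CDN ahead of time.
type Resource struct {
	URL    string
	SHA256 string // integrity check before the asset is cached locally
}

// preDownload spreads fetches randomly across the window so that millions
// of clients do not hit the CDN at the same instant.
func preDownload(resources []Resource, window time.Duration) {
	for _, r := range resources {
		r := r // copy for the closure (needed before Go 1.22)
		delay := time.Duration(rand.Int63n(int64(window)))
		time.AfterFunc(delay, func() {
			// fetch r.URL, verify r.SHA256, store in the local cache (omitted)
			fmt.Println("prefetching", r.URL)
		})
	}
}

func main() {
	manifest := []Resource{{URL: "https://cdn.example.com/sponsor-logo.png"}}
	preDownload(manifest, time.Second) // the real window would be hours or days
	time.Sleep(2 * time.Second)        // keep the sketch alive long enough to fire
}
```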
External Network Access
Redundant deployment across the Shanghai and Shenzhen IDCs, each with nine TGW clusters and three carrier lines (Telecom, Mobile, Unicom). A total of 638 access servers provide capacity for up to 1.46 billion simultaneous online connections.
Embedding Shake Logic into Access Service
To avoid forwarding 10 million requests per second from the access service to a separate shake service, the shake logic was merged into the access service itself, while keeping the access service stable for its other core functions.
The access service consists of a network‑IO module (handling TCP connections) and an access‑logic module (processing requests). The shake logic runs as an embedded component within the logic module, while a separate shake‑agent handles more complex tasks via shared memory.
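To make the split concrete, here is a minimal sketch, assuming a simple command dispatcher inside the access-logic module; the Request/Reply types and shakeLogic function are hypothetical stand-ins, and the shared-memory channel to the shake-agent is not modeled.

```go
// Hedged sketch of the access-service split described above: the network-IO
// layer owns the TCP connections, the access-logic layer routes requests,
// and the shake logic runs in-process instead of being forwarded to a
// separate shake service. Heavier work would go to a shake-agent over
// shared memory (not shown). All names are illustrative.
package main

import "fmt"

type Request struct {
	Cmd    string // "shake", "sync", ...
	UserID string
}

type Reply struct {
	Body string
}

// shakeLogic is the embedded component inside the access-logic module:
// a purely local decision, with no network hop to a shake service.
func shakeLogic(req Request) Reply {
	return Reply{Body: "celebrity greeting for " + req.UserID}
}

// accessLogic dispatches by command; only the shake path is sketched here.
func accessLogic(req Request) Reply {
	switch req.Cmd {
	case "shake":
		return shakeLogic(req)
	default:
		return Reply{Body: "handled by other core access functions"}
	}
}

func main() {
	// the network-IO module would have parsed this off a TCP connection
	fmt.Println(accessLogic(Request{Cmd: "shake", UserID: "u1"}))
}
```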
Red Packet Distribution
Seed red‑packet files are pre‑deployed on access servers, split per machine, and merged for verification. A cookie‑based mechanism records per‑user limits (max 3 packets per user, 1 per sponsor) without backend storage. Additional anti‑cheat measures include server‑side aggregation of limit data and a “guerrilla” strategy to mitigate rapid reconnections.
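The storage-free limit can be pictured as the client carrying back a tamper-evident record of its wins. The sketch below is our own illustration of that idea, assuming an HMAC-signed cookie; the field layout, key handling, and limit checks are hypothetical, not the production mechanism.

```go
// Hedged sketch of a cookie-based limit: the client presents a small,
// HMAC-signed record of what it has already won, so the access server can
// enforce "at most 3 packets per user, 1 per sponsor" without a backend
// store. Field names and key handling are assumptions for illustration.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

var key = []byte("demo-secret") // in production, a rotated server-side secret

func sign(payload string) string {
	m := hmac.New(sha256.New, key)
	m.Write([]byte(payload))
	return hex.EncodeToString(m.Sum(nil))
}

// encode packs the list of sponsors already won from into a signed cookie.
func encode(sponsors []string) string {
	payload := strings.Join(sponsors, ",")
	return payload + "|" + sign(payload)
}

// mayWin checks the tamper-evident record against the per-user limits.
func mayWin(cookie, sponsor string) bool {
	payload, sig, ok := strings.Cut(cookie, "|")
	if !ok || !hmac.Equal([]byte(sig), []byte(sign(payload))) {
		return false // tampered or malformed cookie
	}
	var sponsors []string
	if payload != "" {
		sponsors = strings.Split(payload, ",")
	}
	if len(sponsors) >= 3 {
		return false // overall cap of 3 packets per user
	}
	for _, s := range sponsors {
		if s == sponsor {
			return false // one packet per sponsor
		}
	}
	return true
}

func main() {
	cookie := encode([]string{"sponsorA"})
	fmt.Println(mayWin(cookie, "sponsorA"), mayWin(cookie, "sponsorB")) // false true
}
```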
Live Show Synchronization
A configuration front‑end allows operators to push changes to two redundant back‑ends (Shanghai and Shenzhen). Changes propagate via RPC, rsync, and a change‑system, reaching all access services within ten seconds. A countdown‑based configuration automates the start and stop of red‑packet phases, resilient to network failures.
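A countdown-style configuration can be sketched as follows, assuming each pushed config carries the phase name, a relative countdown, and a duration that every access server converts to its local clock on receipt; the types and fields are illustrative, not the real configuration format.

```go
// Sketch of a countdown-driven phase config: once a config reaches an
// access server, the relative countdown is pinned to the local clock, so a
// later network failure cannot block the start or stop of a red-packet
// phase. Names are illustrative.
package main

import (
	"fmt"
	"time"
)

type PhaseConfig struct {
	Phase     string        // e.g. "red_packet_round_2"
	Countdown time.Duration // time remaining until this phase begins
	Duration  time.Duration // how long the phase stays active
	startAt   time.Time     // derived locally on receipt
}

// apply converts the relative countdown into local absolute times once,
// at the moment the config reaches this access server.
func (c *PhaseConfig) apply(now time.Time) {
	c.startAt = now.Add(c.Countdown)
}

// active reports whether the phase is currently running by the local clock.
func (c *PhaseConfig) active(now time.Time) bool {
	return now.After(c.startAt) && now.Before(c.startAt.Add(c.Duration))
}

func main() {
	cfg := PhaseConfig{Phase: "red_packet_round_2", Countdown: 2 * time.Second, Duration: 30 * time.Second}
	cfg.apply(time.Now())
	time.Sleep(3 * time.Second)
	fmt.Println(cfg.Phase, "active:", cfg.active(time.Now()))
}
```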
Overload Protection
Clients self-throttle when services time out or return rate-limit responses. The access service monitors CPU usage and returns graded rate-limit levels, causing the client to back off and thereby protecting the backend.
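A minimal sketch of graded overload protection follows, assuming the access service maps CPU utilization to a back-off level that the client translates into a pause before the next shake; the thresholds and level semantics are assumptions for illustration.

```go
// Sketch of CPU-driven graded rate limiting: the server grades its load and
// the client converts the grade into a back-off delay. Thresholds are
// illustrative, not the production values.
package main

import (
	"fmt"
	"time"
)

// backoffLevel turns current CPU utilization into a client-facing grade.
func backoffLevel(cpuPercent float64) int {
	switch {
	case cpuPercent < 70:
		return 0 // no throttling
	case cpuPercent < 85:
		return 1 // mild back-off
	case cpuPercent < 95:
		return 2
	default:
		return 3 // heavy back-off, protect the backend
	}
}

// clientDelay is how long the client waits before shaking again.
func clientDelay(level int) time.Duration {
	return time.Duration(level) * 2 * time.Second
}

func main() {
	for _, cpu := range []float64{40, 80, 97} {
		level := backoffLevel(cpu)
		fmt.Printf("cpu=%.0f%% level=%d delay=%s\n", cpu, level, clientDelay(level))
	}
}
```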
V0.8 Preview Version
The focus shifted to the core experience of opening and sharing red packets. A middle layer consisting of a "red‑packet simplification" component and an asynchronous queue decouples the user‑side information flow from the backend business flow, forming an "iron triangle" that safeguards user experience even if the red‑packet system fails.
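The decoupling can be pictured as an asynchronous hand-off: the user-facing reply comes from the simplified component at once, while the business flow is queued and retried toward the red-packet and payment systems. The sketch below is a hypothetical illustration of that idea; all names and interfaces are ours, not the production design.

```go
// Hedged sketch of the decoupling middle layer: answer the user immediately,
// queue the full "open packet" transaction, and drain it asynchronously with
// retries, so a temporary red-packet system failure delays settlement rather
// than breaking the user experience.
package main

import (
	"fmt"
	"time"
)

type OpenRequest struct {
	UserID   string
	PacketID string
}

// openPacket is the user-facing information flow: reply immediately and
// enqueue the business flow for later settlement.
func openPacket(req OpenRequest, queue chan<- OpenRequest) string {
	queue <- req // bounded queue; a full queue would spill to disk in practice
	return "packet opened; the amount will be credited shortly"
}

// settle drains the queue and drives the backend business flow, retrying on
// failure so the user-side reply never depends on the backend being healthy.
func settle(queue <-chan OpenRequest, callBackend func(OpenRequest) error) {
	for req := range queue {
		for attempt := 1; ; attempt++ {
			if err := callBackend(req); err == nil {
				fmt.Println("settled", req.PacketID)
				break
			}
			time.Sleep(time.Duration(attempt) * 100 * time.Millisecond)
		}
	}
}

func main() {
	q := make(chan OpenRequest, 1024)
	go settle(q, func(OpenRequest) error { return nil }) // stubbed backend call
	fmt.Println(openPacket(OpenRequest{UserID: "u1", PacketID: "p42"}, q))
	time.Sleep(200 * time.Millisecond)
}
```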
The confidence index rose to 70 %.
V1.0 Official Release
The final production version added full-scale stress testing, cross-team code reviews, internal rehearsals, and two pre-heat runs (Feb 12 and Feb 15, 2015), whose headline figures are summarized below:
Feb 12, 2015: 3.1 billion shakes, peak 5 × 10⁷ shakes/min, 5 × 10⁴ red packets/s issued
Feb 15, 2015: 400 million shakes, peak 1.7 × 10⁷ shakes/min, 5 × 10⁴ red packets/s issued
The confidence index reached 80 %, with the remaining risk attributed to unforeseen incidents and luck.
Postscript
WeChat Backend Team
Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.