
Tencent Billing’s Secret to Managing Massive Promo Spikes

Tencent’s billing platform powers billions of daily transactions across 180+ countries, supporting both consumer and business payments, and employs sophisticated capacity testing, dynamic auto‑scaling, resource sharing, and change‑control mechanisms to ensure reliable large‑scale promotional events without service disruptions.

Efficient Ops

1 What Tencent Billing does

The Tencent Billing Platform is an end-to-end online transaction system. At its core, it helps users and products complete payments and collections safely and conveniently while maximizing product revenue.

The platform handles billions in daily revenue and provides payment channels, marketing, account hosting, risk control, settlement, and recommendation services across more than 180 countries and regions, thousands of business codes, and more than one million settlement merchants.

Like the cashier counter in a restaurant, Tencent Billing accepts payments via WeChat Pay, bank cards, Apple Pay, QQ Wallet, recharge cards, coupons, and other methods.

It covers both ToC scenarios (e.g., users recharging their QB accounts or buying in‑game items) and ToB scenarios (e.g., advertisers, streamers, Tencent Cloud customers).

In account hosting, it manages more than 360 billion value-bearing accounts, covering most ToC and ToB business accounts.

In real-time transactions, it covers more than 90% of internal real-time trades, with full settlement coverage.

2 Large‑scale marketing activities

The platform offers core services for internal marketing campaigns, allowing businesses to configure promotions such as first-recharge gifts, purchase-threshold gifts, cumulative gifts, discounts, lotteries, group buying, and bargaining. It supports tens of thousands of activities each year (about 42,000 in 2018).

Top games like Honor of Kings, CrossFire, and QQ Speed, as well as subscription products like Tencent Video and QQ Membership, have tens of millions to over a hundred million active users, so promotional activities can trigger explosive revenue growth.

However, marketing activities can encounter failures. The billing platform, responsible for the company’s revenue, faces high risk and pressure when supporting such campaigns.

Before a robust activity-protection system was built, issues such as service overload, slow scaling, and change-induced failures occurred frequently.

Key characteristics of Tencent Billing’s large‑scale marketing activities include:

1. Long activity chain: A single promotion (e.g., first-recharge gift) may involve login verification, region query, character query, risk check, activity data fetch, order creation, payment, and item delivery, totaling dozens of calls.

2. Peak traffic far exceeds normal traffic: Activity peaks can be dozens of times higher than regular traffic, straining limited shared platform resources.

3. Frequent live-environment changes: Front-end and back-end releases, configuration adjustments, and rule updates add up to more than 300 changes per day, and over 75% of failures are caused by such changes.

4. Incomplete business isolation: The platform supports thousands of businesses without dedicated isolated environments, so one business's peak can interfere with others and trigger avalanche effects.

3 How to evaluate capacity bottlenecks

Capacity evaluation typically uses stress testing. Three common approaches are:

1. Group services into sets (or buckets) and pre-test each set's performance, adding sets to the cluster as needed.

2. Scale down the existing production environment proportionally for testing.

3. Simulate user traffic with realistic data to perform end-to-end stress testing, which is essential for Tencent Billing due to its long chains and extensive integrations.

Effective simulation requires constructing realistic scenarios from historical data in TDW (Tencent's distributed data warehouse). The scenarios cover dimensions such as activity entry, login state, business code, payment channel, and client version, which together produce millions of test-case combinations.
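To see why the number of cases grows so quickly, the sketch below enumerates test cases as the Cartesian product of a few dimensions. The dimension names follow the text, but every value listed is a hypothetical placeholder; the real values would come from historical traffic data, and the actual tooling is not described in the source.

```python
from itertools import product

# Hypothetical dimension values, for illustration only.
DIMENSIONS = {
    "activity_entry": ["in_game", "web", "mini_program"],
    "login_state": ["qq", "wechat", "guest"],
    "business_code": ["game_a", "game_b", "video_vip"],
    "payment_channel": ["wechat_pay", "bank_card", "qq_wallet"],
    "client_version": ["v1", "v2"],
}


def build_cases(dimensions):
    """Yield one test case (a dict) per combination of dimension values."""
    names = list(dimensions)
    for values in product(*(dimensions[name] for name in names)):
        yield dict(zip(names, values))


def count_cases(dimensions):
    """The number of combinations is the product of the dimension sizes."""
    total = 1
    for values in dimensions.values():
        total *= len(values)
    return total


cases = list(build_cases(DIMENSIONS))
```

Even this toy setup yields 3 × 3 × 3 × 3 × 2 = 162 cases. With thousands of business codes and many channels and versions, the same multiplication reaches millions.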

Distributed testing across multiple data centers (e.g., Shenzhen, Shanghai, Tianjin, Chengdu) ensures geographic and network diversity.

A ten‑million‑level phone‑number pool provides realistic user identifiers for scenarios that need region, binding, or friend relationships.

During execution, test cases automatically match appropriate business and number resources.

Stress testing must increase load gradually; if errors or timeouts appear, the test should stop immediately.
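A minimal sketch of this ramp-and-stop rule follows. The `run_step` callable, the step sizes, and the error threshold are all assumptions made for illustration; the source does not describe the actual test driver.

```python
def ramp_load(run_step, start_qps, max_qps, step_qps, max_error_rate=0.0):
    """Raise load step by step and stop at the first sign of trouble.

    run_step(qps) is a hypothetical callable that applies `qps` for one
    interval and returns (error_rate, timed_out).
    Returns the highest QPS that completed cleanly, or None if none did.
    """
    last_good = None
    qps = start_qps
    while qps <= max_qps:
        error_rate, timed_out = run_step(qps)
        if timed_out or error_rate > max_error_rate:
            break  # stop immediately instead of pushing further
        last_good = qps
        qps += step_qps
    return last_good


def fake_step(qps):
    """Stand-in for a real system: healthy up to 3000 QPS, then 5% errors."""
    return (0.0, False) if qps <= 3000 else (0.05, False)


print(ramp_load(fake_step, start_qps=1000, max_qps=10000, step_qps=1000))
```

With the fake system above, the ramp stops at the 4000 QPS step and reports 3000 as the last clean level.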

4 Dynamically allocate marketing resources

Instead of hoarding maximum resources, the platform adopts dynamic allocation based on real‑time demand and peak‑shaving strategies.

Resources are organized into a shared pool and an emergency pool (the latter used only when the shared pool cannot meet demand).
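The two-pool rule can be sketched as follows. The `Pool` class and the `allocate` function are hypothetical names, and counting resources as whole machines is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    free: int  # free machines (or containers) in this pool


def allocate(needed, shared, emergency):
    """Take from the shared pool first; use the emergency pool only for the shortfall.

    Returns (from_shared, from_emergency, unmet).
    """
    from_shared = min(needed, shared.free)
    shared.free -= from_shared
    shortfall = needed - from_shared
    from_emergency = min(shortfall, emergency.free)
    emergency.free -= from_emergency
    return from_shared, from_emergency, shortfall - from_emergency
```

A non-zero `unmet` value signals that both pools are exhausted, which is the point where a real system would need to alert or shed load.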

Automatic scaling is driven by the TSM (Tencent Scaling Management) brain, which collects minute‑level metrics (memory, load, latency, traffic) from the production environment and issues scaling commands according to multi‑level thresholds and trend‑prediction strategies.
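The source does not publish TSM's thresholds or prediction model, so the following is only a sketch of how multi-level thresholds and a trend estimate might combine. The threshold levels, the naive linear extrapolation, and the function names are all assumptions.

```python
# Hypothetical levels: (utilization percent threshold, percent of nodes to add).
SCALE_OUT_LEVELS = [(90, 50), (80, 25), (70, 10)]
SCALE_IN_THRESHOLD = 30  # utilization percent below which one node is released


def predict_next(samples):
    """Naive linear extrapolation from the last two minute-level samples."""
    if len(samples) < 2:
        return samples[-1]
    return samples[-1] + (samples[-1] - samples[-2])


def decide(load_samples, current_nodes):
    """Return the change in node count: positive to scale out, negative to scale in."""
    current = load_samples[-1]
    # Act on the worse of the current value and the projected next value.
    projected = max(current, predict_next(load_samples))
    for threshold, percent in SCALE_OUT_LEVELS:
        if projected >= threshold:
            return max(1, current_nodes * percent // 100)
    if projected < SCALE_IN_THRESHOLD and current_nodes > 1:
        return -1
    return 0
```

For example, `decide([60, 68], 20)` returns 2: utilization is still below 70%, but the trend projects 76% for the next minute, so the lowest scale-out level fires early.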

Scaling relies on the internally developed TDF framework; pre‑installed libraries enable rapid deployment without significant delay.

Deployments are managed by a global release platform that supports serial and parallel modes, operates across domestic and overseas data centers, and ensures high availability and extensibility.

5 Ensure scaling changes are precise

Three detection mechanisms guarantee the correctness of scaling operations:

1. Functional probing of new nodes via demo tools.

2. Horizontal comparison between new and existing nodes.

3. Automated correlation of real-time monitoring alerts.

A change‑control platform aggregates scaling and version changes, performing scan checks and probe verification. Scan checks compare success rates, latency, error codes, etc., while probe verification runs business‑scenario tests.
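As an illustration of a scan check, the sketch below compares a newly added node against a baseline built from existing nodes. The `NodeStats` fields mirror the metrics named in the text, while the tolerances and error codes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    success_rate: float      # 0.0 to 1.0
    p99_latency_ms: float
    error_codes: frozenset   # error codes observed in the window


def scan_check(new, baseline, max_success_drop=0.005, max_latency_ratio=1.2):
    """Compare a new node with existing nodes and return a list of findings."""
    findings = []
    if baseline.success_rate - new.success_rate > max_success_drop:
        findings.append("success rate dropped")
    if new.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        findings.append("latency regressed")
    unseen = new.error_codes - baseline.error_codes
    if unseen:
        findings.append(f"new error codes: {sorted(unseen)}")
    return findings
```

An empty list means the node looks like its peers. Any finding would be a reason to hold the node out of traffic until probe verification passes.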

6 Prevent platform avalanche risk

Even with automation, extreme cases may cause avalanche risk. To mitigate this, the platform enforces concurrency limits per channel and business at the entry layer, dynamically adjusting limits as capacity expands or contracts. The TSM brain also performs flexible traffic sharding among service sets.
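A per-key concurrency cap with runtime-adjustable limits can be sketched as follows. The class name, the `(channel, business)` key shape, and the single-process lock are assumptions; a production entry layer would likely be distributed.

```python
import threading


class ConcurrencyLimiter:
    """Caps in-flight requests per (channel, business) key."""

    def __init__(self, default_limit):
        self._default = default_limit
        self._limits = {}
        self._in_flight = {}
        self._lock = threading.Lock()

    def set_limit(self, key, limit):
        """Adjust a limit at runtime, e.g. after capacity expands or contracts."""
        with self._lock:
            self._limits[key] = limit

    def try_acquire(self, key):
        """Admit the request if the key is under its limit, otherwise reject it."""
        with self._lock:
            limit = self._limits.get(key, self._default)
            current = self._in_flight.get(key, 0)
            if current >= limit:
                return False
            self._in_flight[key] = current + 1
            return True

    def release(self, key):
        """Mark one request for this key as finished."""
        with self._lock:
            if self._in_flight.get(key, 0) > 0:
                self._in_flight[key] -= 1
```

Rejecting excess requests at the entry keeps one busy business from exhausting downstream services that others share.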

7 Summary

The Tencent Billing large‑scale activity automation assurance system is built around five ideas: capacity stress testing, rapid auto‑scaling, resource sharing management, change scanning, and concurrency protection.

With this system in place, the platform reliably supports major holiday and anniversary promotions with minimal on-call staff. Ongoing work focuses on improving support for overseas payments and for deployments outside self-built data centers.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
