How Tencent Scales Its Services for Chinese New Year: Inside Cloud Load‑Testing Strategies

This article details Tencent's cloud load‑testing approach for handling massive traffic spikes during Chinese New Year, covering background challenges, model selection, script authoring options, data construction, report analysis, and real‑world case studies that demonstrate capacity planning and performance optimization.

FunTester

Aug 4, 2023

Background and Challenges

During Chinese New Year, traffic to Tencent services, especially QQ and video streaming, spikes to five to ten times its normal level. Traditional monitoring is passive: it reports problems only after they occur. Cloud load testing provides a proactive way to discover bottlenecks, verify capacity, and ensure service stability before the peak arrives.

Solution Overview

The cloud load‑testing platform offers end‑to‑end capabilities: model selection, test‑case authoring, test‑data construction, and report analysis.

2.1 Load‑test Model Selection

Two modes are supported: concurrent‑user (VU) mode and requests‑per‑second (RPS) mode. The two are linked by Little's law: VU = RPS × average response time, so RPS = VU / average response time. In the linear region, latency stays stable while throughput rises with load; beyond saturation, latency grows sharply and throughput levels off or drops.
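The relationship between the two modes can be checked with a few lines of arithmetic. The following is a minimal sketch, not platform code: the function names and the sample figures (5,000 RPS target, 200 ms average latency) are assumptions chosen for illustration, and the formula only holds in the linear region where latency is stable.

```go
package main

import "fmt"

// requiredVUs applies Little's law (VU = RPS × average response time).
// Latency is given in milliseconds; the result is rounded up because a
// fractional virtual user cannot be scheduled.
func requiredVUs(targetRPS, avgLatencyMs int) int {
	return (targetRPS*avgLatencyMs + 999) / 1000
}

// expectedRPS is the inverse relation: RPS = VU / average response time.
func expectedRPS(vus, avgLatencyMs int) float64 {
	return float64(vus) * 1000 / float64(avgLatencyMs)
}

func main() {
	// Assumed figures for illustration only.
	fmt.Println("VUs needed for 5000 RPS at 200 ms:", requiredVUs(5000, 200))
	fmt.Println("RPS produced by 1000 VUs at 200 ms:", expectedRPS(1000, 200))
}
```

Once the system saturates, latency rises and the same number of virtual users produces fewer requests per second, which is one reason to choose RPS mode when the goal is to hold a fixed request rate.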

2.2 Test Case Authoring

Three scripting options are provided:

JS script mode – high‑level language, easy to compose but requires JavaScript familiarity and incurs extra adaptation work.

Go plugin mode – native Go, hot‑loadable, low overhead, better for complex protocols and high concurrency.

Low‑code / JMeter GUI mode – drag‑and‑drop interface for non‑developers, supports HAR→JS conversion and JMeter extensions.

Example JS script:

// Send an HTTP GET request and verify the response status
import http from 'pts/http';
import { check } from 'pts';

export default function () {
    const resp1 = http.get('http://httpbin.org/get');
    console.log(resp1.body);
    console.log(resp1.json());
    check('status is 200', () => resp1.statusCode === 200);
}

Example Go plugin snippet:

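// Requires "context" and "net/http" from the standard library, plus the
// platform's plugin and assert packages.

// Expose the SDK's Init hook from this plugin.
var Init = plugin.Init

// Run executes one iteration of the test case.
func Run(ctx context.Context) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://httpbin.org/get", nil)
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    assert.True(ctx, "status code is 200", func() bool { return resp.StatusCode == http.StatusOK })
    return nil
}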

2.3 Test Data Construction

The platform provides traffic recording → test‑case conversion, CSV merge, and automatic sharding across load‑generator pods to avoid data duplication and GC pressure. Recorded binary packets are transformed into a cloud‑compatible archive format.
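The sharding idea can be illustrated with a small sketch. The source does not describe how the platform assigns rows to load generators, so the round-robin rule below (row i goes to shard i mod N) is an assumption, as are the function names and the sample CSV columns.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// loadRows parses CSV text and returns the data rows without the header.
func loadRows(data string) ([][]string, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	return records[1:], nil
}

// shardRows returns the rows owned by one load generator. Row i belongs to
// shard i % shardCount, so no row is used by two generators and each
// generator holds only its own slice of the data in memory.
func shardRows(rows [][]string, shardIndex, shardCount int) [][]string {
	var out [][]string
	for i, row := range rows {
		if i%shardCount == shardIndex {
			out = append(out, row)
		}
	}
	return out
}

func main() {
	// Hypothetical test data: one user ID and token per row.
	rows, err := loadRows("uid,token\n1,a\n2,b\n3,c\n4,d\n5,e\n")
	if err != nil {
		panic(err)
	}
	for shard := 0; shard < 2; shard++ {
		fmt.Printf("generator %d: %v\n", shard, shardRows(rows, shard, 2))
	}
}
```

A production implementation would more likely split the file by byte range or stream it, so that no single process has to read the whole data set first.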

2.4 Report Analysis

Unified observability is built on OpenTelemetry, delivering metrics, traces, and logs. Users can filter by custom status codes, drill into error logs, and perform trace‑based traffic coloring. Metrics are reported as gauges, counters, and histograms, while sampling strategies limit log volume.
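One simple way to limit log volume is to keep only the first few log entries per status code while metrics continue to count every request. The sketch below shows that idea; it is an assumption for illustration, not the platform's actual sampling strategy, and the LogSampler type is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// LogSampler keeps at most `limit` log entries per key (for example, per
// status code). It is safe for concurrent use.
type LogSampler struct {
	mu    sync.Mutex
	limit int
	seen  map[string]int
}

func NewLogSampler(limit int) *LogSampler {
	return &LogSampler{limit: limit, seen: make(map[string]int)}
}

// Allow reports whether one more log entry for key should be written.
func (s *LogSampler) Allow(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[key] >= s.limit {
		return false
	}
	s.seen[key]++
	return true
}

func main() {
	sampler := NewLogSampler(2)
	errorCount := 0 // the counter metric still records every error
	for i := 0; i < 5; i++ {
		errorCount++
		if sampler.Allow("500") {
			fmt.Println("logged error sample", i)
		}
	}
	fmt.Println("total errors counted:", errorCount)
}
```

In a long-running test the per-key counts would typically be reset on a time window, so that fresh samples keep arriving throughout the run.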

Practical Cases

3.1 Mobile QQ Spring Festival Assurance

Spring Festival peaks on read and write paths required a new Go‑plugin test suite. Switching from JS to Go increased throughput at 1,000 concurrent users by roughly 90% and reduced memory pressure, enabling tests to scale to 100,000 concurrent users across Shanghai, Nanjing, Guangzhou, and other regions.

Outcomes: early detection of overload, refined retry/timeout policies, and successful capacity expansion.
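Retry policy matters under load because unbounded, immediate retries multiply traffic just when the server is already overloaded. The source does not give the refined policies, so the following is only a sketch of one common pattern: a bounded number of attempts with capped exponential backoff. The function name, the error value, and the delay figures are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrOverloaded is a hypothetical transient error used for illustration.
var ErrOverloaded = errors.New("server overloaded")

// retryWithBackoff calls op up to maxAttempts times. After each failure it
// sleeps, doubling the delay each time but never exceeding maxDelay.
func retryWithBackoff(maxAttempts int, baseDelay, maxDelay time.Duration, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(4, 10*time.Millisecond, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return ErrOverloaded // fail the first two attempts
		}
		return nil
	})
	fmt.Println("calls:", calls, "error:", err)
}
```

A load test can then confirm that the chosen attempt limit and delays keep retry traffic from pushing a saturated service further past its limit.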

3.2 Video Service Disaster‑Recovery Drill

Chaos‑engineering‑style drills validated fallback, rate‑limiting, and circuit‑breaker mechanisms. Integrated SLA monitoring and automatic traffic degradation ensured the system could sustain 10,000 RPS per region without cascading failures.
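For readers unfamiliar with the circuit-breaker mechanism that such drills exercise, here is a deliberately simplified sketch. It is not Tencent's implementation: the Breaker type, the consecutive-failure threshold, and the cooldown rule are assumptions, and a production breaker would normally admit only a single trial request after the cooldown.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Breaker opens after `threshold` consecutive failures and rejects calls
// until `cooldown` has elapsed, after which trial requests are let through.
type Breaker struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	failures  int
	open      bool
	openedAt  time.Time
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// Allow reports whether a request may be sent to the downstream service.
func (b *Breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.open {
		return true
	}
	return time.Since(b.openedAt) >= b.cooldown
}

// Record feeds the outcome of a request back into the breaker.
func (b *Breaker) Record(success bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if success {
		b.failures = 0
		b.open = false
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.open = true
		b.openedAt = time.Now() // (re)start the cooldown
	}
}

func main() {
	b := NewBreaker(3, 5*time.Second)
	for i := 1; i <= 4; i++ {
		fmt.Printf("request %d allowed: %v\n", i, b.Allow())
		b.Record(false) // simulate a failing downstream
	}
}
```

A drill would inject downstream failures under load and check that rejected requests fall back quickly instead of queuing, which is what prevents one failing dependency from cascading.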

Summary and Outlook

The cloud load‑testing platform now supports HTTP, gRPC, WebSocket, and custom protocols via Go plugins. Future work includes tighter server‑side metric integration, automated capacity‑estimation, and AI‑driven scenario generation to further reduce manual effort and improve test reliability.