Full-Chain Load Testing Architecture and Implementation at Bilibili
Bilibili’s full‑chain load‑testing architecture embeds a unique scenario identifier in every request and propagates it through RPC calls, databases, caches, and async queues. It isolates production data via shadow resources and provides a transparent SDK, a link analyzer, and a mock server. With few business code changes, this testing uncovered more than ten production issues and raised the TPS of the live‑streaming revenue system by 140% on the same resources.
Full‑chain load testing (全链路压测) simulates normal user operation paths in a production environment to achieve high fidelity and comprehensive scenario coverage. The article introduces Bilibili's practice, building on experiences from Alibaba, Meituan, and ByteDance, and describes how the company integrates full‑chain testing into its infrastructure.
Solution Selection
The required capabilities are divided into two parts: basic load‑testing capabilities (pressure generation, scenario editing, data management) and full‑chain capabilities (identifier propagation and data isolation). The existing Melloi platform already provides the basic capabilities.
Full‑Chain Identifier Propagation
A unique test identifier (the scenario ID) is defined as a string and attached to requests. For RPC protocols (HTTP/gRPC), the identifier is set by the client, transmitted with the request, and read by the server. The transmission format is illustrated in the original article.
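The client-sets, server-reads flow for HTTP can be sketched as below. This is a minimal illustration and not Bilibili's SDK: the header name X-Scenario-Id and the helpers WithScenario, ScenarioFrom, Inject, Extract, and simulateHop are all hypothetical, since the summary does not state the real wire format.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// scenarioHeader is a hypothetical header name; the source does not name the real one.
const scenarioHeader = "X-Scenario-Id"

type scenarioKey struct{}

// WithScenario returns a context that carries the scenario ID.
func WithScenario(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, scenarioKey{}, id)
}

// ScenarioFrom returns the scenario ID, or "" for ordinary production traffic.
func ScenarioFrom(ctx context.Context) string {
	id, _ := ctx.Value(scenarioKey{}).(string)
	return id
}

// Inject is the client side: it copies the ID from ctx into the outgoing request.
func Inject(ctx context.Context, req *http.Request) {
	if id := ScenarioFrom(ctx); id != "" {
		req.Header.Set(scenarioHeader, id)
	}
}

// Extract is the server side: it reads the header and stores the ID in the request context.
func Extract(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if id := r.Header.Get(scenarioHeader); id != "" {
			r = r.WithContext(WithScenario(r.Context(), id))
		}
		next.ServeHTTP(w, r)
	})
}

// simulateHop models one client-to-server hop in process and returns
// the scenario ID that the server handler observes.
func simulateHop(id string) string {
	// Client side: the ID lives in the context and is written to the header.
	ctx := WithScenario(context.Background(), id)
	out := httptest.NewRequest(http.MethodGet, "/gift/send", nil)
	Inject(ctx, out)

	// "Wire": only the headers cross the process boundary, not the context.
	in := httptest.NewRequest(http.MethodGet, "/gift/send", nil)
	in.Header = out.Header.Clone()

	var seen string
	handler := Extract(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = ScenarioFrom(r.Context())
	}))
	handler.ServeHTTP(httptest.NewRecorder(), in)
	return seen
}

func main() {
	fmt.Printf("server saw scenario %q\n", simulateHop("scn-001"))
}
```

A gRPC variant would follow the same shape, with request metadata in place of the HTTP header and interceptors in place of the middleware.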
Within services, the identifier must be carried through business logic, including downstream RPC calls, database accesses, cache accesses, and logging. Implementations for Java and Go are discussed.
For asynchronous scenarios (message queues), two methods are provided:
Shadow‑queue mode: messages with the test identifier are sent to a separate queue.
Metadata mode: the identifier is embedded in the message payload as metadata.
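The two queue modes can be contrasted with a small sketch. Everything here is an assumption made for illustration: the Message type, the shadow_ topic prefix, and the scenario_id key are hypothetical, and the metadata is modeled as message headers although the source describes it as embedded in the payload.

```go
package main

import "fmt"

// Message is a simplified queue message; real queue clients define their own types.
type Message struct {
	Topic   string
	Headers map[string]string
	Body    []byte
}

const (
	shadowPrefix = "shadow_"     // hypothetical naming convention for shadow queues
	scenarioKey  = "scenario_id" // hypothetical metadata key
)

// Mode selects how test traffic is marked on the queue.
type Mode int

const (
	ShadowQueue Mode = iota // route test messages to a separate queue
	Metadata                // keep the queue, tag the message itself
)

// Prepare returns the message as it should be published for the given scenario ID.
// Production messages (empty ID) are returned unchanged.
func Prepare(m Message, scenarioID string, mode Mode) Message {
	if scenarioID == "" {
		return m
	}
	out := Message{Topic: m.Topic, Body: m.Body, Headers: map[string]string{}}
	for k, v := range m.Headers {
		out.Headers[k] = v
	}
	switch mode {
	case ShadowQueue:
		out.Topic = shadowPrefix + m.Topic
	case Metadata:
		out.Headers[scenarioKey] = scenarioID
	}
	return out
}

// ScenarioOf recovers the ID on the consumer side in metadata mode.
func ScenarioOf(m Message) string {
	return m.Headers[scenarioKey]
}

func main() {
	m := Message{Topic: "orders", Body: []byte("payload")}
	fmt.Println(Prepare(m, "scn-001", ShadowQueue).Topic)
	fmt.Println(ScenarioOf(Prepare(m, "scn-001", Metadata)))
}
```

In shadow‑queue mode the consumer of the shadow queue would have to be told which scenario it serves (for example through configuration), whereas metadata mode lets a single consumer restore the identifier per message.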
For scheduled tasks, a shadow task is created alongside the normal task, with the identifier injected via configuration.
Data Isolation
To avoid contaminating production data, four isolation rules are defined (passthrough, drop, mock, mirror). The article focuses on mock and mirror implementations, describing shadow databases/tables, shadow keys for caches, and shadow topics for message queues.
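One way to picture the four rules is as a resolver that decides where a data operation goes. The source names the rules but does not define them, so the meanings given in the comments below, the shadow_ prefix, and the Target and Resolve names are assumptions for illustration only.

```go
package main

import "fmt"

// Rule is one of the four isolation rules named in the article.
// The semantics in the comments are assumed, not taken from the source.
type Rule int

const (
	Passthrough Rule = iota // use the production resource unchanged
	Drop                    // skip the operation entirely
	Mock                    // answer from a mock instead of the real resource
	Mirror                  // redirect to a shadow table, key, or topic
)

const shadowPrefix = "shadow_" // hypothetical naming convention

// Target describes what the data layer should do with one operation.
type Target struct {
	Name    string // resource to use; empty when Skip or UseMock is set
	Skip    bool
	UseMock bool
}

// Resolve applies a rule to a resource name (table, cache key, or topic).
// Traffic without a scenario ID always goes to the production resource.
func Resolve(resource, scenarioID string, rule Rule) Target {
	if scenarioID == "" {
		return Target{Name: resource}
	}
	switch rule {
	case Drop:
		return Target{Skip: true}
	case Mock:
		return Target{UseMock: true}
	case Mirror:
		return Target{Name: shadowPrefix + resource}
	default:
		return Target{Name: resource}
	}
}

func main() {
	fmt.Printf("%+v\n", Resolve("orders", "scn-001", Mirror))
	fmt.Printf("%+v\n", Resolve("user:42:balance", "scn-001", Mirror))
	fmt.Printf("%+v\n", Resolve("orders", "", Mirror))
}
```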
System Design
The design aims to minimize business code changes by providing a transparent SDK that handles identifier propagation, data isolation, and configuration. A flexible configuration system allows per‑service, per‑RPC, and per‑data‑layer settings, as well as scenario‑level configurations to precisely control traffic scope.
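A layered configuration of this kind might be modeled as a lookup in which the most specific setting wins. The structure, the target key format (such as "db:orders"), and the precedence order below are hypothetical; the summary does not describe the real schema.

```go
package main

import "fmt"

// Config is a hypothetical layout for one service's full-chain settings.
type Config struct {
	Default    string                       // service-wide rule
	ByTarget   map[string]string            // per-RPC or per-data-layer rule, keyed by target
	ByScenario map[string]map[string]string // scenario ID -> target -> rule
}

// RuleFor returns the rule for a target under a scenario.
// Assumed precedence: scenario override, then target setting, then service default.
func (c Config) RuleFor(scenarioID, target string) string {
	if r, ok := c.ByScenario[scenarioID][target]; ok {
		return r
	}
	if r, ok := c.ByTarget[target]; ok {
		return r
	}
	return c.Default
}

// exampleConfig returns sample settings used by main.
func exampleConfig() Config {
	return Config{
		Default: "passthrough",
		ByTarget: map[string]string{
			"db:orders":        "mirror",
			"rpc:/pay.Pay/Do":  "mock",
			"mq:gift-received": "mirror",
		},
		ByScenario: map[string]map[string]string{
			"scn-002": {"rpc:/pay.Pay/Do": "drop"},
		},
	}
}

func main() {
	c := exampleConfig()
	fmt.Println(c.RuleFor("scn-001", "rpc:/pay.Pay/Do"))
	fmt.Println(c.RuleFor("scn-002", "rpc:/pay.Pay/Do"))
	fmt.Println(c.RuleFor("scn-001", "cache:user"))
}
```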
Link Analyzer
Using Dapper trace data, the link analyzer provides:
Business scenario call‑chain analysis.
Dependency component analysis (cache, DB, etc.).
Identifier coverage analysis.
Configuration coverage analysis.
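Identifier coverage analysis, for instance, can be approximated from trace spans as follows. The Span fields and the Coverage, LostAt, and sampleTrace names are hypothetical and much simpler than real trace data; the sketch only shows the idea of locating where the identifier stops propagating.

```go
package main

import "fmt"

// Span is a simplified trace span; real trace records carry far more fields.
type Span struct {
	ID, ParentID, Service string
	Tagged                bool // whether the span carried the scenario ID
}

// Coverage returns the fraction of spans that carried the identifier.
func Coverage(spans []Span) float64 {
	if len(spans) == 0 {
		return 0
	}
	tagged := 0
	for _, s := range spans {
		if s.Tagged {
			tagged++
		}
	}
	return float64(tagged) / float64(len(spans))
}

// LostAt lists the services whose span lacks the identifier although the
// parent span had it, i.e. the hops where propagation broke.
func LostAt(spans []Span) []string {
	byID := make(map[string]Span, len(spans))
	for _, s := range spans {
		byID[s.ID] = s
	}
	var lost []string
	for _, s := range spans {
		parent, ok := byID[s.ParentID]
		if ok && parent.Tagged && !s.Tagged {
			lost = append(lost, s.Service)
		}
	}
	return lost
}

// sampleTrace is a made-up call chain: gateway -> gift -> wallet -> wallet-db.
func sampleTrace() []Span {
	return []Span{
		{ID: "1", Service: "gateway", Tagged: true},
		{ID: "2", ParentID: "1", Service: "gift", Tagged: true},
		{ID: "3", ParentID: "2", Service: "wallet", Tagged: false},
		{ID: "4", ParentID: "3", Service: "wallet-db", Tagged: false},
	}
}

func main() {
	spans := sampleTrace()
	fmt.Println(Coverage(spans), LostAt(spans))
}
```

Reporting only the first untagged hop keeps the output focused on the service that needs fixing, since everything downstream of it is untagged as a consequence.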
Mock Server
Mock servers can be selected from open‑source solutions or built in‑house. The SDK forwards matching test requests to the mock server via interceptors. The mock server must return dynamic responses based on request parameters and simulate realistic latency.
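The two requirements on the mock server, parameter‑dependent responses and simulated latency, can be shown with a small standard‑library handler. The endpoint, the uid parameter, and the balance formula are invented for the example, and a fixed delay stands in for a realistic latency distribution.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// mockBalance returns a handler that answers like a hypothetical balance service.
// The response depends on the uid query parameter, and every call is delayed by
// a fixed latency to imitate the real dependency.
func mockBalance(latency time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(latency)
		uid := r.URL.Query().Get("uid")
		if uid == "" {
			http.Error(w, "missing uid", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		// Made-up rule so that different inputs yield different outputs.
		fmt.Fprintf(w, `{"uid":%q,"balance":%d}`, uid, len(uid)*100)
	})
}

// call invokes the mock in process and reports status, body, and elapsed milliseconds.
func call(target string, latencyMs int) (int, string, int64) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, target, nil)
	start := time.Now()
	mockBalance(time.Duration(latencyMs)*time.Millisecond).ServeHTTP(rec, req)
	return rec.Code, rec.Body.String(), time.Since(start).Milliseconds()
}

func main() {
	status, body, elapsed := call("/balance?uid=42", 20)
	fmt.Println(status, body, elapsed)
}
```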
Full‑Chain SDK (Go Example)
The SDK is non‑intrusive and requires only a one‑line import and initialization:
    import "go-common/library/mirror/mirroragent"

    mirroragent.Init(nil)
    defer mirroragent.Close()
It also supports adding shadow cron jobs with a single extra argument:
    cron.AddFunc("@every 15s", fun1, "loaddata", cronUtil.PreLoad()) // original job
    cron.AddFunc("@every 15s", fun1, "loaddata", cronUtil.PreLoad(), cronUtil.Mirror("mirrorId")) // shadow job
Platform Support
The underlying platforms have been extended to manage shadow tables, shadow queues, and shadow cache keys, enabling quick creation, cleanup, and data loading for test scenarios.
Business Transformation
Business changes are limited to SDK integration, special‑scenario adaptations (e.g., scheduled tasks), and addressing Context misuse in Go code. A custom Context lint tool, based on SSA analysis, checks for common Context misuse patterns and is integrated into golangci‑lint.
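Why Context misuse matters for full‑chain testing can be seen in a short example. The summary does not list the patterns the lint tool checks, so the pattern below (replacing the caller's context with context.Background()) is only an assumed instance; all function names and the shadow_ prefix are hypothetical.

```go
package main

import (
	"context"
	"fmt"
)

type scenarioKey struct{}

// withScenario and scenarioFrom stand in for the SDK's context helpers.
func withScenario(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, scenarioKey{}, id)
}

func scenarioFrom(ctx context.Context) string {
	id, _ := ctx.Value(scenarioKey{}).(string)
	return id
}

// tableFor picks the shadow table when the context carries a scenario ID.
func tableFor(ctx context.Context, table string) string {
	if scenarioFrom(ctx) != "" {
		return "shadow_" + table
	}
	return table
}

// saveOrderBad discards the caller's context, so the scenario ID is lost and
// test traffic would be written to the production table.
func saveOrderBad(ctx context.Context) string {
	return tableFor(context.Background(), "orders") // misuse: ctx is ignored
}

// saveOrderGood passes the caller's context through unchanged.
func saveOrderGood(ctx context.Context) string {
	return tableFor(ctx, "orders")
}

// demo returns the table each variant would write to for a given scenario ID.
func demo(id string) (bad, good string) {
	ctx := withScenario(context.Background(), id)
	return saveOrderBad(ctx), saveOrderGood(ctx)
}

func main() {
	bad, good := demo("scn-001")
	fmt.Println("bad:", bad, "good:", good)
}
```

Because such a bug compiles and behaves correctly for production traffic, a static check is a natural way to catch it before a load test does.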
Load‑Testing Implementation
The end‑to‑end process is summarized in eight steps with detailed guides. After adoption, Bilibili identified over ten production issues (misconfigurations, slow queries, lock conflicts). Optimizations based on test findings increased the live streaming revenue system’s TPS by 140% under the same resources.
References
1. Alibaba’s Double‑11 full‑chain testing practice.
2. ByteDance’s Rhino full‑chain testing.
3. Takin open‑source platform.
4. Uber’s CRISP analysis.
5. GolangCI‑lint usage.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
