How Bilibili Scaled Live Gift Revenue with Full‑Link Automated Load Testing
This article details Bilibili's end‑to‑end full‑link load‑testing solution for its live‑stream gifting service, covering industry alternatives, the chosen architecture, a three‑stage automated testing framework, link analysis, configuration, validation, and practical case studies to ensure system stability under massive traffic spikes.
Background and Significance
Bilibili's live‑stream gifting revenue system sees massive traffic surges during large events and has strict real‑time data requirements. Traditional load testing cannot accurately emulate production conditions because of data masking and blacklist handling. Comprehensive full‑link load testing is therefore needed to assess capacity, discover bottlenecks, and mitigate risks.
Industry Full‑Link Load‑Testing Approaches
Three common schemes were compared:
Scheme 1: Traffic mixing, storage isolation, online pressure – expands services, isolates data via shadow tables/keys, and tags traffic for identification.
Scheme 2: Data marking, logical isolation, online pressure – adds markers to original tables for logical separation, requiring service adaptation.
Scheme 3: Mirror environment or offline testing – builds separate test clusters, but suffers from hardware and data divergence, limiting result relevance.
Given Bilibili's uniform language stack, shared components, and mature service governance, Scheme 1 was selected.
Bilibili's Full‑Link Load‑Testing Architecture
The solution consists of three pillars: traffic mixing, online pressure, and storage isolation. Traffic mixing shares cluster resources during low‑traffic windows and tags requests (HTTP/gRPC) for identification. An SDK intercepts and processes marked traffic. Storage isolation creates shadow tables, keys, and topics for MySQL/TiDB, Redis, and MQ, ensuring test data never contaminates production.
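The routing idea behind storage isolation can be sketched as follows. This is a minimal illustration, not Bilibili's SDK: the header name `x-mirror`, the `shadow_` prefix, and both function names are assumptions, since the article does not state the real conventions.

```python
from typing import Mapping

# Hypothetical names: the article does not give the real header or prefix.
MIRROR_HEADER = "x-mirror"
SHADOW_PREFIX = "shadow_"


def is_mirror(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries the load-test tag."""
    return headers.get(MIRROR_HEADER) == "1"


def resolve_resource(name: str, headers: Mapping[str, str]) -> str:
    """Map a table, key, or topic name to its shadow twin for tagged traffic."""
    return SHADOW_PREFIX + name if is_mirror(headers) else name
```

With this sketch, a tagged request that writes to `gift_order` would be routed to `shadow_gift_order`, while untagged production requests keep the original name.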
Three‑Stage Automated Testing Design
Stage 1 – Foundation Assurance: Validate new nodes (e.g., mirror SDK, configuration console) to guarantee baseline correctness.
Stage 2 – Business Integration and Full‑Process Verification: Apply automated tests to extensive, evolving business services, handling legacy debt and non‑standard usage patterns.
Stage 3 – Platformization and Visualization: Provide reusable automation for recurring tasks as full‑link testing becomes routine, reducing manual effort and human cost.
Key Testing Challenges
The large number of services and interfaces leads to a massive test volume.
Deeply embedded changes affect MySQL, TiDB, Redis, and Databus, making manual coverage impractical.
Complex dependencies mean any mis‑configuration can leak test traffic into production, requiring thorough pre‑test “mine‑sweeping”.
Design Principles
Breadth: Leverage an API‑automation platform to achieve wide business coverage.
Efficiency: Replace manual checks with tool‑driven automatic verification.
Full‑Link Automated Testing Workflow
1. Link Analysis
Combines dynamic tracing (trace system) and static linting (bilicontextcheck) to map service call chains, identify context misuse, and ensure trace IDs propagate without interruption.
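The dynamic half of link analysis can be illustrated with a small check over collected spans. This is a sketch under assumptions: the `Span` fields and the function name are hypothetical, and the article does not describe the real trace format or how bilicontextcheck works internally.

```python
from typing import Dict, List, Optional, TypedDict


class Span(TypedDict):
    # Hypothetical span shape; real trace systems carry more fields.
    span_id: str
    parent_id: Optional[str]
    service: str
    mirror: bool


def find_broken_links(spans: List[Span]) -> List[str]:
    """Report spans whose parent is missing or whose mirror tag was dropped."""
    by_id: Dict[str, Span] = {s["span_id"]: s for s in spans}
    problems: List[str] = []
    for span in spans:
        parent_id = span["parent_id"]
        if parent_id is None:
            continue  # root span has no parent to check
        parent = by_id.get(parent_id)
        if parent is None:
            problems.append(f"{span['service']}: parent span {parent_id} missing")
        elif parent["mirror"] and not span["mirror"]:
            problems.append(
                f"{span['service']}: mirror tag lost from {parent['service']}"
            )
    return problems
```

A missing parent usually points to a call that started a fresh context instead of propagating the incoming one, which is the kind of context misuse the static lint is meant to catch before runtime.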
2. Configuration Confirmation
Using the analysis results, the team configures the load‑testing console: defining target interfaces, downstream dependencies, and rules for pass‑through, mirroring, discard, or mock for databases, caches, and message queues.
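One way to picture the console rules is as a lookup table from a dependency to an action. The sketch below is illustrative only: the resource names, the `RULES` layout, and the choice to raise on an unconfigured dependency are assumptions, not the console's actual data model.

```python
from enum import Enum


class Action(Enum):
    PASS_THROUGH = "pass_through"  # use the production resource as-is
    MIRROR = "mirror"              # redirect to the shadow table/key/topic
    DISCARD = "discard"            # drop the operation for test traffic
    MOCK = "mock"                  # return a canned result


# Hypothetical rule table keyed by (component type, resource name).
RULES = {
    ("tidb", "gift_ledger"): Action.MIRROR,
    ("redis", "room_config"): Action.PASS_THROUGH,
    ("mq", "push_notify"): Action.DISCARD,
    ("mysql", "risk_profile"): Action.MOCK,
}


def action_for(component: str, resource: str) -> Action:
    """Look up the rule for test traffic; fail loudly when none is configured."""
    try:
        return RULES[(component, resource)]
    except KeyError:
        raise LookupError(f"no load-test rule for {component}:{resource}") from None
```

Raising on a missing rule is a conservative design choice in this sketch: an unconfigured dependency is treated as an error rather than silently defaulting to pass‑through, which is where test traffic could leak into production.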
3. Automated Validation
Automated cases verify both normal and test traffic across four dimensions:
Response validation – correct output, format, fault tolerance, security.
Storage validation – MySQL/TiDB/Redis reads‑writes meet expectations.
Asynchronous flow validation – test tags traverse async pipelines (e.g., order settlement).
Link completeness validation – trace_toolset confirms that every node on the link behaves according to the configured rules.
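The link completeness dimension can be sketched as a comparison between the configured action for each node and the action actually observed in the trace. The data shapes and function name here are hypothetical; the article does not document the real trace_toolset interface.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, str]  # (component, resource), e.g. ("mysql", "gift_order")


def check_link_completeness(
    expected: Dict[Node, str], observed: Dict[Node, str]
) -> List[str]:
    """Compare the action each node took against its configured rule."""
    issues: List[str] = []
    for node, want in expected.items():
        got = observed.get(node)
        if got is None:
            issues.append(f"{node[0]}:{node[1]} not seen in trace")
        elif got != want:
            issues.append(f"{node[0]}:{node[1]} expected {want}, got {got}")
    # Nodes that appear in the trace without any rule are also suspicious.
    for node in observed:
        if node not in expected:
            issues.append(f"{node[0]}:{node[1]} has no configured rule")
    return issues
```

An empty result means every configured node was reached and behaved as configured, and no unconfigured node appeared on the link.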
Automated Framework Refactor
The existing test framework was reorganized into three layers:
Case layer: business‑specific test scenarios.
Invoker layer: request wrapping (HTTP/gRPC), config management, assertions, middleware integration, and the new trace_toolset for link completeness.
Coverage layer: collects coverage metrics for gRPC and PHP services.
Key refactor points:
Introduce a global config.mirror switch to toggle between normal and test traffic.
Inject the mirror identifier into request headers.
Embed trace_toolset for automatic link integrity checks.
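The first two refactor points might look like the following in the invoker layer. This is a sketch under assumptions: only the `config.mirror` name comes from the article, while the `Config` class, the `x-mirror` header, and `build_headers` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MIRROR_HEADER = "x-mirror"  # hypothetical header name


@dataclass
class Config:
    # Global switch: False sends normal traffic, True sends test traffic.
    mirror: bool = False


def build_headers(
    config: Config, base: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return request headers, adding the mirror identifier when enabled."""
    headers = dict(base or {})  # copy so the caller's dict is not mutated
    if config.mirror:
        headers[MIRROR_HEADER] = "1"
    else:
        headers.pop(MIRROR_HEADER, None)  # never leak a stale tag
    return headers
```

Keeping the switch in one place lets the same case run twice, once with `Config(mirror=False)` and once with `Config(mirror=True)`, without any change to the case itself.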
Case Practice
Service dependency mapping identified all interfaces needing test integration. Configuration rules (pass‑through, mirror, discard) were applied in the console. Automated cases were executed for both normal and test traffic, covering response, storage, async flow, and trace completeness.
Failed cases were triaged into four categories: service code defects, configuration errors, SDK/console issues, and platform problems. After the fixes, the cases were re‑run, and successful validation paved the way for staged rollout, gray release, and finally full‑link load testing in production.
Future Work
The third stage will address “routine” full‑link testing, further automating repetitive tasks, reducing manpower, and extending the toolset across the entire testing lifecycle.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
