Design and Implementation of a Traffic Replay System for Bilibili Membership Purchase Service
This article describes how Bilibili's Membership Purchase team built a Java-based traffic replay platform using JVM-Sandbox AOP, Kafka, and MySQL. The platform captures real-world request/response data, serializes it as JSON, and replays it for comprehensive regression testing of complex backend services.
Background
Bilibili's Membership Purchase (会员购) e-commerce platform has grown in scale and complexity. Rapid business iteration now depends on testing that is robust, compatible, and effective, which has become hard to sustain.
The team first considered expanding test suites and automated regression scripts, but the constantly evolving system made script maintenance impractical, prompting exploration of traffic replay to turn real online data into exhaustive regression test cases.
Research
Existing traffic replay tools such as Tcpreplay and TCPCopy only copy inbound request traffic and require the replay environment to mirror production databases, caches, and third-party services. That is an unrealistic expectation for the Membership Purchase architecture.
Analyzing the system revealed a Java-centric stack: Spring Cloud microservices, MyBatis (TK) for MySQL, RedisTemplate for caching, and Feign for inter-service HTTP calls. The goals became: record and replay HTTP entry points together with the internal DB, Redis, and Feign calls; keep call IDs traceable; support an unlimited number of replay environments; and keep instrumentation overhead minimal.
Attempt
Inspired by Alibaba's JVM-SANDBOX-REPEATER, the team built a custom Copy Agent based on JVM-Sandbox to instrument code via AOP. The agent intercepts RestController, MyBatis MapperProxy, Spring Data Redis operations, and FeignClient calls, serializes request/response data (including class metadata) to JSON, and pushes it to Kafka.
Why JSON serialization of metadata? Plain object serialization failed for generics, proxies, and abstract types, because the concrete class cannot be recovered from the data alone. Using Jackson to bundle the data with its type information solved these compatibility issues at the cost of larger payloads.
Why Kafka? Asynchronous messaging avoids coupling the recorder to a synchronous HTTP call and reduces the impact of recording on business traffic. Recorded data is placed in a bounded LinkedBlockingQueue and sent to Kafka from there, which provides back-pressure handling.
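The buffering step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's code: `RecordSink` is a hypothetical stand-in for a Kafka producer, and dropping payloads when the queue is full is only one possible policy, since the source does not say how a full queue is handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a Kafka producer; not a real Kafka API. */
interface RecordSink {
    void send(String json);
}

/** Buffers recorded payloads in a bounded queue so the business thread never blocks. */
class RecordBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    RecordBuffer(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Called on the business thread. offer() returns at once; a full queue drops the payload (assumed policy). */
    boolean submit(String json) {
        boolean accepted = queue.offer(json);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Called on a background thread. Moves up to maxBatch queued payloads to the sink. */
    int drainTo(RecordSink sink, int maxBatch) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        for (String json : batch) {
            sink.send(json);
        }
        return batch.size();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

With a capacity of 2, a third `submit` before any drain is rejected and counted, so a slow or unavailable broker costs recorded samples rather than request latency.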
Why MySQL? A few days of recorded traffic amounts to about 1 million rows (about 10 GB, or roughly 10 KB per row on average). MySQL offers sufficient capacity at this scale, with optional Snappy compression and trace-based sampling available to reduce the volume. TiDB and time-series stores are being evaluated as alternatives.
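One way to read "trace-based sampling" is that the keep-or-drop decision is derived from the trace ID, so every record belonging to a trace is kept or dropped together. The sketch below assumes that reading; `TraceSampler` and its percentage parameter are hypothetical and do not come from the source.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Decides per trace, not per record, whether traffic is recorded (assumed behaviour). */
class TraceSampler {
    private final int percent; // share of traces to keep, 0..100

    TraceSampler(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]");
        }
        this.percent = percent;
    }

    /** The same traceId always yields the same answer, so a trace is never half recorded. */
    boolean shouldRecord(String traceId) {
        CRC32 crc = new CRC32();
        crc.update(traceId.getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % 100 < percent;
    }
}
```

Hashing the trace ID instead of drawing a random number per call matters for replay: a trace with only some of its DB, Redis, or Feign results recorded could not be replayed end to end.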
Implementation Details
During recording, each intercepted call receives a trace-based Index stored in a global ConcurrentHashMap (replacing an earlier ThreadLocal approach). The Index distinguishes repeated calls to the same entry point within one trace, enabling precise data lookup during replay.
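The indexing idea can be sketched as follows. `TraceIndexRegistry`, the key format, and the `clear` method are hypothetical; the source only states that a trace-based Index is kept in a global ConcurrentHashMap.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Numbers repeated calls to the same entry point within one trace (illustrative sketch). */
class TraceIndexRegistry {
    // Key: traceId + "|" + entryPoint. Value: how many calls have been seen so far.
    private final ConcurrentHashMap<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    /** Returns 0 for the first call to entryPoint in this trace, 1 for the second, and so on. */
    int nextIndex(String traceId, String entryPoint) {
        String key = traceId + "|" + entryPoint;
        return counters.computeIfAbsent(key, k -> new AtomicInteger()).getAndIncrement();
    }

    /** Builds the lookup key under which a call's result would be stored or fetched. */
    String recordKey(String traceId, String entryPoint) {
        return traceId + "|" + entryPoint + "#" + nextIndex(traceId, entryPoint);
    }

    /** Removes a finished trace's counters so the global map does not grow without bound. */
    void clear(String traceId) {
        counters.keySet().removeIf(k -> k.startsWith(traceId + "|"));
    }
}
```

If recording and replay both number calls this way, the second `OrderMapper.selectById` call in a trace maps to the second recorded result rather than the first. A shared map also keeps counting correctly when calls in one trace run on different threads, which may be why it replaced the ThreadLocal, though the source does not give the reason.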
Replay follows the same instrumentation points: the Repeat Agent reads JSON payloads from Kafka, deserializes them, and returns the recorded result instead of invoking the real DB, Redis, or remote service.
The system also uses a lightweight HTTP header to toggle replay mode, ensuring non‑replay traffic proceeds normally.
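A rough sketch of how an interception point might choose between the real call and the recorded result. The header name `X-Replay`, the class `ReplayInterceptor`, and the in-memory store are assumptions made for illustration; the source does not name the header or describe the lookup structure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Returns recorded results in replay mode and delegates to the real call otherwise (sketch). */
class ReplayInterceptor {
    /** Hypothetical header name; the actual header is not given in the source. */
    static final String REPLAY_HEADER = "X-Replay";

    // Recorded results keyed by a per-call record key. Null results would need a sentinel value.
    private final Map<String, Object> recorded = new ConcurrentHashMap<>();

    void load(String recordKey, Object result) {
        recorded.put(recordKey, result);
    }

    /** Assumes header names were normalized by the caller; Map lookup here is case-sensitive. */
    static boolean isReplay(Map<String, String> headers) {
        return "true".equalsIgnoreCase(headers.get(REPLAY_HEADER));
    }

    Object intercept(Map<String, String> headers, String recordKey, Supplier<Object> realCall) {
        if (!isReplay(headers)) {
            return realCall.get(); // normal traffic is untouched
        }
        Object result = recorded.get(recordKey);
        if (result == null) {
            throw new IllegalStateException("No recorded result for " + recordKey);
        }
        return result; // the DB, Redis, or remote service is never invoked
    }
}
```

Failing loudly on a missing record is a design choice of this sketch: silently falling through to the real dependency would make a replay run depend on the state of the replay environment.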
Replay Flow
1. Traffic enters carrying a Trace and the replay flag.
2. The Repeat Agent intercepts DB calls and returns recorded data without hitting the database.
3. Redis calls are replayed the same way.
4. Feign calls are satisfied from recorded responses.
5. Service A's overall response is asserted against the recorded response, turning real traffic into exhaustive regression tests.
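The final assertion step could look something like the sketch below, which compares a recorded response with a replayed one field by field. `ResponseDiff`, the flat `Map` representation, and the set of ignored fields are all assumptions; the source only says the overall response is asserted against the recorded one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Compares a recorded response with a replayed one, field by field (illustrative sketch). */
class ResponseDiff {
    /**
     * Returns one message per mismatching field, in key order.
     * Fields listed in ignored (for example timestamps) are skipped; this is an assumed feature.
     */
    static List<String> diff(Map<String, Object> recorded,
                             Map<String, Object> replayed,
                             Set<String> ignored) {
        Set<String> keys = new TreeSet<>(recorded.keySet());
        keys.addAll(replayed.keySet());
        List<String> mismatches = new ArrayList<>();
        for (String key : keys) {
            if (ignored.contains(key)) {
                continue;
            }
            Object expected = recorded.get(key);
            Object actual = replayed.get(key);
            if (!Objects.equals(expected, actual)) {
                mismatches.add(key + ": expected=" + expected + ", actual=" + actual);
            }
        }
        return mismatches;
    }
}
```

An empty list means the replayed build reproduced the recorded behaviour; any entry points to a field whose value changed between the recorded run and the build under test.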
Future Work
The team plans to support customizable replay chains (skipping steps, editing data), more flexible response assertions, integration with JaCoCo for coverage, and integration with SkyWalking for replaying online issues.
Recruitment
The Membership Purchase team is hiring engineers to continue this work.
Source: High Availability Architecture official account.
