Building LLRepeater: HuoLala’s Scalable Java Traffic Replay Platform
Facing rapid growth and increasingly complex service chains, HuoLala’s tech team built LLRepeater, a Java‑based traffic replay platform that records, stores, and replays production traffic to improve test coverage, reduce manual effort, and enhance system stability. This article covers its architecture, core functions, challenges, and future plans.
1. Background and Challenges
HuoLala is a fast‑growing company. As business expands, the technical team faces rapid iteration and increasingly complex service chains, putting pressure on service stability and efficiency. Traditional interface‑level automation was costly, had limited coverage, and was hard to standardize, prompting the need for a traffic‑replay platform that can automatically generate rich test scenarios from massive traffic and reduce manual effort.
2. LLRepeater
Traffic replay records real production traffic and replays it in a test environment to verify code logic. Open‑source tools such as jvm‑sandbox‑repeater, GoReplay, and TcpXCopy exist, but because HuoLala’s stack is Java‑centric and many companies (vivo, Dewu, KuJiaLe) use jvm‑sandbox‑repeater, the team chose it as the foundation for their LLRepeater platform.
The following sections describe the design, implementation, challenges, and results of the traffic‑replay system.
2.1 Platform Architecture Design
The platform consists of five major components:
Plugin Management – handles plugin install, uninstall, heartbeat, start, freeze, and configuration push, interacting with the jvm‑sandbox agent.
Traffic Center – builds a traffic library, offering query, storage, analysis, and construction functions to empower automation and stress‑test models.
Recording Service – responsible for traffic recording, storage, tagging, and filtering.
Replay Service – manages real‑time, manual, automatic replay, replay rate, and result aggregation.
Comparison Service – provides noise‑reduction configuration, result comparison, and diff analysis.
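The comparison step can be illustrated with a minimal sketch. It assumes responses have already been flattened into field/value maps and that the noise‑reduction configuration is simply a set of field names to ignore (for example timestamps or trace IDs). The class and method names are hypothetical and are not taken from LLRepeater or jvm‑sandbox‑repeater.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from LLRepeater or jvm-sandbox-repeater.
class NoiseAwareComparator {

    /**
     * Returns the fields whose values differ between the recorded and the
     * replayed response, skipping every field listed in noiseFields.
     * Each entry maps a field name to "recorded -> replayed".
     */
    static Map<String, String> diff(Map<String, Object> recorded,
                                    Map<String, Object> replayed,
                                    Set<String> noiseFields) {
        // Union of keys, sorted so the diff output is deterministic.
        Set<String> keys = new TreeSet<>(recorded.keySet());
        keys.addAll(replayed.keySet());

        Map<String, String> differences = new LinkedHashMap<>();
        for (String key : keys) {
            if (noiseFields.contains(key)) {
                continue; // configured noise, e.g. timestamps or trace IDs
            }
            Object before = recorded.get(key);
            Object after = replayed.get(key);
            if (!Objects.equals(before, after)) {
                differences.put(key, before + " -> " + after);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        Map<String, Object> recorded = Map.of("orderId", 42, "price", 100, "timestamp", 1000L);
        Map<String, Object> replayed = Map.of("orderId", 42, "price", 120, "timestamp", 2000L);
        // Only "price" is reported; "timestamp" is treated as noise.
        System.out.println(diff(recorded, replayed, Set.of("timestamp")));
    }
}
```

A real comparison service would likely work on nested structures and learn noise fields from repeated replays; the flat map here only shows the idea of excluding configured fields before diffing.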
2.2 Core Function Implementation
Core functions are divided into three parts:
Mount – the console issues mount commands to target services, triggering jvm‑sandbox‑repeater mounting.
Record – the agent records traffic and forwards it to the console, which pushes it to a Kafka topic; the recording service consumes the messages and stores them in Elasticsearch.
Replay – the console initiates replay, forwards results to a Kafka topic, the replay service consumes them, invokes the comparison service for noise reduction, and stores outcomes in Elasticsearch and a database.
2.3 Platform Challenges
2.3.1 Underlying Plugin Adaptation
HuoLala’s diverse business lines require continuous plugin adaptation. The table below lists supported component types and their recording/replay capabilities.
http – request – ✅ – ✅ – ✅
rpc – custom RPC – ✅ – ✅ – ✅
apollo – service config – ✅ – ✅ – ❌
feign – easyOpen – ✅ – ✅ – ❌
ibatis – ibatis – ✅ – ✅ – ❌
mybatis – mybatis – ✅ – ✅ – ❌
redis – redis – ✅ – ✅ – ❌
lala‑redis – Lala redis – ✅ – ✅ – ❌
guava‑cache – guava cache – ✅ – ✅ – ❌
ehcache – ehcache – ✅ – ✅ – ❌
caffeine‑cache – caffeine cache – ✅ – ✅ – ❌
kafka – kafka – ✅ – ✅ – ✅
2.3.2 Frequent Full GC
Early adoption caused frequent Full GC and service timeouts due to:
Repeated sub‑calls (e.g., Apollo, Guava cache) appearing more than ten times per flow, creating large objects in memory.
Oversized serialized objects from non‑business code such as service discovery.
Optimizations applied:
Record Apollo and Guava repeated requests only once.
Skip non‑business requests that generate excessively large serialized payloads.
Monitor the recording process and degrade it when exceptions occur.
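The first two optimizations can be sketched as a per‑flow guard that decides whether a sub‑call is kept. This is a simplified illustration under stated assumptions: one recorder instance per flow, a sub‑call identified by plugin name plus an identity string, and a single byte limit applied to every payload (the source limits this to non‑business requests). All names and the threshold are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: class, field, and threshold names are illustrative assumptions.
class SubCallRecorder {
    private final int maxPayloadBytes;
    private final Set<String> dedupedPlugins;              // plugins recorded once per identity
    private final Set<String> seenKeys = new HashSet<>();  // identities already recorded in this flow
    private final List<String> recorded = new ArrayList<>();

    SubCallRecorder(int maxPayloadBytes, Set<String> dedupedPlugins) {
        this.maxPayloadBytes = maxPayloadBytes;
        this.dedupedPlugins = dedupedPlugins;
    }

    /** Returns true if the sub-call was kept for this flow. */
    boolean record(String plugin, String identity, String serializedPayload) {
        int size = serializedPayload.getBytes(StandardCharsets.UTF_8).length;
        if (size > maxPayloadBytes) {
            return false; // oversized serialized payload: skip it
        }
        String key = plugin + ":" + identity;
        if (dedupedPlugins.contains(plugin) && !seenKeys.add(key)) {
            return false; // repeated call for a deduplicated plugin: already recorded once
        }
        recorded.add(key);
        return true;
    }

    List<String> recorded() {
        return recorded;
    }

    public static void main(String[] args) {
        SubCallRecorder recorder = new SubCallRecorder(1024, Set.of("apollo", "guava-cache"));
        recorder.record("apollo", "app.timeout", "3000");
        recorder.record("apollo", "app.timeout", "3000");   // skipped as a repeat
        recorder.record("mybatis", "selectOrder", "{}");
        recorder.record("http", "discovery", "x".repeat(2048)); // skipped as oversized
        System.out.println(recorder.recorded());
    }
}
```

Keeping the guard cheap matters here, because it runs inside the recorded service and any extra allocation adds to the GC pressure the optimizations are meant to relieve.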
2.3.3 Uneven Traffic Recording and Filtering Difficulty
Recording all traffic on high‑QPS services can hurt performance, and many of the resulting recordings are duplicates. To address this, the platform supports interface‑level sampling configuration, which takes priority over the global sampling ratio and allows ratios to be adjusted flexibly. Each flow is also tagged on multiple dimensions, so users can filter out exactly the traffic they need.
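A minimal sketch of the sampling rule described above, assuming rates are expressed as probabilities between 0.0 and 1.0 and interfaces are keyed by a string such as a URI. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: names and the rate representation are assumptions.
class RecordingSampler {
    private final double globalRate;
    private final Map<String, Double> interfaceRates;

    RecordingSampler(double globalRate, Map<String, Double> interfaceRates) {
        this.globalRate = globalRate;
        this.interfaceRates = interfaceRates;
    }

    /** An interface-level rate, when configured, takes priority over the global rate. */
    double effectiveRate(String interfaceName) {
        return interfaceRates.getOrDefault(interfaceName, globalRate);
    }

    /** Decides whether a single request should be recorded. */
    boolean shouldRecord(String interfaceName) {
        // nextDouble() is in [0, 1), so a rate of 0.0 never records and 1.0 always records.
        return ThreadLocalRandom.current().nextDouble() < effectiveRate(interfaceName);
    }

    public static void main(String[] args) {
        RecordingSampler sampler = new RecordingSampler(0.01, Map.of("/order/create", 0.5));
        System.out.println(sampler.effectiveRate("/order/create")); // interface-level override
        System.out.println(sampler.effectiveRate("/order/list"));   // falls back to the global rate
    }
}
```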
2.3.4 High Cost of Replay Troubleshooting
Complex recording and replay pipelines make debugging difficult, especially for flows with hundreds of sub‑calls. Improvements include:
Integration with the company’s monitoring chain to view the full trace of recording and replay.
Agent optimizations that keep sub‑calls sorted in order, with match tags shown in the replay details.
Batch result aggregation by failure reason to reduce duplicate investigations.
Recommended failure‑analysis strategies and mentorship programs.
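The batch aggregation idea can be sketched as grouping failed flows by their failure reason, so that each reason is investigated once rather than once per flow. The result type, its fields, and the reason strings below are hypothetical, and every failed result is assumed to carry a non‑null reason.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: the result model and reason strings are illustrative assumptions.
class FailureAggregator {

    static final class ReplayResult {
        final String flowId;
        final boolean passed;
        final String failureReason; // expected to be non-null whenever passed is false

        ReplayResult(String flowId, boolean passed, String failureReason) {
            this.flowId = flowId;
            this.passed = passed;
            this.failureReason = failureReason;
        }
    }

    /** Groups the IDs of failed flows by failure reason, with reasons sorted alphabetically. */
    static Map<String, List<String>> groupFailures(List<ReplayResult> results) {
        return results.stream()
                .filter(r -> !r.passed)
                .collect(Collectors.groupingBy(
                        r -> r.failureReason,
                        TreeMap::new,
                        Collectors.mapping(r -> r.flowId, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<ReplayResult> results = List.of(
                new ReplayResult("flow-1", false, "sub-call mismatch"),
                new ReplayResult("flow-2", true, null),
                new ReplayResult("flow-3", false, "sub-call mismatch"),
                new ReplayResult("flow-4", false, "response diff"));
        System.out.println(groupFailures(results));
    }
}
```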
2.4 Platform Effects
The platform focuses on low entry cost, actionable insights, diversified capabilities, and intelligence.
Advantages:
Low onboarding cost – a newcomer can complete integration, configuration, recording, and replay within ~30 minutes.
Convenient and efficient use – multi‑rule tagging, intelligent noise reduction, and result aggregation lower troubleshooting effort.
Result transparency – integration with coverage metrics and automatic report generation.
Personalized support – supports timed, real‑time, and manual replay modes.
Key Features:
Automatic Replay – for read‑only interfaces, manual or scheduled replay uses real recorded traffic, achieving near‑zero testing manpower with low false‑positive rates.
Link‑Level Replay – combines single‑service recording with global trace capabilities to replay entire service chains, already proven in a major core‑fulfilment upgrade.
External Traffic Input – supports ingesting traffic from outside the platform and replaying it, with flexible language support (currently Java, with plans for other languages).
3. Practice and Deployment
Despite the platform’s capabilities, many companies struggle to adopt traffic replay due to high expertise requirements and sensitivity to code changes. HuoLala applied a step‑by‑step approach to reduce tester burden and increase system stability.
3.1 Non‑Mock Replay
Non‑mock replay is unaffected by code changes, but it requires recording and replay to run in the same environment and share storage, with no data modification. The platform supports Java‑to‑Java and PHP‑to‑Java modes, with manual or automatic replay. To date, more than 5,000 replays covering 600,000 flows have intercepted over 200 issues.
3.2 Mock Replay
Mock replay is vulnerable to code changes and cannot validate middleware, which makes it best suited to stable business lines. It has been integrated into pre‑release and regression pipelines, with more than 1,000 replays over 100,000 flows catching more than 10 issues.
3.3 Link‑Level Replay
Link‑level replay requires higher integration effort due to service‑level instrumentation but fills gaps of single‑service replay. It has been used in a major core fulfilment chain upgrade, intercepting over 180 issues and ensuring robustness.
4. Future Plans
After iterative optimization, the three replay modes have delivered measurable benefits. Future goals focus on efficiency, process control, intelligence, and expansion.
4.1 Precise Intelligent Traffic Selection
Leverage the existing precision platform to automatically identify impacted methods and interfaces for targeted replay, reducing resource and labor costs.
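One way such a selection could work is sketched below. It assumes the precision platform can supply, for each interface, the set of methods it reaches, and that a code change is available as a set of changed method names. Both inputs, and all names in the code, are assumptions for illustration rather than a description of the planned feature.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: the input shapes and names are assumptions, not the planned design.
class ImpactedInterfaceSelector {

    /** Returns the interfaces that reach at least one changed method, in sorted order. */
    static Set<String> select(Map<String, Set<String>> interfaceToMethods,
                              Set<String> changedMethods) {
        Set<String> impacted = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : interfaceToMethods.entrySet()) {
            // disjoint() is true when the two sets share no element.
            if (!Collections.disjoint(entry.getValue(), changedMethods)) {
                impacted.add(entry.getKey());
            }
        }
        return impacted;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> coverage = Map.of(
                "/order/create", Set.of("OrderService.create", "PriceService.quote"),
                "/order/list", Set.of("OrderService.list"));
        // Only traffic for interfaces touching the changed method needs to be replayed.
        System.out.println(select(coverage, Set.of("PriceService.quote")));
    }
}
```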
4.2 CI/CD Integration
Integrate replay workflows into the company’s entry‑exit pipelines to ensure consistent execution without manual variance.
4.3 Intelligent Analysis
Build a knowledge base from replay comparisons and apply AI techniques to assist users in problem analysis, lowering the usage barrier.
4.4 Multi‑Language Support
Extend LLRepeater beyond Java to support languages such as Go and PHP, broadening its applicability across business domains.