How Havok Enables Realistic Full‑Link Load Testing for Scalable Services
This article explains how the Havok full‑link load testing platform was designed and built to replay real traffic safely, provide capacity‑assessment data, support multiple test types, and offer real‑time monitoring and circuit‑breaker protection for large‑scale online services.
1. Background
As the company's transaction volume grows and the business expands to serve millions of merchants, occasional failures have caused poor user experience and significant losses.
Why do offline tests still result in online issues after launch?
Can we support upcoming promotional activities?
Can we reduce online IT costs despite limited business growth?
These common concerns are addressed by full‑link load testing.
2. Solution
2.1 Traditional Online Load Testing
Before full‑link testing, we performed online load testing by:
Preparing test data in the production environment and simulating requests against a single service or cluster.
Using an Nginx mirror to generate traffic.
This approach suffers from:
Time‑consuming data preparation.
Pollution of production databases with dirty data.
Manually built test models leading to inaccurate results.
Narrow coverage limited to core services.
Inability to cover infrastructure such as SLB, Nginx, the network, and databases.
2.2 Current Solution
Based on the above issues and the specific needs of the business, we designed and built the full‑link load testing platform Havok. Its primary goals are to generate realistic, safe replay traffic and provide accurate capacity‑assessment data.
Realistic user‑behavior replay
Continuously replay real traffic without polluting production data, invisible to users.
Rate and multiplier amplification
Scale traffic by predefined rates or multipliers to probe capacity.
“Out‑of‑the‑box” testing
Start tests on demand without extensive data preparation, while keeping production data untouched.
Support for multiple test types
HTTP API testing, internal RPC testing, and testing over special mobile protocols.
Real‑time monitoring and overload protection
Collect monitoring data during tests and automatically stop tests based on predefined rules.
3. System Architecture
We replay production service logs, controlling request timing based on timestamps to achieve high‑fidelity testing.
Havok-dispatcher: downloads, sorts, time-controls, and dispatches requests, and collects monitoring data.
Havok-replayer: replays requests from the dispatcher, with support for traffic amplification and rule adjustments.
Havok-monitor: aggregates and displays data from the test engine, services, and middleware.
Havok-mock: provides mock services.
Havok-canal: performs real-time incremental shadow-data cleaning.
4. Core Module Functions
4.1 Dispatcher Core
Handles log extraction and request dispatch, supporting multiple data sources, dimension filtering, ordered log distribution, amplification, monitoring, and engine management.
Why use log replay?
Given an order‑creation API POST /api/order, replaying real logs automatically provides realistic request scenarios without manually constructing diverse test data.
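As an illustration, here is a minimal Go sketch of turning one recorded log line into a replayable request. The JSON field names and the LogEntry type are assumptions for the example, not Havok's actual log schema:

```go
package havok

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// LogEntry is a hypothetical shape for one access-log record; real
// field names depend on the service's logging format.
type LogEntry struct {
	TimestampMs int64  `json:"timestamp_ms"`
	Method      string `json:"method"`
	Path        string `json:"path"`
	Body        string `json:"body"`
}

// buildRequest turns one recorded log line into a replayable request
// against the target environment.
func buildRequest(line []byte, baseURL string) (*http.Request, error) {
	var e LogEntry
	if err := json.Unmarshal(line, &e); err != nil {
		return nil, fmt.Errorf("parse log line: %w", err)
	}
	return http.NewRequest(e.Method, baseURL+e.Path, bytes.NewReader([]byte(e.Body)))
}
```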
Achieving high fidelity
We control replay speed using timestamps via the timewheel module, supporting fast‑forward and slow‑motion playback.
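A minimal sketch of timestamp-paced replay with a speed factor. This is a simplified stand-in for the timewheel module, and the Event type is a hypothetical stand-in for a parsed log entry:

```go
package havok

import "time"

// Event pairs a recorded timestamp (ms) with an opaque payload;
// a stand-in for one parsed log entry.
type Event struct {
	TimestampMs int64
	Payload     []byte
}

// replayPaced dispatches events at intervals derived from their original
// timestamps. speed > 1 fast-forwards, speed < 1 plays in slow motion.
func replayPaced(events []Event, speed float64, send func(Event)) {
	if len(events) == 0 {
		return
	}
	start := time.Now()
	base := events[0].TimestampMs
	for _, e := range events {
		// Rescale this event's offset from the first event by the speed factor.
		offset := time.Duration(float64(e.TimestampMs-base) / speed * float64(time.Millisecond))
		if d := time.Until(start.Add(offset)); d > 0 {
			time.Sleep(d)
		}
		send(e)
	}
}
```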
Amplification capability
Requests can be multiplied to simulate traffic growth, and idempotency is handled either by Havok's custom keyword offset or by the service itself. An idempotent operation satisfies f(x) = f(f(x)), so replaying the same request repeatedly leaves the system in the state a single request would; a sketch of keyword offsetting follows.
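A hedged sketch of keyword offsetting under amplification; the "order_id" field name and the suffix scheme are illustrative assumptions, not Havok's real configuration:

```go
package havok

import (
	"encoding/json"
	"fmt"
)

// amplify produces n copies of one recorded request body, offsetting a
// configured keyword field per copy so duplicated writes stay unique.
func amplify(body []byte, keyword string, n int) ([][]byte, error) {
	out := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		var m map[string]interface{}
		if err := json.Unmarshal(body, &m); err != nil {
			return nil, err
		}
		if v, ok := m[keyword].(string); ok {
			// Suffix each replica's keyword so downstream writes do not collide.
			m[keyword] = fmt.Sprintf("%s-havok-%d", v, i)
		}
		copyBody, err := json.Marshal(m)
		if err != nil {
			return nil, err
		}
		out = append(out, copyBody)
	}
	return out, nil
}
```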
4.2 Test Engine Core
Distributed container deployment
Containerization enables rapid scaling of the test engine.
Asynchronous messaging
Implemented with Go goroutines, which give the engine:
Context switching in user space, with no kernel-mode overhead.
A small memory footprint; each goroutine stack starts at roughly 2 KB.
The G‑M‑P scheduling model, which multiplexes goroutines onto OS threads.
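A minimal sketch of the goroutine fan-out this enables; the worker-pool shape is an assumption about the engine's internals, not its actual code:

```go
package havok

import (
	"net/http"
	"sync"
)

// runWorkers fans requests out to a pool of goroutines. Because each
// goroutine is cheap, a single engine instance can keep thousands of
// requests in flight.
func runWorkers(reqs <-chan *http.Request, workers int) {
	var wg sync.WaitGroup
	client := &http.Client{}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range reqs {
				resp, err := client.Do(req)
				if err != nil {
					continue // the real engine would count this toward the error rate
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```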
Request/response field filtering
Unified handling of sensitive data, offsetting per rules, and custom assertions for responses.
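A sketch of what such filtering and assertion hooks might look like; the phone-number pattern, the mask value, and the status-code rule are all illustrative assumptions:

```go
package havok

import (
	"fmt"
	"regexp"
)

// phoneRe matches 11-digit mobile numbers; a real deployment would make
// the pattern configurable.
var phoneRe = regexp.MustCompile(`\b1\d{10}\b`)

// maskSensitive replaces phone numbers in a payload before replay.
func maskSensitive(payload []byte) []byte {
	return phoneRe.ReplaceAll(payload, []byte("13800000000"))
}

// assertResponse is a minimal custom assertion: here it only checks the
// status code, though rules could also inspect response fields.
func assertResponse(status int, body []byte) error {
	if status >= 500 {
		return fmt.Errorf("server error: status %d", status)
	}
	return nil
}
```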
Interface‑level metrics
Collect error rate, throughput, P95, etc., and report to the dispatcher.
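A sketch of per-window aggregation, assuming a nearest-rank percentile; the Snapshot shape is hypothetical, not the dispatcher's actual wire format:

```go
package havok

import (
	"sort"
	"time"
)

// Snapshot is one per-interface reporting window, matching the kinds of
// figures the engine sends to the dispatcher.
type Snapshot struct {
	Requests  int
	Errors    int
	P95       time.Duration
	ErrorRate float64
}

// summarize folds one window of latency samples and an error count into
// a snapshot, using the nearest-rank method for the percentile.
func summarize(latencies []time.Duration, errors int) Snapshot {
	s := Snapshot{Requests: len(latencies), Errors: errors}
	if len(latencies) == 0 {
		return s
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	idx := (len(latencies)*95 + 99) / 100 // ceil(0.95 * n), nearest-rank
	s.P95 = latencies[idx-1]
	s.ErrorRate = float64(errors) / float64(len(latencies))
	return s
}
```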
Dispatcher event handling
Process start, stop, and circuit‑breaker commands from the dispatcher.
4.3 Data Construction
Traditional load testing requires extensive data preparation, leading to imbalanced samples, insufficient volume, and long build times.
We built a custom incremental sync service based on Alibaba canal to provide real‑time shadow data cleaning, enabling on‑demand testing.
Sensitive fields (phone, ID) are uniformly masked.
Merchant, store, and terminal IDs are offset according to rules without affecting production usage, as sketched below.
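A sketch of one cleaning rule applied to an incremental row; the column names ("phone", "merchant_id") and the fixed offset are illustrative, since real rules are configured per table:

```go
package havok

import "strconv"

// cleanRow rewrites one incremental row before it is written to the
// shadow store.
func cleanRow(row map[string]string) map[string]string {
	out := make(map[string]string, len(row))
	for k, v := range row {
		out[k] = v
	}
	// Uniformly mask sensitive fields.
	if _, ok := out["phone"]; ok {
		out["phone"] = "13800000000"
	}
	// Offset merchant IDs into a reserved range so shadow data can never
	// collide with live identifiers.
	if id, err := strconv.ParseInt(out["merchant_id"], 10, 64); err == nil {
		out["merchant_id"] = strconv.FormatInt(id+1_000_000_000, 10)
	}
	return out
}
```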
4.4 Mock Third‑Party Services
We use a self‑developed mock service DeepMock that supports latency jitter and can be tuned via post‑test statistical analysis.
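A minimal sketch of a latency-jittered mock handler; the uniform-jitter model and the handler shape are assumptions, since DeepMock's tuning is driven by post-test statistics:

```go
package havok

import (
	"math/rand"
	"net/http"
	"time"
)

// jitterHandler mimics a third-party dependency: it sleeps for a base
// latency plus uniform jitter before answering with a canned body.
func jitterHandler(base, jitter time.Duration, body []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		delay := base
		if jitter > 0 {
			delay += time.Duration(rand.Int63n(int64(jitter)))
		}
		time.Sleep(delay)
		w.Write(body)
	}
}
```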
4.5 Monitoring
Load testing must not impact production, so we implement second‑level monitoring and rapid circuit‑breaker reactions.
Client‑side monitoring
The engine aggregates per‑interface metrics (error rate, throughput, P90, P95, avg latency) each second and reports to the dispatcher.
Server‑side monitoring
We rely on existing cloud monitoring facilities for middleware metrics.
4.6 Test Isolation
To ensure safety, we tag test traffic with a key:value identifier that propagates via context, allowing services and middleware to recognize and isolate test traffic.
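A sketch of header-to-context propagation for the test tag; the header name X-Havok-Test is a hypothetical convention, not a documented Havok key:

```go
package havok

import (
	"context"
	"net/http"
)

// headerKey is the tag carried by every test request.
const headerKey = "X-Havok-Test"

type ctxKey struct{}

// tagMiddleware lifts the test flag from the inbound header into the
// request context so downstream code (and outbound clients) can see it.
func tagMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get(headerKey) != "" {
			r = r.WithContext(context.WithValue(r.Context(), ctxKey{}, true))
		}
		next.ServeHTTP(w, r)
	})
}

// isTestTraffic tells storage and RPC layers whether to route to shadow
// resources.
func isTestTraffic(ctx context.Context) bool {
	v, _ := ctx.Value(ctxKey{}).(bool)
	return v
}
```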
4.7 Data Isolation
We prevent test‑generated writes from contaminating production data using shadow tables, shadow databases, and data offset strategies for MySQL, MongoDB, Redis, Kafka/MQ, and Elasticsearch.
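A sketch of shadow-table routing keyed on the test flag, reusing isTestTraffic from the previous sketch; the "_shadow" suffix is an assumed naming convention:

```go
package havok

import "context"

// tableFor routes writes to a shadow table when the request is tagged
// as test traffic, leaving production tables untouched.
func tableFor(ctx context.Context, table string) string {
	if isTestTraffic(ctx) {
		return table + "_shadow"
	}
	return table
}
```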
4.8 Circuit‑Breaker Protection
Client‑side circuit breaker
Havok analyzes monitoring data in real time and can lower QPS or stop the test based on configured thresholds.
Server‑side circuit breaker
Production services implement their own circuit‑breaker logic via middleware, automatically tripping on error‑rate thresholds.
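A sketch of threshold evaluation on the dispatcher side, reusing the hypothetical Snapshot type from the metrics sketch; the thresholds are placeholders for per-test-plan configuration:

```go
package havok

// Action is what the dispatcher tells the engines to do after each
// monitoring window.
type Action int

const (
	Continue Action = iota
	ReduceQPS
	Stop
)

// evaluate applies configured thresholds to the latest window.
func evaluate(s Snapshot) Action {
	switch {
	case s.ErrorRate > 0.05: // hard limit: abort the test
		return Stop
	case s.ErrorRate > 0.01: // soft limit: back off before things worsen
		return ReduceQPS
	default:
		return Continue
	}
}
```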
5. Test Execution
Core business lines such as store‑code payment, scan‑pay, and mini‑program payment have been onboarded. Havok is open‑source and welcomes contributions.
6. Summary and Outlook
The project progressed from inception to production with strong support from R&D and business teams. Future work will focus on improving usability, visual operation, capacity planning, cost optimization, and integration with chaos testing.
6.1 Improving Usability
We aim to invest in visual tools to make the platform more user‑friendly and enable “one‑click” testing.
6.2 Load Testing and SLA Building
Key questions include precise capacity assessment, resource optimization, cost reduction, and alignment with company‑wide chaos testing.