How Havok Enables Realistic Full‑Link Load Testing for Scalable Services
This article explains how the Havok full‑link load testing platform was designed and built to replay real traffic safely, provide capacity‑assessment data, support multiple test types, and offer real‑time monitoring and circuit‑breaker protection for large‑scale online services.
1. Background
As the company's transaction volume grows and the business expands to serve millions of merchants, occasional failures have caused poor user experience and significant losses.
Why do offline tests still result in online issues after launch?
Can we support upcoming promotional activities?
Can we reduce online IT costs despite limited business growth?
These common concerns are addressed by full‑link load testing.
2. Solution
2.1 Traditional Online Load Testing
Before full‑link testing, we performed online load testing by:
Preparing test data in the production environment and simulating requests against a single service or cluster.
Using an Nginx mirror to generate traffic.
This approach suffers from:
Time‑consuming data preparation.
Pollution of production databases with dirty data.
Manually built test models leading to inaccurate results.
Narrow coverage limited to core services.
Inability to cover infrastructure such as SLB, Nginx, the network, and databases.
2.2 Current Solution
Based on the above issues and the specific needs of the business, we designed and built the full‑link load testing platform Havok. Its primary goals are to generate realistic, safe replay traffic and provide accurate capacity‑assessment data.
Realistic user‑behavior replay
Continuously replay real traffic without polluting production data, invisible to users.
Rate and multiplier amplification
Scale traffic by predefined rates or multipliers to probe capacity.
“Out‑of‑the‑box” testing
Start tests on demand without extensive data preparation, while keeping production data untouched.
Support for multiple test types
HTTP API testing, internal RPC testing, and testing over special mobile protocols.
Real‑time monitoring and overload protection
Collect monitoring data during tests and automatically stop tests based on predefined rules.
3. System Architecture
We replay production service logs, controlling request timing based on timestamps to achieve high‑fidelity testing.
Havok-dispatcher: downloads, sorts, time-controls, and dispatches requests, and collects monitoring data.
Havok-replayer: replays requests from the dispatcher, with support for traffic amplification and rule adjustments.
Havok-monitor: aggregates and displays data from the test engine, services, and middleware.
Havok-mock: provides mock services.
Havok-canal: performs real-time incremental shadow-data cleaning.
4. Core Module Functions
4.1 Dispatcher Core
Handles log extraction and request dispatch, supporting multiple data sources, dimension filtering, ordered log distribution, amplification, monitoring, and engine management.
Why use log replay?
Given an order‑creation API POST /api/order, replaying real logs automatically provides realistic request scenarios without manually constructing diverse test data.
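As an illustration, here is a minimal Go sketch of turning one recorded log line into a replayable request. The JSON field names and the LogEntry type are assumptions for the example, not Havok's actual log schema:

```go
package havok

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// LogEntry is a hypothetical shape for one access-log record; real
// field names depend on the service's logging format.
type LogEntry struct {
	TimestampMs int64  `json:"timestamp_ms"`
	Method      string `json:"method"`
	Path        string `json:"path"`
	Body        string `json:"body"`
}

// buildRequest turns one recorded log line into a replayable request
// against the target environment.
func buildRequest(line []byte, baseURL string) (*http.Request, error) {
	var e LogEntry
	if err := json.Unmarshal(line, &e); err != nil {
		return nil, fmt.Errorf("parse log line: %w", err)
	}
	return http.NewRequest(e.Method, baseURL+e.Path, bytes.NewReader([]byte(e.Body)))
}
```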
Achieving high fidelity
We control replay speed using timestamps via the timewheel module, supporting fast‑forward and slow‑motion playback.
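A minimal sketch of timestamp-paced replay with a speed factor. This is a simplified stand-in for the timewheel module, and the Event type is a hypothetical stand-in for a parsed log entry:

```go
package havok

import "time"

// Event pairs a recorded timestamp (ms) with an opaque payload;
// a stand-in for one parsed log entry.
type Event struct {
	TimestampMs int64
	Payload     []byte
}

// replayPaced dispatches events at intervals derived from their original
// timestamps. speed > 1 fast-forwards, speed < 1 plays in slow motion.
func replayPaced(events []Event, speed float64, send func(Event)) {
	if len(events) == 0 {
		return
	}
	start := time.Now()
	base := events[0].TimestampMs
	for _, e := range events {
		// Rescale this event's offset from the first event by the speed factor.
		offset := time.Duration(float64(e.TimestampMs-base) / speed * float64(time.Millisecond))
		if d := time.Until(start.Add(offset)); d > 0 {
			time.Sleep(d)
		}
		send(e)
	}
}
```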
Amplification capability
Requests can be multiplied to simulate traffic growth, and idempotency is handled either by Havok's custom keyword offset or by the service itself. An idempotent operation satisfies f(x) = f(f(x)), so replaying the same request repeatedly leaves the system in the state a single request would; a sketch of keyword offsetting follows.
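A hedged sketch of keyword offsetting under amplification; the "order_id" field name and the suffix scheme are illustrative assumptions, not Havok's real configuration:

```go
package havok

import (
	"encoding/json"
	"fmt"
)

// amplify produces n copies of one recorded request body, offsetting a
// configured keyword field per copy so duplicated writes stay unique.
func amplify(body []byte, keyword string, n int) ([][]byte, error) {
	out := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		var m map[string]interface{}
		if err := json.Unmarshal(body, &m); err != nil {
			return nil, err
		}
		if v, ok := m[keyword].(string); ok {
			// Suffix each replica's keyword so downstream writes do not collide.
			m[keyword] = fmt.Sprintf("%s-havok-%d", v, i)
		}
		copyBody, err := json.Marshal(m)
		if err != nil {
			return nil, err
		}
		out = append(out, copyBody)
	}
	return out, nil
}
```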
4.2 Test Engine Core
Distributed container deployment
Containerization enables rapid scaling of the test engine.
Asynchronous messaging
Implemented with Go goroutines, which give the engine:
Context switching in user space, with no kernel-mode overhead.
A small memory footprint; each goroutine stack starts at roughly 2 KB.
The G‑M‑P scheduling model, which multiplexes goroutines onto OS threads.
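A minimal sketch of the goroutine fan-out this enables; the worker-pool shape is an assumption about the engine's internals, not its actual code:

```go
package havok

import (
	"net/http"
	"sync"
)

// runWorkers fans requests out to a pool of goroutines. Because each
// goroutine is cheap, a single engine instance can keep thousands of
// requests in flight.
func runWorkers(reqs <-chan *http.Request, workers int) {
	var wg sync.WaitGroup
	client := &http.Client{}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range reqs {
				resp, err := client.Do(req)
				if err != nil {
					continue // the real engine would count this toward the error rate
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```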
Request/response field filtering
Unified handling of sensitive data, offsetting per rules, and custom assertions for responses.
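A sketch of what such filtering and assertion hooks might look like; the phone-number pattern, the mask value, and the status-code rule are all illustrative assumptions:

```go
package havok

import (
	"fmt"
	"regexp"
)

// phoneRe matches 11-digit mobile numbers; a real deployment would make
// the pattern configurable.
var phoneRe = regexp.MustCompile(`\b1\d{10}\b`)

// maskSensitive replaces phone numbers in a payload before replay.
func maskSensitive(payload []byte) []byte {
	return phoneRe.ReplaceAll(payload, []byte("13800000000"))
}

// assertResponse is a minimal custom assertion: here it only checks the
// status code, though rules could also inspect response fields.
func assertResponse(status int, body []byte) error {
	if status >= 500 {
		return fmt.Errorf("server error: status %d", status)
	}
	return nil
}
```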
Interface‑level metrics
Collect error rate, throughput, P95, etc., and report to the dispatcher.
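A sketch of per-window aggregation, assuming a nearest-rank percentile; the Snapshot shape is hypothetical, not the dispatcher's actual wire format:

```go
package havok

import (
	"sort"
	"time"
)

// Snapshot is one per-interface reporting window, matching the kinds of
// figures the engine sends to the dispatcher.
type Snapshot struct {
	Requests  int
	Errors    int
	P95       time.Duration
	ErrorRate float64
}

// summarize folds one window of latency samples and an error count into
// a snapshot, using the nearest-rank method for the percentile.
func summarize(latencies []time.Duration, errors int) Snapshot {
	s := Snapshot{Requests: len(latencies), Errors: errors}
	if len(latencies) == 0 {
		return s
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	idx := (len(latencies)*95 + 99) / 100 // ceil(0.95 * n), nearest-rank
	s.P95 = latencies[idx-1]
	s.ErrorRate = float64(errors) / float64(len(latencies))
	return s
}
```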
Dispatcher event handling
Process start, stop, and circuit‑breaker commands from the dispatcher.
4.3 Data Construction
Traditional load testing requires extensive data preparation, leading to imbalanced samples, insufficient volume, and long build times.
We built a custom incremental sync service based on Alibaba canal to provide real‑time shadow data cleaning, enabling on‑demand testing.
Sensitive fields (phone, ID) are uniformly masked.
Merchant, store, and terminal IDs are offset according to rules without affecting production usage, as sketched below.
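A sketch of one cleaning rule applied to an incremental row; the column names ("phone", "merchant_id") and the fixed offset are illustrative, since real rules are configured per table:

```go
package havok

import "strconv"

// cleanRow rewrites one incremental row before it is written to the
// shadow store.
func cleanRow(row map[string]string) map[string]string {
	out := make(map[string]string, len(row))
	for k, v := range row {
		out[k] = v
	}
	// Uniformly mask sensitive fields.
	if _, ok := out["phone"]; ok {
		out["phone"] = "13800000000"
	}
	// Offset merchant IDs into a reserved range so shadow data can never
	// collide with live identifiers.
	if id, err := strconv.ParseInt(out["merchant_id"], 10, 64); err == nil {
		out["merchant_id"] = strconv.FormatInt(id+1_000_000_000, 10)
	}
	return out
}
```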
4.4 Mock Third‑Party Services
We use a self‑developed mock service DeepMock that supports latency jitter and can be tuned via post‑test statistical analysis.
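A minimal sketch of a latency-jittered mock handler; the uniform-jitter model and the handler shape are assumptions, since DeepMock's tuning is driven by post-test statistics:

```go
package havok

import (
	"math/rand"
	"net/http"
	"time"
)

// jitterHandler mimics a third-party dependency: it sleeps for a base
// latency plus uniform jitter before answering with a canned body.
func jitterHandler(base, jitter time.Duration, body []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		delay := base
		if jitter > 0 {
			delay += time.Duration(rand.Int63n(int64(jitter)))
		}
		time.Sleep(delay)
		w.Write(body)
	}
}
```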
4.5 Monitoring
Load testing must not impact production, so we implement second‑level monitoring and rapid circuit‑breaker reactions.
Client‑side monitoring
The engine aggregates per‑interface metrics (error rate, throughput, P90, P95, avg latency) each second and reports to the dispatcher.
Server‑side monitoring
We rely on existing cloud monitoring facilities for middleware metrics.
4.6 Test Isolation
To ensure safety, we tag test traffic with a key:value identifier that propagates via context, allowing services and middleware to recognize and isolate test traffic.
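A sketch of header-to-context propagation for the test tag; the header name X-Havok-Test is a hypothetical convention, not a documented Havok key:

```go
package havok

import (
	"context"
	"net/http"
)

// headerKey is the tag carried by every test request.
const headerKey = "X-Havok-Test"

type ctxKey struct{}

// tagMiddleware lifts the test flag from the inbound header into the
// request context so downstream code (and outbound clients) can see it.
func tagMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get(headerKey) != "" {
			r = r.WithContext(context.WithValue(r.Context(), ctxKey{}, true))
		}
		next.ServeHTTP(w, r)
	})
}

// isTestTraffic tells storage and RPC layers whether to route to shadow
// resources.
func isTestTraffic(ctx context.Context) bool {
	v, _ := ctx.Value(ctxKey{}).(bool)
	return v
}
```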
4.7 Data Isolation
We prevent test‑generated writes from contaminating production data using shadow tables, shadow databases, and data offset strategies for MySQL, MongoDB, Redis, Kafka/MQ, and Elasticsearch.
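A sketch of shadow-table routing keyed on the test flag, reusing isTestTraffic from the previous sketch; the "_shadow" suffix is an assumed naming convention:

```go
package havok

import "context"

// tableFor routes writes to a shadow table when the request is tagged
// as test traffic, leaving production tables untouched.
func tableFor(ctx context.Context, table string) string {
	if isTestTraffic(ctx) {
		return table + "_shadow"
	}
	return table
}
```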
4.8 Circuit‑Breaker Protection
Client‑side circuit breaker
Havok analyzes monitoring data in real time and can lower QPS or stop the test based on configured thresholds.
Server‑side circuit breaker
Production services implement their own circuit‑breaker logic via middleware, automatically tripping on error‑rate thresholds.
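A sketch of threshold evaluation on the dispatcher side, reusing the hypothetical Snapshot type from the metrics sketch; the thresholds are placeholders for per-test-plan configuration:

```go
package havok

// Action is what the dispatcher tells the engines to do after each
// monitoring window.
type Action int

const (
	Continue Action = iota
	ReduceQPS
	Stop
)

// evaluate applies configured thresholds to the latest window.
func evaluate(s Snapshot) Action {
	switch {
	case s.ErrorRate > 0.05: // hard limit: abort the test
		return Stop
	case s.ErrorRate > 0.01: // soft limit: back off before things worsen
		return ReduceQPS
	default:
		return Continue
	}
}
```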
5. Test Execution
Core business lines such as store‑code payment, scan‑pay, and mini‑program payment have been onboarded. Havok is open‑source and welcomes contributions.
6. Summary and Outlook
The project progressed from inception to production with strong support from R&D and business teams. Future work will focus on improving usability, visual operation, capacity planning, cost optimization, and integration with chaos testing.
6.1 Improving Usability
We aim to invest in visual tools to make the platform more user‑friendly and enable “one‑click” testing.
6.2 Load Testing and SLA Building
Key questions include precise capacity assessment, resource optimization, cost reduction, and alignment with company‑wide chaos testing.