Design and Practice of a Full-Link Load Testing Platform
This article describes the motivation, core design, technical choices, data and traffic isolation mechanisms, and implementation steps of a self‑developed full‑link load testing platform that enables production‑environment testing, reduces machine costs, and improves system stability and performance monitoring.
As business scale grows, ensuring system stability becomes critical, yet traditional load testing suffers from high cost, an inability to exercise production interfaces directly, and a lack of historical data for comparison. To address these issues, a self‑developed full‑link load testing platform was built to test production interfaces directly, save resources, monitor node health, and quickly identify weak points.
What is full‑link load testing? It simulates massive user traffic against real production scenarios, covering traffic recording, replay, and pressure generation. Its advantages include realistic request scenarios, significant machine cost savings, comprehensive link monitoring, and rapid problem discovery.
Technical selection – two main approaches were evaluated: traffic marking and machine marking. Traffic marking isolates data at DB, cache, and MQ layers using shadow resources, while machine marking deploys separate machines and resources. Traffic marking was chosen for its maturity (used by Meituan and Alibaba), lower cost, and easier integration with existing middleware.
Platform core design
1. Overall architecture – includes a control center (brain) for task creation, configuration, and reporting, and a pressure engine (duckpear‑engine) comprising kafka‑replay, goreplay, and vegeta.
2. Components – vegeta (customized Go‑based load generator), goreplay (HTTP traffic recorder/replayer), and kafka‑replay (Kafka write performance tester). Each component was extended to support rate control, parameter construction, result assertions, and Prometheus monitoring.
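The rate control and result assertions described above can be sketched as follows. This is a minimal illustration, not the platform's actual engine code: the function names, the fixed-interval pacing model, and the p99/error-rate thresholds are all assumptions for the example.

```python
def paced_schedule(qps, duration_s):
    """Return monotonic send offsets (seconds) for a fixed-QPS attack.

    A sketch of the rate control the article describes adding to the
    pressure engine: one request every 1/qps seconds, the same pacing
    model constant-rate generators such as vegeta use.
    """
    interval = 1.0 / qps
    n = int(qps * duration_s)
    return [i * interval for i in range(n)]

def results_pass(latencies_ms, errors, p99_budget_ms, max_error_rate):
    """Result assertion in the spirit of the platform's report checks:
    the run passes only if p99 latency and error rate stay in budget."""
    lat = sorted(latencies_ms)
    p99 = lat[int(len(lat) * 0.99) - 1] if lat else 0
    error_rate = errors / max(len(latencies_ms), 1)
    return p99 <= p99_budget_ms and error_rate <= max_error_rate
```

A 10 QPS, 2-second schedule yields 20 send offsets spaced 100 ms apart; the assertion helper would then gate the generated report on the configured budgets.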
Data isolation – achieved via shadow databases, shadow Redis keys, and shadow Kafka topics, ensuring test traffic does not affect production data.
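A shadow-resource router can be sketched as below. The article only states that shadow databases, shadow Redis keys, and shadow Kafka topics are used; the specific naming conventions (`_shadow` suffix, `shadow:` prefix) are hypothetical stand-ins.

```python
# Hypothetical naming conventions for shadow resources.
SHADOW_DB_SUFFIX = "_shadow"
SHADOW_KEY_PREFIX = "shadow:"
SHADOW_TOPIC_SUFFIX = "_shadow"

def route_db(name, is_stress):
    """Stress traffic writes to the shadow database, never production."""
    return name + SHADOW_DB_SUFFIX if is_stress else name

def route_redis_key(key, is_stress):
    """Prefixing keys keeps shadow cache entries out of production reads."""
    return SHADOW_KEY_PREFIX + key if is_stress else key

def route_topic(topic, is_stress):
    """Shadow topics isolate test messages from production consumers."""
    return topic + SHADOW_TOPIC_SUFFIX if is_stress else topic
```

In practice this routing lives inside the data-access middleware, so business code stays unchanged whether a request is real or shadow traffic.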
Traffic isolation – integrated into the PIE framework with traffic identification, circuit breaking, mock services, and routing to shadow resources based on request headers.
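The header-based identification, mocking, and circuit-breaking steps above might look like the following sketch. The header name `X-Stress-Test` and the dispatch logic are assumptions for illustration; the article does not specify how the PIE framework marks traffic.

```python
# Assumed marker header; the actual header used by the PIE framework
# is not named in the article.
STRESS_HEADER = "x-stress-test"

def is_stress(headers):
    """Traffic identification: detect the stress-test marker header."""
    return headers.get(STRESS_HEADER, "").lower() in ("1", "true")

def dispatch(headers, real_call, mock_call, breaker_open=False):
    """Route a downstream call for one request.

    Real traffic goes to the real dependency. Stress traffic is sent to
    a mock service (so third parties never see test load) and is shed
    entirely when the circuit breaker is open.
    """
    if not is_stress(headers):
        return real_call()
    if breaker_open:
        raise RuntimeError("stress traffic rejected: circuit breaker open")
    return mock_call()
```

The same marker would also drive the shadow-resource routing at the storage layer, so one header propagated through the call chain isolates the entire link.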
Platform core functions
• Kafka replay – configurable thread count, rate, and offset- or time-based replay with visual reporting.
• Traffic recording and replay – HTTP flow capture to COS, configurable machines, duration, filters, and replay speed.
• Interface testing – supports serial and parallel execution, thread and QPS control, parameter templating, and report generation.
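Of the functions above, the offset- or time-based Kafka replay start point can be sketched as a pure function. This is an illustration of the semantics, not the platform's code: `records` stands in for one partition's log, and the time-based branch mirrors Kafka's `offsetsForTimes` behavior (first record at or after the requested timestamp).

```python
import bisect

def replay_start_offset(records, *, offset=None, since_ts=None):
    """Pick the offset to start replaying from.

    records: list of (offset, timestamp_ms) pairs sorted by offset, with
    non-decreasing timestamps, standing in for a partition's log.
    Either an explicit offset or a start timestamp is given.
    """
    if offset is not None:
        return offset  # offset-based replay: start exactly here
    timestamps = [ts for _, ts in records]
    i = bisect.bisect_left(timestamps, since_ts)
    if i == len(records):
        return None  # no record at or after the requested time
    return records[i][0]
```

A real implementation would ask the broker for this mapping per partition rather than scanning the log, then hand the resulting offsets to the configured replay threads at the configured rate.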
Pressure engine scaling – distributed architecture allows adding machines to SFNS, publishing services via a unified platform, and mapping engines to test targets for horizontal scaling.
Full‑link testing implementation – divided into pre‑test (data preparation, configuration checks, risk assessment), during‑test (monitoring node metrics and aborting on anomalies), and post‑test (result analysis, bottleneck identification, and goal verification).
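The during-test "abort on anomalies" guard can be sketched as a threshold check over node metrics. The metric names and limits below are illustrative assumptions; the article does not list the platform's actual abort thresholds.

```python
def breached_limits(metrics, limits):
    """Return the names of metrics that exceed their configured limits.

    metrics/limits: dicts such as {"cpu": 0.95, "error_rate": 0.02,
    "p99_ms": 800}. During a test, a monitor loop samples node metrics,
    calls this check, and aborts the run when it returns a non-empty list.
    """
    return [name for name, limit in limits.items()
            if metrics.get(name, 0) > limit]
```

Tying the abort decision to explicit limits makes the "stop on anomaly" step auditable: the post-test report can record exactly which threshold ended the run.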
In summary, the platform has been deployed in several business lines, reducing testing barriers and resource consumption, though challenges remain in large‑scale data preparation; future work will focus on streamlining data setup and further enhancing the platform.
Beijing SF i-TECH City Technology Team
Official tech channel of Beijing SF i-TECH City. A publishing platform for technology innovation, practical implementation, and frontier tech exploration.