Transforming Load Testing at ZTO: From Offline Pitfalls to Safe Full‑Chain Online Testing
This article traces ZTO's move from traditional offline and online load testing, each with clear shortcomings, to a full‑chain performance testing solution. The solution uses JavaAgent probes, shadow resources, and a structured deployment and verification process to make production testing safe and accurate.
Background
ZTO's performance testing historically used both online and offline load‑testing methods, each with distinct drawbacks that complicated testing workflows.
Issues with Offline Load Testing
Offline testing relied on proportional scaling of CPU and memory (e.g., 10:1 ratio) to mimic production environments. This approach distorted results for high‑TPS scenarios because network, middleware, and database factors could not be scaled proportionally.
Issues with Online Load Testing
Online testing focused mainly on read interfaces, requiring developers to modify code to prevent test data from contaminating production data. This manual hard‑coding increased cost, risk, and resistance to online testing, leading many teams to prefer offline testing despite its inaccuracies.
Introducing the Technical Solution
ZTO adopted an online full‑chain load‑testing product that injects a JavaAgent probe via bytecode manipulation, eliminating the need for code changes. The agent identifies test traffic, routes it to shadow resources, and isolates test data from production data, achieving two core functions: traffic coloring and data safety.
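The two core functions can be pictured as a per-request flag plus a router that consults it. The following is a minimal sketch under stated assumptions, not the Pradar agent's actual implementation: the class names TrafficContext and ResourceRouter and the resource names in the usage note are hypothetical, and the real agent does this through bytecode instrumentation rather than explicit calls.

```java
import java.util.function.Supplier;

// Hypothetical names throughout; this is not the Pradar agent's API.
final class TrafficContext {
    // Per-thread flag marking whether the current request is load-test traffic.
    private static final ThreadLocal<Boolean> TEST_TRAFFIC =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    private TrafficContext() {}

    static boolean isTestTraffic() {
        return TEST_TRAFFIC.get();
    }

    // Runs the action with the flag set, then restores the previous value.
    static <T> T callAs(boolean testTraffic, Supplier<T> action) {
        boolean previous = TEST_TRAFFIC.get();
        TEST_TRAFFIC.set(testTraffic);
        try {
            return action.get();
        } finally {
            TEST_TRAFFIC.set(previous);
        }
    }
}

final class ResourceRouter<R> {
    private final R production;
    private final R shadow;

    ResourceRouter(R production, R shadow) {
        this.production = production;
        this.shadow = shadow;
    }

    // Test traffic gets the shadow resource; everything else gets production.
    R current() {
        return TrafficContext.isTestTraffic() ? shadow : production;
    }
}
```

For example, with `new ResourceRouter<>("orders_db", "PT_orders_db")`, calling `current()` inside `TrafficContext.callAs(true, ...)` yields the shadow name, while a plain call yields the production one.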
Figure: Load‑testing traffic flow diagram.
Full‑Chain Load Testing Deployment & Core Config
Agent installation steps include uploading pradar-agent.zip to the target server, extracting it, and modifying the application startup script to add the following JVM argument before -jar:
-javaagent:/home/admin/pradar-agent/agent/pradar-core-ext-bootstrap-1.0.0.jar -Dpradar.project.name=APP_NAME
After restarting the application, verify successful reporting in the Pradar web console.
Shadow Resource Configuration
For Redis, the agent prefixes keys with PT_. For MQ, shadow topics and consumer groups are created with the same PT_ prefix via the ZMS configuration center. Shadow databases and tables are configured by adding PT_ to the original names, as shown in the diagrams.
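The naming rule for shadow resources is simple enough to show directly. This is an illustrative sketch only: the ShadowNames helper and its methods are hypothetical, and the choice to leave already-prefixed names unchanged is an assumption, not documented agent behavior.

```java
// Hypothetical helper illustrating the PT_ naming convention.
final class ShadowNames {
    static final String PREFIX = "PT_";

    private ShadowNames() {}

    // Adds the prefix once; names that already carry it are returned unchanged
    // (an assumption made here to avoid double-prefixing).
    static String shadowOf(String name) {
        return name.startsWith(PREFIX) ? name : PREFIX + name;
    }

    // Picks the shadow or the original name depending on the traffic type.
    static String resolve(String name, boolean testTraffic) {
        return testTraffic ? shadowOf(name) : name;
    }
}
```

The same function applies to a Redis key, an MQ topic, a consumer group, or a table name, which is what makes a single prefix convention convenient to audit.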
Mock (Shield) Configuration
Sensitive operations such as payment deductions or SMS sending are protected by mock implementations. When the probe detects test traffic, it executes the mock code instead of the real business logic, preventing unintended side effects.
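The shielding idea can be sketched as a wrapper that chooses between the real call and a mock. The ShieldedCall class below is hypothetical and takes the traffic flag as an explicit parameter for clarity; in the probe-based setup the switch is injected by the agent, so application code would not contain such a wrapper.

```java
import java.util.function.Function;

// Hypothetical wrapper: runs the mock for test traffic, the real logic otherwise.
final class ShieldedCall<A, R> {
    private final Function<A, R> real;
    private final Function<A, R> mock;

    ShieldedCall(Function<A, R> real, Function<A, R> mock) {
        this.real = real;
        this.mock = mock;
    }

    // For test traffic the real function is never invoked, so its side effects
    // (a payment deduction, an outgoing SMS) cannot occur.
    R invoke(A argument, boolean testTraffic) {
        return testTraffic ? mock.apply(argument) : real.apply(argument);
    }
}
```

A typical mock returns a canned success response so that downstream steps of the link still run under load.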
Link Integration & Testing Process
The full‑chain testing process consists of three stages: (1) Requirement definition and link mapping, (2) Test environment deployment and execution, and (3) Online testing and result generation.
Requirement Definition & Link Mapping
Key activities include defining business scope, performance goals (TPS, RT, success rate, SA), and collecting detailed information about applications, databases, caches, middleware, and potential sensitive impacts. This information guides shadow resource setup and mock configuration.
Testing Environment Debugging
Test traffic is marked with User-Agent:PerfomanceTest for HTTP or p-pradar-cluster-test:true for Dubbo. After confirming isolation in the test environment, the team proceeds to online testing.
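Recognizing the two markers could look roughly like the sketch below. The header name, header value, and Dubbo key are taken from the text above (including the spelling PerfomanceTest); the TestTrafficDetector class, the plain Map representation of headers and attachments, and the substring match on the User-Agent value are assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical detector for the two test-traffic markers named in the article.
final class TestTrafficDetector {
    static final String HTTP_HEADER = "User-Agent";
    static final String HTTP_MARKER = "PerfomanceTest"; // spelling as given in the article
    static final String DUBBO_KEY = "p-pradar-cluster-test";

    private TestTrafficDetector() {}

    // HTTP header names are case-insensitive, so the lookup scans the entries.
    // Matching by substring is an assumption; an exact match may be intended.
    static boolean isHttpTestTraffic(Map<String, String> headers) {
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            String value = entry.getValue();
            if (HTTP_HEADER.equalsIgnoreCase(entry.getKey())
                    && value != null && value.contains(HTTP_MARKER)) {
                return true;
            }
        }
        return false;
    }

    // Dubbo attachments are modeled here as a simple string map.
    static boolean isDubboTestTraffic(Map<String, String> attachments) {
        return "true".equalsIgnoreCase(attachments.get(DUBBO_KEY));
    }
}
```

During test-environment debugging, sending one request with the marker and one without is a quick way to confirm that only the marked request reaches shadow resources.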
Online Testing & Result Generation
Preparation includes creating shadow databases, topics, and configuring the Pradar web console. A rollout plan is documented, followed by gray‑scale verification, full rollout, and finally the online test execution using JMeter scripts with ramp‑up (step‑up) mode.
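Step-up mode raises concurrency in stages rather than all at once. The arithmetic behind an evenly stepped plan can be sketched as follows; the StepRampUp class and the even-step rule are illustrative assumptions, and actual JMeter configuration is done in the test plan rather than in code like this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: target concurrency for each step of an even ramp-up.
final class StepRampUp {
    private StepRampUp() {}

    static List<Integer> stepTargets(int maxThreads, int steps) {
        if (maxThreads <= 0 || steps <= 0) {
            throw new IllegalArgumentException("maxThreads and steps must be positive");
        }
        List<Integer> targets = new ArrayList<>();
        for (int i = 1; i <= steps; i++) {
            // Ceiling of maxThreads * i / steps, computed in long to avoid overflow.
            targets.add((int) (((long) maxThreads * i + steps - 1) / steps));
        }
        return targets;
    }
}
```

With 100 threads over 4 steps the targets are 25, 50, 75, and 100. Holding each step long enough for metrics to stabilize makes it easier to see at which level TPS stops growing or RT starts to climb.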
Result Analysis
Test reports display performance metrics and leak detection results. Leak detection monitors whether test data is inadvertently written to production tables, using binlog listeners to trigger alerts.
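The check a binlog listener might apply to each row change can be sketched as below. The article does not say how test data is recognized, so this sketch assumes that test rows carry at least one value starting with PT_; that marker rule, the RowChange type, and the LeakDetector class are all hypothetical.

```java
import java.util.List;

// Hypothetical, simplified view of one row change read from the binlog.
final class RowChange {
    final String table;
    final List<String> values;

    RowChange(String table, List<String> values) {
        this.table = table;
        this.values = values;
    }
}

final class LeakDetector {
    static final String PREFIX = "PT_";

    private LeakDetector() {}

    // A leak, under the assumption above: a write to a non-shadow table
    // whose row contains a value carrying the test marker.
    static boolean isLeak(RowChange change) {
        if (change.table.startsWith(PREFIX)) {
            return false; // writes to shadow tables are expected
        }
        for (String value : change.values) {
            if (value != null && value.startsWith(PREFIX)) {
                return true;
            }
        }
        return false;
    }
}
```

A listener would raise an alert whenever `isLeak` returns true, so the run can be stopped and the affected rows cleaned up.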
Thoughts on Full‑Chain Testing Practice
Since adopting the probe‑based approach, ZTO has successfully supported 62 applications across major sales events, solving many previous issues while introducing new challenges such as increased coordination effort.
Organization & Work Mode Issues
Two organizational models are compared: a company‑level project with top‑down drive versus a department‑level project led by the performance testing team, each requiring cross‑functional collaboration.
Other Issues
Low automation in probe version control and load‑generator provisioning.
Manual, offline process for configuration review and approval.
Limited reuse of test scripts and data.
Absence of automated baseline comparison and visual analytics for test results.
Conclusion
ZTO's full‑chain load testing has eliminated scaling‑induced inaccuracies, but managing shadow resources across dozens of applications remains complex and resource‑intensive, requiring ongoing process improvements and performance‑focused innovation initiatives.
Zhongtong Tech
Integrating industry and information for digital efficiency, advancing Zhongtong Express's high-quality development through digitalization. This is the public channel of Zhongtong's tech team, delivering internal tech insights, product news, job openings, and event updates. Stay tuned!
