Designing an Effective Full‑Link Load‑Testing Strategy for High‑Traffic Systems
This guide explains how to plan, configure, and execute full‑link performance testing—including network architecture, testing objectives, environment isolation, platform setup, various load‑generation methods, monitoring, and post‑test analysis—to ensure reliable, scalable services under heavy traffic.
Network Architecture
The first step is to understand how business requests flow through the network, so that every node along the path can be identified as a potential testing target and its performance examined.
A typical architecture routes user requests through the CDN, TTGW (a high-performance L4 load balancer), TLB (an L7 load balancer), AGW (the API gateway), and finally to the business service (PSM).
Testing Objectives and Plans
Before any full-link test, testers must define clear objectives so that an appropriate plan can be chosen; this improves efficiency and saves time in the later phases of testing.
Typical scenarios include capability verification, capacity planning, performance tuning, defect reproduction, and benchmark comparison, each with specific goals and characteristics.
Testing Targets
Based on the network architecture above, targets are selected according to the testing purpose: bandwidth testing starting from the CDN/TLB, business-logic testing starting from the AGW and downstream services, or asynchronous-message testing starting from MQ producers.
Environment Isolation
In-house BOE environments are not used for load testing because of resource constraints and their differences from production; tests instead run in online or PPE environments, with test data strictly isolated from real data.
Testing Markers
A special field stress_tag marks test traffic, allowing the framework to differentiate and handle it appropriately across services.
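To make the mechanism concrete, here is a minimal sketch of how such a marker might be carried as a request header and a context value so every hop on the chain can recognize and forward it. The header name mirrors the stress_tag field above, but the helper functions are illustrative, not Rhino's actual API.

```go
// Sketch only: the header name "stress_tag" matches the article's field
// name, but these helper functions are hypothetical, not Rhino's real API.
package stress

import (
	"context"
	"net/http"
)

type ctxKey struct{}

const headerName = "stress_tag"

// FromRequest extracts the stress marker from an incoming request and
// stores it in the context so downstream calls can propagate it.
func FromRequest(r *http.Request) context.Context {
	return context.WithValue(r.Context(), ctxKey{}, r.Header.Get(headerName))
}

// Inject copies the marker from the context onto an outgoing request,
// keeping the tag alive across the whole call chain.
func Inject(ctx context.Context, out *http.Request) {
	if tag, _ := ctx.Value(ctxKey{}).(string); tag != "" {
		out.Header.Set(headerName, tag)
	}
}

// IsStress reports whether the current request is test traffic.
func IsStress(ctx context.Context) bool {
	tag, _ := ctx.Value(ctxKey{}).(string)
	return tag != ""
}
```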
Testing Switch
A global switch stored in etcd (key = psm/cluster) controls whether test traffic is accepted; each service cluster has its own switch, and traffic is rejected if the switch is off.
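A rough sketch of the switch check, using the public etcd v3 Go client; only the per-PSM, per-cluster key layout comes from the description above, while the exact key prefix and the "on"/"off" value format are assumptions.

```go
// Sketch: checks a per-cluster switch in etcd before accepting test
// traffic. The key layout follows the psm/cluster convention from the
// article; the prefix and value format ("on"/"off") are assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func stressAllowed(cli *clientv3.Client, psm, cluster string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	key := fmt.Sprintf("/stress/switch/%s/%s", psm, cluster) // hypothetical prefix
	resp, err := cli.Get(ctx, key)
	if err != nil {
		return false, err
	}
	// Missing key or any value other than "on" means: reject test traffic.
	return len(resp.Kvs) > 0 && string(resp.Kvs[0].Value) == "on", nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ok, err := stressAllowed(cli, "my.psm.service", "default")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("accept stress traffic:", ok)
}
```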
Storage Isolation
Test data is stored separately from production data, often using shadow tables, to prevent pollution of real data while allowing performance evaluation under test conditions.
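As an illustration of the shadow-table idea, the sketch below redirects writes from marked traffic to a parallel table; the `_shadow` suffix and the schema are assumed conventions, not the article's exact scheme.

```go
// Sketch of shadow-table routing: stress traffic is redirected to a
// parallel table so production rows are never touched.
package shadow

import (
	"context"
	"database/sql"
	"fmt"
)

// tableFor picks the real or shadow table based on the stress marker
// carried in the context (see the stress package sketched earlier).
func tableFor(ctx context.Context, base string) string {
	if isStress(ctx) {
		return base + "_shadow" // assumed naming convention
	}
	return base
}

func insertOrder(ctx context.Context, db *sql.DB, userID, amount int64) error {
	q := fmt.Sprintf("INSERT INTO %s (user_id, amount) VALUES (?, ?)",
		tableFor(ctx, "orders"))
	_, err := db.ExecContext(ctx, q, userID, amount)
	return err
}

// isStress stands in for the marker check shown earlier.
func isStress(ctx context.Context) bool {
	tag, _ := ctx.Value("stress_tag").(string)
	return tag != ""
}
```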
Platform Setup
The Rhino platform manages test tasks, data, agents, and results, integrating systems such as Bytemesh, User, Trace, Bytemock, and Bytecopy.
Testing Methods
Fake Traffic
Users construct requests by hand, and Rhino injects the test markers automatically; HTTP and Thrift protocols are supported.
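In essence, a fake-traffic task is a hand-built request fired at a chosen concurrency. The sketch below shows the idea for HTTP, with the marker stamped inline for clarity (on Rhino the platform injects it); the endpoint and payload are made up.

```go
// Sketch of a fake-traffic load loop: hand-constructed HTTP requests sent
// at fixed concurrency, each stamped with the stress marker.
package main

import (
	"log"
	"net/http"
	"strings"
	"sync"
)

func main() {
	const workers = 50
	const requestsPerWorker = 200

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{}
			for i := 0; i < requestsPerWorker; i++ {
				req, err := http.NewRequest(http.MethodPost,
					"http://agw.example.internal/api/v1/order", // hypothetical endpoint
					strings.NewReader(`{"sku":"demo","qty":1}`))
				if err != nil {
					log.Fatal(err)
				}
				req.Header.Set("Content-Type", "application/json")
				req.Header.Set("stress_tag", "rhino-task-123") // test marker
				resp, err := client.Do(req)
				if err != nil {
					log.Println("request failed:", err)
					continue
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```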
Custom Plugin (GoPlugin)
Loads Go plugins at runtime to generate requests for custom protocols or complex scenarios; users must add test markers themselves.
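A minimal sketch of the loading side, built on Go's standard plugin package; the exported symbol NewRequest and its signature are assumptions about what such a plugin might expose.

```go
// Sketch of loading a custom load-generation plugin with Go's standard
// "plugin" package. The exported symbol name and signature are assumed;
// the plugin itself must be built with `go build -buildmode=plugin`, and
// the user's plugin code is responsible for adding the stress marker.
package main

import (
	"log"
	"plugin"
)

func main() {
	p, err := plugin.Open("./customproto.so")
	if err != nil {
		log.Fatal(err)
	}
	// Look up a hypothetical generator: func NewRequest() ([]byte, error)
	sym, err := p.Lookup("NewRequest")
	if err != nil {
		log.Fatal(err)
	}
	newRequest, ok := sym.(func() ([]byte, error))
	if !ok {
		log.Fatal("NewRequest has an unexpected signature")
	}
	payload, err := newRequest()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("generated %d bytes for the custom protocol", len(payload))
}
```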
Traffic Recording & Replay
Captures live traffic, rewrites it into test requests, and replays it, preserving real request distribution.
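The core of the rewrite step can be sketched as: clone the captured request, retarget the host, and stamp the marker, while keeping method, path, headers, and body so the real traffic distribution survives. The Recorded type below is illustrative, not the platform's actual record format.

```go
// Sketch of turning a recorded production request into a replayable test
// request: keep method/path/headers/body (preserving the real request
// distribution), but retarget the host and stamp the stress marker.
package replay

import (
	"bytes"
	"net/http"
)

// Recorded is a simplified, hypothetical view of one captured request.
type Recorded struct {
	Method string
	Path   string
	Header http.Header
	Body   []byte
}

// ToTestRequest rewrites a recorded request for replay against the target.
func ToTestRequest(rec Recorded, targetHost, tag string) (*http.Request, error) {
	req, err := http.NewRequest(rec.Method, "http://"+targetHost+rec.Path,
		bytes.NewReader(rec.Body))
	if err != nil {
		return nil, err
	}
	for k, vs := range rec.Header {
		for _, v := range vs {
			req.Header.Add(k, v)
		}
	}
	req.Header.Set("stress_tag", tag) // mark as test traffic
	return req, nil
}
```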
Traffic Scheduling
Adjusts Consul weights to steer live (unmarked) traffic toward a target instance, while monitoring service metrics and stopping once thresholds are reached.
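Conceptually, the scheduling step re-registers the target instance with a heavier weight so the balancer sends it a larger share of live traffic. The sketch below uses the Consul Go API; the instance ID and weight values are examples, and a real run would ramp weights gradually while watching error rate and latency.

```go
// Sketch of traffic scheduling via Consul weights: re-register the target
// instance with a higher Passing weight so more live traffic lands on it.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	agent := client.Agent()
	svcs, err := agent.Services()
	if err != nil {
		log.Fatal(err)
	}
	svc, ok := svcs["my.psm.service-1"] // hypothetical instance ID
	if !ok {
		log.Fatal("target instance not registered on this agent")
	}

	// Re-register the same instance with a heavier weight; the other
	// instances keep the default, so this one absorbs more traffic.
	err = agent.ServiceRegister(&api.AgentServiceRegistration{
		ID:      svc.ID,
		Name:    svc.Service,
		Address: svc.Address,
		Port:    svc.Port,
		Weights: &api.AgentWeights{Passing: 10, Warning: 1},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("weight raised; monitor metrics and roll back at thresholds")
}
```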
Monitoring
Real‑Time Monitoring
Provides second‑level client‑side metrics to quickly detect anomalies during a test.
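One simple way to get second-level client metrics is to bucket results by wall-clock second on the load generator, as in this sketch:

```go
// Sketch of second-granularity client-side metrics: results are bucketed
// by wall-clock second so anomalies show up within seconds, not minutes.
package metrics

import (
	"sync"
	"time"
)

type bucket struct {
	count  int64
	errors int64
	total  time.Duration
}

// Collector aggregates per-second request stats on the load generator.
type Collector struct {
	mu      sync.Mutex
	buckets map[int64]*bucket // unix second -> stats
}

func NewCollector() *Collector {
	return &Collector{buckets: make(map[int64]*bucket)}
}

// Record files one finished request under the current second.
func (c *Collector) Record(latency time.Duration, failed bool) {
	sec := time.Now().Unix()
	c.mu.Lock()
	defer c.mu.Unlock()
	b, ok := c.buckets[sec]
	if !ok {
		b = &bucket{}
		c.buckets[sec] = b
	}
	b.count++
	b.total += latency
	if failed {
		b.errors++
	}
}
```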
Server‑Side Monitoring
Aggregates service, machine, and downstream metrics in Grafana (with a planned migration to Argos) for full-link visibility.
MS Alarm Monitoring
Stops tests automatically when service or downstream alarms trigger, preventing production incidents.
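The auto-stop pattern amounts to a watchdog that cancels the test's context as soon as an alarm fires. In the sketch below, checkAlarms is a hypothetical stand-in for querying the real alarm system:

```go
// Sketch of alarm-driven auto-stop: a goroutine polls an alarm source and
// cancels the test's context when anything fires.
package main

import (
	"context"
	"log"
	"time"
)

// checkAlarms is hypothetical; a real implementation would query the
// alarm platform for active alerts on the service and its downstreams.
func checkAlarms() bool { return false }

func runWithAutoStop(run func(ctx context.Context)) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		ticker := time.NewTicker(5 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if checkAlarms() {
					log.Println("alarm fired: stopping the load test")
					cancel()
					return
				}
			}
		}
	}()

	run(ctx) // the load loop must honor ctx cancellation
}

func main() {
	runWithAutoStop(func(ctx context.Context) {
		// Placeholder load loop: exits when cancelled or when work ends.
		select {
		case <-ctx.Done():
		case <-time.After(30 * time.Second):
		}
	})
}
```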
Analysis & Optimization
After testing, analysis draws on client/server monitoring, the Nemo performance-analysis platform (pprof for Go/Python), and system-level tracing. Common findings include CPU spikes, goroutine leaks, memory leaks, database write bottlenecks, Redis timeouts, and CPU that stays underutilized even as load rises.
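For the Go services on the link, the pprof data that platforms like Nemo consume comes from the standard net/http/pprof endpoints, which a service can expose like this (the port is just an example):

```go
// Enabling Go's standard pprof endpoints: the blank import has side
// effects that register handlers under /debug/pprof/ on the default mux.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers
)

func main() {
	// In a real service this listener runs alongside the business ports.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A CPU profile can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` and inspected for the hot paths behind CPU spikes or goroutine leaks.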