
Designing an Effective Full‑Link Load‑Testing Strategy for High‑Traffic Systems

This guide explains how to plan, configure, and execute full‑link performance testing. It covers network architecture, testing objectives, environment isolation, platform setup, load‑generation methods, monitoring, and post‑test analysis, with the goal of keeping services reliable and scalable under heavy traffic.

ByteDance SE Lab

Network Architecture

The first step is to understand how business requests flow through the network, so that every node on the path can be identified as a testing target and its performance measured.

A typical architecture routes user requests through CDN, TTGW (a high‑performance L4 load balancer), TLB (L7 load balancer), AGW (API gateway), and finally to the business service PSM.

Testing Objectives and Plans

Before any full‑link test, testers must define clear objectives. The objectives determine which test plan is appropriate, which improves efficiency and saves time in later testing phases.

Typical scenarios include capability verification, capacity planning, performance tuning, defect reproduction, and benchmark comparison, each with specific goals and characteristics.

Testing Targets

Based on the network diagram, targets are selected according to the testing purpose, such as bandwidth testing from CDN/TLB, business‑logic testing from AGW and services, or asynchronous message testing from MQ producers.
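As a rough illustration of selecting targets from the network diagram, the sketch below models the synchronous request path as an ordered list and returns every node at or below a chosen entry point. The node names come from the architecture described earlier; the function name `nodes_under_test` is hypothetical, and asynchronous paths such as MQ producers are not modeled here.

```python
from typing import List

# Hops in the order the article lists them for a user request.
REQUEST_PATH = ["CDN", "TTGW", "TLB", "AGW", "PSM"]


def nodes_under_test(entry: str) -> List[str]:
    """Return the entry node plus every hop downstream of it."""
    if entry not in REQUEST_PATH:
        raise ValueError(f"unknown entry point: {entry}")
    return REQUEST_PATH[REQUEST_PATH.index(entry):]
```

For example, a business‑logic test that enters at AGW would exercise `nodes_under_test("AGW")`, while a bandwidth test entering at CDN would cover the whole path.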

Environment Isolation

In‑house BOE environments are not used for testing due to resource constraints and differences from production; tests run on online or PPE environments, with strict isolation of test data from real data.

Testing Markers

A special field stress_tag marks test traffic, allowing the framework to differentiate and handle it appropriately across services.
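A minimal sketch of how such a marker might be carried from one service to the next is shown below. It assumes the marker travels in a per‑request metadata dictionary; only the field name `stress_tag` comes from the text, while the helper names and the dictionary transport are illustrative assumptions rather than the framework's actual API.

```python
from typing import Dict, Optional

STRESS_TAG_KEY = "stress_tag"  # field name taken from the article


def is_stress_request(metadata: Dict[str, str]) -> bool:
    """True when the request carries a non-empty test marker."""
    return bool(metadata.get(STRESS_TAG_KEY))


def downstream_metadata(
    incoming: Dict[str, str], extra: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Build metadata for a downstream call, carrying the marker forward."""
    outgoing = dict(extra or {})
    tag = incoming.get(STRESS_TAG_KEY)
    if tag:
        # Propagate the marker so downstream services can also tell
        # test traffic apart from real traffic.
        outgoing[STRESS_TAG_KEY] = tag
    return outgoing
```

Propagating the marker on every hop is what lets isolation decisions be made deep in the call chain, not only at the entry point.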

Testing Switch

A testing switch stored in etcd controls whether test traffic is accepted. Each service cluster has its own switch (key = psm/cluster), and the cluster rejects test traffic while its switch is off.
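The admission check can be sketched as follows. An in‑memory dictionary stands in for etcd so the example stays self‑contained; the `psm/cluster` key layout comes from the text, whereas the stored values `"on"`/`"off"`, the class `SwitchStore`, and the choice to treat a missing key as off are assumptions made for illustration.

```python
from typing import Dict


class SwitchStore:
    """In-memory stand-in for etcd (assumption: values are "on" or "off")."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}

    def set(self, psm: str, cluster: str, enabled: bool) -> None:
        self._values[f"{psm}/{cluster}"] = "on" if enabled else "off"

    def is_on(self, psm: str, cluster: str) -> bool:
        # A missing key is treated as "off" so test traffic is rejected by default.
        return self._values.get(f"{psm}/{cluster}") == "on"


def admit(store: SwitchStore, psm: str, cluster: str, metadata: Dict[str, str]) -> bool:
    """Always admit real traffic; admit test traffic only if the switch is on."""
    if not metadata.get("stress_tag"):
        return True
    return store.is_on(psm, cluster)
```

Failing closed on a missing key is a design choice in this sketch: a cluster that was never explicitly opened for testing does not receive test traffic.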

Storage Isolation

Test data is stored separately from production data, often using shadow tables, to prevent pollution of real data while allowing performance evaluation under test conditions.
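One way to picture shadow‑table routing is a small helper that picks the table name from the request's marker. The `_stress` suffix and the function name `table_for` are hypothetical; the text only states that test data goes to separate storage such as shadow tables.

```python
from typing import Dict

SHADOW_SUFFIX = "_stress"  # hypothetical naming convention for shadow tables


def table_for(base_table: str, metadata: Dict[str, str]) -> str:
    """Route marked (test) requests to the shadow table, others to the real one."""
    if metadata.get("stress_tag"):
        return base_table + SHADOW_SUFFIX
    return base_table
```

Keeping the decision in one place at the data‑access layer means business code does not need to know whether it is handling test or real traffic.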

Platform Setup

The Rhino platform manages test tasks, data, agents, and results, integrating systems such as Bytemesh, User, Trace, Bytemock, and Bytecopy.

Testing Methods

Fake Traffic

Users construct requests manually, and Rhino injects the test markers. HTTP and Thrift are supported.

Custom Plugin (GoPlugin)

This method loads Go plugins at runtime to generate requests for custom protocols or complex scenarios; users must add the test markers themselves.

Traffic Recording & Replay

This method captures live traffic, rewrites it into test requests, and replays it, preserving the real request distribution.
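The rewrite step could look roughly like the sketch below. It assumes each captured request is stored with its offset from the start of the capture, and that rewriting means adding the test marker while keeping the original order and relative timing. The `Recorded` type, the `speedup` parameter, and carrying the marker in headers are illustrative assumptions, not the platform's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Recorded:
    offset_s: float            # seconds since the start of the capture
    path: str
    headers: Dict[str, str]


def rewrite(records: List[Recorded], tag: str, speedup: float = 1.0) -> List[Recorded]:
    """Return marked copies of the records, ordered by (scaled) capture time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    out = []
    for r in sorted(records, key=lambda rec: rec.offset_s):
        # Copy the headers so the captured originals are never modified.
        headers = {**r.headers, "stress_tag": tag}
        out.append(Recorded(r.offset_s / speedup, r.path, headers))
    return out
```

Scaling every offset by the same factor raises the request rate while leaving the mix of requests unchanged, which is what preserves the recorded distribution.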

Traffic Scheduling

This method adjusts Consul weights to direct more live traffic, which carries no test markers, to a target instance. Service metrics are monitored throughout, and the test stops when thresholds are reached.
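The control loop behind this method can be sketched as a stepwise ramp. The two callbacks are placeholders: `set_weight` stands for whatever call updates the instance weight in Consul, and `read_cpu` for a metrics query. Both are assumptions, as are the step size and the CPU threshold; a real run would also wait between steps for metrics to settle.

```python
from typing import Callable, Optional


def ramp_weight(
    set_weight: Callable[[int], None],
    read_cpu: Callable[[], float],
    start: int = 10,
    step: int = 10,
    max_weight: int = 100,
    cpu_limit: float = 0.8,
) -> Optional[int]:
    """Raise the weight step by step; stop once CPU reaches the limit.

    Returns the last weight that was applied, or None if none was applied.
    """
    applied = None
    weight = start
    while weight <= max_weight:
        set_weight(weight)
        applied = weight
        if read_cpu() >= cpu_limit:
            break  # threshold reached: stop shifting more traffic
        weight += step
    return applied
```

Because the traffic is real user traffic, the stopping threshold in this kind of loop is a safety limit rather than a target to exceed.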

Monitoring

Real‑Time Monitoring

Client‑side metrics are reported at one‑second granularity so that anomalies can be detected quickly during a test.
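To make the one‑second granularity concrete, the sketch below groups client‑side samples into one‑second buckets and computes a few summary values per bucket. The sample layout `(timestamp_s, latency_ms, ok)` and the chosen metrics are assumptions for illustration; the text does not specify which metrics the platform reports.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sample = Tuple[float, float, bool]  # (timestamp in seconds, latency in ms, success flag)


def per_second(samples: Iterable[Sample]) -> Dict[int, Dict[str, float]]:
    """Aggregate samples into one-second buckets keyed by whole second."""
    buckets = defaultdict(list)
    for ts, latency_ms, ok in samples:
        buckets[int(ts)].append((latency_ms, ok))

    result = {}
    for second, items in sorted(buckets.items()):
        latencies = sorted(lat for lat, _ in items)
        result[second] = {
            "qps": len(items),                                 # requests in this second
            "errors": sum(1 for _, ok in items if not ok),     # failed requests
            "max_ms": latencies[-1],
            "p50_ms": latencies[len(latencies) // 2],          # upper median
        }
    return result
```

A dashboard fed by buckets like these can show a latency or error spike within a second or two of its onset.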

Server‑Side Monitoring

Service, machine, and downstream metrics are aggregated in Grafana (Argos in the future) for full‑link visibility.

MS Alarm Monitoring

Tests stop automatically when alarms on the service or its downstream dependencies fire, preventing production incidents.
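The automatic stop can be pictured as an alarm check before each batch of load. In the sketch below, `send_batch` and `alarm_firing` are hypothetical callbacks standing in for the load generator and the alarm system; the text does not describe how the platform actually queries alarms.

```python
from typing import Callable, Tuple


def run_until_alarm(
    send_batch: Callable[[], None],
    alarm_firing: Callable[[], bool],
    max_batches: int,
) -> Tuple[int, str]:
    """Send batches until done or until an alarm fires.

    Returns (number of batches sent, reason the run ended).
    """
    sent = 0
    for _ in range(max_batches):
        # Check before sending so no further load is added once an alarm is active.
        if alarm_firing():
            return sent, "stopped_by_alarm"
        send_batch()
        sent += 1
    return sent, "completed"
```

Checking the alarm before each batch, not after, keeps the extra load added during an incident to at most the batch already in flight.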

Analysis & Optimization

After testing, analysis draws on client‑ and server‑side monitoring, the Nemo performance analysis platform (pprof for Go/Python), and system‑level tracing. Common issues include CPU spikes, goroutine leaks, memory leaks, database write bottlenecks, Redis timeouts, and CPU utilization that stays low under load.
