
Full‑Link Load Testing and Stability Assurance at Gaode: Architecture, Practices, and Future Directions

To keep its service stable for more than 100 million daily users, Gaode combines capacity planning, traffic control, disaster recovery, monitoring, and pre‑plan drills with a self‑built full‑link load‑testing platform, TestPG. The platform replays realistic traffic in production‑like environments, isolates test load from real users, and provides rapid configuration, detailed debugging, automated error capture, and comprehensive reporting. Planned enhancements include integrated topology monitoring, richer pressure models, and confidence evaluation of test scenarios.

Amap Tech

In 2018, Gaode's daily active users (DAU) exceeded 100 million, creating significant challenges for ensuring system stability and reliable service delivery.

The platform consists of thousands of online applications deployed across tens of thousands of machines in multiple data centers nationwide.

Stability Assurance Methods

Five fundamental techniques are employed:

Capacity Planning: Estimate future traffic from historical data and calculate the required resources. The basic formula is:

MachineCount = EstimatedCapacity / SingleMachineCapacity + Buffer

Traffic Control: Apply rate limiting and degradation for traffic that exceeds subsystem capacity.

Disaster Recovery: Switch traffic to backup data centers when catastrophic failures occur.

Monitoring: Comprehensive real‑time monitoring and early warning of anomalies.

Pre‑plan Drills: Conduct full‑scale simulations (e.g., network cuts, power outages) to validate system behavior under disaster scenarios.
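
The capacity-planning formula listed above can be turned into a small calculation. The following is a minimal sketch, not Gaode's actual tooling: the function name, the choice of QPS as the capacity unit, the rounding-up step, and the sample numbers are all assumptions made for illustration.

```python
import math


def machine_count(estimated_capacity_qps: float,
                  single_machine_capacity_qps: float,
                  buffer_machines: int) -> int:
    """MachineCount = EstimatedCapacity / SingleMachineCapacity + Buffer."""
    if single_machine_capacity_qps <= 0:
        raise ValueError("single-machine capacity must be positive")
    # Assumption: round up, because a fractional machine still has to be
    # provisioned as a whole one.
    base = math.ceil(estimated_capacity_qps / single_machine_capacity_qps)
    return base + buffer_machines


# Hypothetical inputs: 120,000 QPS expected, 1,500 QPS per machine, 10 spare machines.
print(machine_count(120_000, 1_500, 10))  # 80 + 10 = 90
```

The formula is only as good as its inputs, and the single-machine capacity in particular is hard to know without measuring it under realistic load.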

Two real‑world incidents (Chinese New Year and May Day) demonstrated that even with thorough planning, unexpected alerts can arise when actual traffic patterns differ from predictions.

Full‑Link Load Testing

Full‑link testing aims to replay realistic traffic in a production‑like environment before traffic peaks. It involves:

Real Traffic: Match volume and characteristics of actual user traffic.

Real Environment: Execute tests directly in the online environment.

Advance Execution: Run tests before traffic surges.

The term combines two ideas: a "full link" (the complete request path) and "load testing" (large‑scale simulation of user behavior).

Challenges

Distributed‑system characteristics such as uncertainty, jitter, and queueing behavior make it difficult to model traffic accurately. For example, throughput is theoretically linear with load (Throughput = f(load)), but in practice jitter caused by network and disk variability breaks this relationship.

Queueing effects cause response time to explode as the system approaches saturation, leading to resource exhaustion.
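
The way response time explodes near saturation can be illustrated with a textbook queueing model. The sketch below assumes a single-server M/M/1 queue, where mean response time equals service time divided by (1 - utilization). That model is an assumption chosen for illustration; the source does not say which queueing model, if any, Gaode uses, and real services are rarely this simple.

```python
def mm1_response_time(service_time_ms: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); the queue is unstable at 1")
    return service_time_ms / (1.0 - utilization)


# Hypothetical 10 ms service time at increasing utilization levels.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    rt = mm1_response_time(10.0, rho)
    print(f"utilization={rho:.2f}  mean response time={rt:8.1f} ms")
```

Under this model a 10 ms request takes about 20 ms at 50% utilization but about 1,000 ms at 99%, which is why a small error in a traffic forecast can matter so much close to the limit.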

Gaode’s business-specific factors (region, terrain, road conditions, network density, season, weather, government activities) further complicate traffic modeling.

Platform Motivation

Existing platforms (e.g., Alibaba’s internal load‑testing platform, Amazon) could not meet Gaode’s cost, flexibility, and visualization requirements, prompting the development of a self‑built testing platform (TestPG).

Design Goals

Ensure scenario realism (protocol support, user‑behavior reconstruction from logs).

Isolate test traffic from production users.

Generate ultra‑high traffic using a distributed JMeter cluster.

Reduce usage and resource costs via rapid provisioning and multi‑tenant resource scheduling.
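
The source does not describe how TestPG isolates test traffic. One common approach, sketched below purely as an illustration, is to mark every test request with a flag and route marked requests to shadow resources. The header name `X-Load-Test`, the `shadow_` prefix, and both function names are hypothetical.

```python
from typing import Mapping

TEST_HEADER = "X-Load-Test"   # hypothetical header name, not from the source
SHADOW_PREFIX = "shadow_"     # hypothetical naming convention for shadow tables


def is_test_traffic(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries the load-test marker."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    return any(name.lower() == TEST_HEADER.lower() and value == "1"
               for name, value in headers.items())


def resolve_table(table: str, headers: Mapping[str, str]) -> str:
    """Send marked requests to a shadow table and leave real users untouched."""
    return SHADOW_PREFIX + table if is_test_traffic(headers) else table


print(resolve_table("orders", {"x-load-test": "1"}))   # shadow_orders
print(resolve_table("orders", {"User-Agent": "app"}))  # orders
```

For this to work the marker has to be propagated across every hop of the request path, otherwise a downstream service cannot tell test requests from real ones.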

Key Features of TestPG

Rapid testing: one‑click configuration of URL, request type, fields, QPS, and duration.

Debugging: shield mode (no real service calls) and service mode (limited real calls with detailed request/response logs).

Error localization: automatic capture and formatting of abnormal requests.

Comprehensive reports with QPS, RT statistics, error rates, baseline comparison, and real‑time charts.
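
The report metrics named above (QPS, RT statistics, error rate, baseline comparison) can be derived from raw request samples. The following is a minimal sketch under stated assumptions: the `Sample` type, the function names, and the nearest-rank percentile method are illustrative choices, not TestPG's actual report format.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Sample:
    rt_ms: float  # response time of one request, in milliseconds
    ok: bool      # whether the request succeeded


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples: List[Sample], duration_s: float) -> Dict[str, float]:
    """Aggregate raw samples into the headline report metrics."""
    rts = sorted(s.rt_ms for s in samples)
    errors = sum(1 for s in samples if not s.ok)
    return {
        "qps": len(samples) / duration_s,
        "rt_avg_ms": mean(rts),
        "rt_p99_ms": percentile(rts, 99),
        "error_rate": errors / len(samples),
    }


def compare_to_baseline(current: Dict[str, float],
                        baseline: Dict[str, float]) -> Dict[str, float]:
    """Relative change per metric; metrics with a missing or zero baseline are skipped."""
    return {name: (current[name] - baseline[name]) / baseline[name]
            for name in current if baseline.get(name)}
```

A baseline comparison of this kind makes regressions visible at a glance, for example a P99 response time that is 30% higher than in the previous run at the same QPS.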

Future Directions

Full‑link monitoring integrated with EagleEye for automatic topology discovery.

Simplified corpus generation by handling log processing on the platform.

Enriched pressure models (step, jitter, pulse) beyond JMeter’s native capabilities.

Confidence evaluation of test scenarios using feature libraries, geographic coverage, and traffic models.

Extending support to write‑heavy workloads and full traffic isolation.
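
The step, jitter, and pulse pressure models mentioned above can each be expressed as a per-second target QPS schedule. The sketch below shows one plausible reading of those three names. The source does not define the models, so the shapes, function names, and parameters are assumptions.

```python
import random
from typing import List


def step_profile(start_qps: int, step_qps: int,
                 step_seconds: int, steps: int) -> List[int]:
    """Hold each level for step_seconds, then raise it by step_qps."""
    return [start_qps + i * step_qps
            for i in range(steps)
            for _ in range(step_seconds)]


def pulse_profile(base_qps: int, peak_qps: int, period_seconds: int,
                  pulse_seconds: int, total_seconds: int) -> List[int]:
    """Stay at base_qps and spike to peak_qps at the start of every period."""
    return [peak_qps if t % period_seconds < pulse_seconds else base_qps
            for t in range(total_seconds)]


def jitter_profile(base_qps: int, amplitude: int,
                   total_seconds: int, seed: int = 0) -> List[int]:
    """Vary the target randomly within +/- amplitude around base_qps."""
    rng = random.Random(seed)  # seeded so that a test run can be repeated
    return [max(0, base_qps + rng.randint(-amplitude, amplitude))
            for _ in range(total_seconds)]


print(step_profile(100, 50, 2, 3))         # [100, 100, 150, 150, 200, 200]
print(pulse_profile(100, 500, 5, 1, 10))   # spikes to 500 at seconds 0 and 5
```

A load generator would read such a schedule once per second and adjust its sending rate to match, which keeps the traffic shape independent of the tool that produces the requests.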


Written by

Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.
