Full-Link Load Testing Platform TestPG: Architecture, Corpus Production, and Intelligent Features
Gaode's TestPG platform addresses the bottlenecks of full-link load testing in three ways: it unifies traffic capture through Iflow, converts captured traffic into standardized corpora with a Flink pipeline, and adds corpus intelligence that collects seasonal feature statistics and predicts feature distributions. Together these enable feature-level precise throttling and faster, more reliable testing, and they lay the groundwork for future autonomous optimization.
Gaode Map, a national-level travel service platform with over 100 million daily active users, relies on a massive backend cluster. A failure can have a huge impact, especially because the three-data-center, cross-region deployment makes the online environment and its call links complex. To protect users and to support capacity planning, disaster recovery, and rapid incident response, a proactive validation method is required: full-link load testing, which brings real traffic into the test environment before problems occur.
Full‑link load testing is a crucial means of guaranteeing online service stability for Gaode. The TestPG platform has evolved from its inception to a mature system that supports both routine and full‑link testing, achieving rapid and accurate testing goals. Corpus production (traffic processing) is a key step, and this document focuses on that aspect.
Related article: Gaode Full‑Link Load Testing Platform TestPG Architecture and Practice
A full-link test can be summarized in three steps: pre-test traffic processing (corpus production), pressure-model activation during the test, and post-test result analysis and issue localization. Pre-test traffic processing is the most time-consuming step. Previously, the operations team collected logs and handed them to testers, who turned them into scripts; this was costly, slow, and prone to request expiration. TestPG standardized the corpus format and the traffic-processing flow, but two main problems remained:
Lack of unified control over the corpus production process. Although the format is standardized, each business line handled traffic independently, leading to high production cost.
Interface-level throttling is not precise enough. Traffic varies with weather, geography, holidays, and similar factors. For example, long-distance navigation during holidays consumes backend compute resources non-linearly, but the platform only supports throttling per interface, not per traffic feature.
To address these issues, Gaode’s full‑link testing team launched a corpus intelligence project.
Solution Approach
Standardized Traffic Capture
Gaode’s full‑link testing had already integrated most business lines, but traffic sources (logs, ODPS, raw traffic) were heterogeneous. The team decided to unify traffic capture by using a low‑cost, high‑efficiency method: traffic recording. Existing recording solutions were fragmented, causing instability. Therefore, a unified traffic capture platform (Iflow) was built.
Platform‑Based Corpus Production
After standardizing traffic capture, the next step is to convert raw traffic into platform‑compliant corpus. Because each business has unique requirements, a fully custom solution would be costly. The team chose Apache Flink to implement a flexible pipeline: business‑specific logic is expressed via user‑defined functions (UDFs), while common processing (e.g., feature extraction, storage) is handled by Flink sink plugins.
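The split between business-specific UDFs and shared sink plugins can be sketched as follows. This is a plain-Python stand-in for the pattern, not Flink code and not TestPG's implementation: the record shapes, `CorpusPipeline`, `navigation_udf`, and both sinks are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical record shapes; the real TestPG corpus format is not shown in this article.
RawRecord = Dict[str, str]
CorpusEntry = Dict[str, str]

# Business-specific "UDF": maps one raw record to a corpus entry, or None to drop it.
BusinessUdf = Callable[[RawRecord], Optional[CorpusEntry]]
# Shared "sink plugin": receives every corpus entry the UDF produces.
SinkPlugin = Callable[[CorpusEntry], None]


@dataclass
class CorpusPipeline:
    udf: BusinessUdf
    sinks: List[SinkPlugin] = field(default_factory=list)

    def run(self, records: Iterable[RawRecord]) -> int:
        """Apply the UDF to each record and fan the result out to all sinks."""
        produced = 0
        for record in records:
            entry = self.udf(record)
            if entry is None:  # the UDF filtered this record out
                continue
            for sink in self.sinks:
                sink(entry)
            produced += 1
        return produced


def navigation_udf(record: RawRecord) -> Optional[CorpusEntry]:
    """Assumed business rule: keep only route-planning requests."""
    if record.get("path") != "/route/plan":
        return None
    return {"method": "GET", "path": record["path"], "query": record.get("query", "")}


stored: List[CorpusEntry] = []
path_counts: Dict[str, int] = {}


def storage_sink(entry: CorpusEntry) -> None:
    # Stand-in for persisting the corpus.
    stored.append(entry)


def feature_sink(entry: CorpusEntry) -> None:
    # Stand-in for feature extraction: count entries per path.
    path_counts[entry["path"]] = path_counts.get(entry["path"], 0) + 1


pipeline = CorpusPipeline(udf=navigation_udf, sinks=[storage_sink, feature_sink])
produced = pipeline.run([
    {"path": "/route/plan", "query": "from=A&to=B"},
    {"path": "/health"},
    {"path": "/route/plan", "query": "from=C&to=D"},
])
```

Onboarding a new business line in this sketch means writing one new UDF; the sinks are reused unchanged, which is the cost argument made above.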
Corpus Intelligence
Gaode’s traffic exhibits strong seasonal patterns (e.g., longer navigation routes during holidays). To achieve feature‑level precise throttling, the platform must provide realistic traffic feature distributions. Corpus intelligence follows three stages:
1. Traffic Feature Statistics: Identify parameters that affect traffic (weather, holidays, etc.) and collect their distributions.
2. Traffic Feature Extraction: Extract these parameters during the corpus production stage, leveraging the unified Flink pipeline.
3. Intelligent Prediction & Machine Learning: Use historical peak-traffic data (e.g., past National Day, Spring Festival) combined with current trends to predict feature distributions for upcoming peaks. Future work includes automated discovery of influencing parameters via machine learning.
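As a rough illustration of stages 1 and 3, the sketch below computes a distribution over navigation-distance buckets and then blends a historical peak distribution with the current one. The bucket boundaries, the function names, and the weighted-average blend are all assumptions made for this example; the article does not specify how TestPG actually predicts distributions.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical distance buckets in km: (inclusive low, exclusive high, name).
BUCKETS = [(0, 10, "short"), (10, 100, "medium"), (100, float("inf"), "long")]


def bucket_of(distance_km: float) -> str:
    """Map a navigation distance to its bucket name."""
    for low, high, name in BUCKETS:
        if low <= distance_km < high:
            return name
    raise ValueError(f"negative distance: {distance_km}")


def feature_distribution(distances_km: Iterable[float]) -> Dict[str, float]:
    """Stage 1: share of requests falling into each bucket."""
    counts = Counter(bucket_of(d) for d in distances_km)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples")
    return {name: counts.get(name, 0) / total for _, _, name in BUCKETS}


def predict_distribution(
    historical_peak: Dict[str, float],
    current: Dict[str, float],
    weight_history: float = 0.5,
) -> Dict[str, float]:
    """Stage 3 (assumed method): weighted blend of history and current trend."""
    blended = {
        name: weight_history * historical_peak[name]
        + (1.0 - weight_history) * current.get(name, 0.0)
        for name in historical_peak
    }
    total = sum(blended.values())
    return {name: value / total for name, value in blended.items()}


current = feature_distribution([5, 8, 50, 120])
last_holiday = {"short": 0.2, "medium": 0.3, "long": 0.5}  # made-up illustrative numbers
predicted = predict_distribution(last_holiday, current, weight_history=0.5)
```

With these made-up inputs the predicted share of long routes lands between the everyday share and last holiday's share, which is the behaviour a peak forecast would need before any more sophisticated model is applied.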
Overall Architecture
The capture platform (Iflow) records traffic, caches it in Kafka, and writes it to ODPS. The corpus production service reads from ODPS, processes traffic with Flink, and stores the resulting corpus in OSS.
Core Components
Iflow Capture Platform
Iflow manages traffic capture tasks. Traffic is copied via plugins, cached in Kafka, and persisted to ODPS. Tasks require approval, reducing the risk of accidental incidents.
TestPG Corpus Intelligence
The system consists of three modules: business line management, test list management, and interface ratio management. Each business line represents a testing chain from capture to corpus production and feature analysis.
Key functions include:
Linking capture tasks to corpus production.
Automatic triggering of corpus production after capture completes.
Generating corpus paths for each run.
Managing HTTP headers for test requests.
Manual triggering of corpus production when needed, in addition to the automatic trigger.
Test list management automatically registers newly discovered interfaces during capture, categorizing them as testable, exempt, or pending improvement, thus driving full‑link coverage.
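A minimal sketch of such a registry is shown below. The three categories come from the description above; the class name `InterfaceRegistry`, the choice to register new interfaces as pending improvement by default, and the coverage formula are assumptions for illustration only.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Status(Enum):
    TESTABLE = "testable"
    EXEMPT = "exempt"
    PENDING = "pending improvement"


class InterfaceRegistry:
    """Hypothetical test list: tracks every interface seen during capture."""

    def __init__(self) -> None:
        self._status: Dict[str, Status] = {}

    def register_discovered(self, interfaces: Iterable[str]) -> List[str]:
        """Add unseen interfaces (assumed default: PENDING); return the new ones."""
        added = []
        for name in interfaces:
            if name not in self._status:
                self._status[name] = Status.PENDING
                added.append(name)
        return added

    def classify(self, name: str, status: Status) -> None:
        """Set the category of an already registered interface."""
        if name not in self._status:
            raise KeyError(name)
        self._status[name] = status

    def coverage(self) -> float:
        """Assumed metric: share of non-exempt interfaces that are testable."""
        in_scope = [s for s in self._status.values() if s is not Status.EXEMPT]
        if not in_scope:
            return 1.0
        return sum(s is Status.TESTABLE for s in in_scope) / len(in_scope)


registry = InterfaceRegistry()
first_batch = registry.register_discovered(["/route/plan", "/search", "/health"])
registry.classify("/route/plan", Status.TESTABLE)
registry.classify("/health", Status.EXEMPT)
coverage_before = registry.coverage()

# A later capture run sees one known and one new interface.
second_batch = registry.register_discovered(["/search", "/traffic"])
coverage_after = registry.coverage()
```

In this sketch a newly captured interface lowers the coverage figure until someone classifies it, which is one way the test list can drive full-link coverage.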
Interface ratio management initially uses BI‑provided ratios; later, extracted traffic features will feed intelligent predictions to produce realistic ratio data for testing.
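To show how interface ratios might translate into a load plan, the sketch below splits a total target QPS across interfaces in proportion to their ratios. The function name `allocate_qps`, the sample ratios, and the largest-remainder rounding are assumptions; the article does not describe how TestPG consumes the ratios.

```python
from typing import Dict


def allocate_qps(ratios: Dict[str, float], total_qps: int) -> Dict[str, int]:
    """Split total_qps proportionally to ratios using largest-remainder rounding."""
    if any(value < 0 for value in ratios.values()):
        raise ValueError("ratios must be non-negative")
    weight_sum = sum(ratios.values())
    if weight_sum <= 0:
        raise ValueError("at least one ratio must be positive")

    # Exact (fractional) share per interface, then its integer floor.
    exact = {name: total_qps * value / weight_sum for name, value in ratios.items()}
    allocation = {name: int(share) for name, share in exact.items()}

    # Hand the QPS lost to flooring to the interfaces with the largest remainders.
    leftover = total_qps - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Made-up ratios standing in for BI-provided data.
plan = allocate_qps({"/route/plan": 5, "/search": 3, "/traffic": 2}, total_qps=1001)
```

Largest-remainder rounding is used here so that the per-interface targets always sum to the requested total, which keeps the overall pressure level exact even when the ratios do not divide evenly.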
Platform Advantages
Corpus Platformization
The production pipeline integrates the capture platform with Flink, supporting both custom business logic (via UDFs) and common processing (via Flink sink plugins). This improves the efficiency and quality of corpus generation and moves the platform from format standardization to end-to-end process standardization.
Corpus Intelligence
Flink plugins collect and aggregate feature parameters during production. The accumulated statistics enable downstream intelligent analysis and prediction, achieving feature‑level precise throttling for truly realistic full‑link testing.
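One way to picture feature-level throttling is as resampling the corpus so that a feature's distribution matches a target, rather than only scaling each interface as a whole. The sketch below does that under stated assumptions: corpus entries are hypothetical `(feature_bucket, request)` pairs, `sample_by_feature` is an invented name, and sampling with replacement is a choice made for this example.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

CorpusItem = Tuple[str, str]  # (feature bucket, request payload); hypothetical shape


def sample_by_feature(
    corpus: Sequence[CorpusItem],
    target: Dict[str, float],
    size: int,
    seed: int = 0,
) -> List[CorpusItem]:
    """Draw about `size` requests whose bucket shares follow `target`."""
    by_bucket: Dict[str, List[str]] = defaultdict(list)
    for bucket, request in corpus:
        by_bucket[bucket].append(request)

    rng = random.Random(seed)
    sample: List[CorpusItem] = []
    for bucket, share in target.items():
        wanted = round(size * share)
        if wanted > 0 and not by_bucket[bucket]:
            raise ValueError(f"no corpus entries for bucket {bucket!r}")
        # Sample with replacement so that rare buckets can be amplified.
        for _ in range(wanted):
            sample.append((bucket, rng.choice(by_bucket[bucket])))
    rng.shuffle(sample)  # avoid sending one bucket in a single burst
    return sample


corpus = [("short", "req-1"), ("short", "req-2"), ("long", "req-3")]
holiday_target = {"short": 0.4, "long": 0.6}  # made-up target distribution
replay = sample_by_feature(corpus, holiday_target, size=10)
bucket_counts = Counter(bucket for bucket, _ in replay)
```

Here the everyday corpus is mostly short routes, yet the replay set is mostly long routes, mirroring the holiday scenario in which long-distance navigation dominates backend cost.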
Future Outlook
Future work will apply machine‑learning models to historical feature data to predict traffic characteristics for upcoming peak periods, automatically discover influencing parameters, and provide confidence‑evaluation systems that compare predicted vs. real traffic features, continuously improving accuracy. Combined with precise throttling, pressure models, and monitoring, the platform aims for fully autonomous, intelligent full‑link testing.
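A confidence evaluation of this kind could, for instance, score a prediction by the distance between the predicted and the observed feature distributions. The sketch below uses total variation distance for that purpose; the metric, the function names, and the sample numbers are assumptions, since the article only states that predicted and real features will be compared.

```python
from typing import Dict


def total_variation_distance(predicted: Dict[str, float], observed: Dict[str, float]) -> float:
    """Half the sum of absolute differences; 0 means identical, 1 means disjoint."""
    keys = set(predicted) | set(observed)
    return 0.5 * sum(abs(predicted.get(k, 0.0) - observed.get(k, 0.0)) for k in keys)


def confidence(predicted: Dict[str, float], observed: Dict[str, float]) -> float:
    """Assumed score: 1 minus the distance, so higher is better."""
    return 1.0 - total_variation_distance(predicted, observed)


# Made-up distributions over navigation-distance buckets.
predicted = {"short": 0.35, "medium": 0.275, "long": 0.375}
observed = {"short": 0.30, "medium": 0.30, "long": 0.40}
score = confidence(predicted, observed)
```

Tracking such a score across successive peaks would give a simple signal of whether prediction accuracy is improving over time.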
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.