Design and Practice of Autohome's Performance Testing Platform PTS
This article describes the architecture, key components, test types, and operational results of PTS, Autohome's performance testing platform. PTS uses Docker Swarm, gRPC, JMeter, Flume-Kafka, and Flink to run large-scale distributed load tests for the 818 event; planned improvements include a move to Kubernetes and direct logging to Kafka.
Autohome's 818 global live event generates massive traffic spikes, prompting the cloud platform team to develop a full‑link performance testing platform (PTS) to evaluate and mitigate service risks before the event.
PTS consists of core modules such as user interface, scheduling service, resource management, log collection, metric calculation, and reporting, providing distributed load testing by replaying traffic to simulate real‑world user scenarios.
Architecture Overview: Users register test scenarios, which are dispatched to the scheduling service. The scheduler acquires containers from Docker Swarm, and each container runs a JMeter instance. Logs are written to a shared directory, collected by Flume, sent to Kafka, and processed by Flink, which computes metrics and stores them in Redis for real-time monitoring.
Docker Unit: A Docker image serves as the test unit. It contains an RPC service (implemented with gRPC and protobuf) for status reporting and a JMeter package for generating load. The RPC module offers APIs such as send_conf_file(), send_jar_file(), exec_command(), async_exec_command(), and run_jmeter().
Log Specification: Logs are stored under /data/log with filenames composed of timestamps and task IDs, which simplifies collection and troubleshooting. Both Nginx-style string logs and JSON gateway logs are supported.
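The naming rule above can be sketched as follows. This is a minimal illustration only: the article states that filenames combine a timestamp and a task ID under /data/log, but the exact layout (timestamp format, underscore separator, .log suffix) and the helper names log_path and parse_log_path are assumptions made for this example.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Tuple

LOG_DIR = PurePosixPath("/data/log")  # directory named in the article
STAMP_FORMAT = "%Y%m%d%H%M%S"         # assumed timestamp layout


def log_path(task_id: str, started_at: datetime) -> PurePosixPath:
    """Build a log path from a timestamp and a task ID (assumed layout)."""
    stamp = started_at.strftime(STAMP_FORMAT)
    return LOG_DIR / f"{stamp}_{task_id}.log"


def parse_log_path(path: PurePosixPath) -> Tuple[datetime, str]:
    """Recover the timestamp and task ID from a path built by log_path()."""
    stamp, task_id = path.stem.split("_", 1)
    return datetime.strptime(stamp, STAMP_FORMAT), task_id
```

Encoding both values in the filename lets a collector or an engineer locate the logs for one task without opening any file.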
Docker Swarm Management: Swarm supports highly available clusters with multiple manager nodes. During the 818 event, 270 VMs across two data centers hosted up to 2,000 concurrent containers and reached a peak QPS of 5 million without failures.
Traffic Capture: Logs in two formats (Nginx and JSON) are consumed from Kafka. Capture tasks filter records by user-defined criteria and stop according to either a count-based or a duration-based strategy.
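The two stopping strategies can be illustrated with a small sketch. It assumes records have already been parsed into dictionaries and arrive from any iterable; the Kafka consumer itself is left out, and the function name capture and its parameters are hypothetical rather than part of PTS.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]


def capture(
    records: Iterable[Record],
    matches: Callable[[Record], bool],
    max_count: Optional[int] = None,
    max_seconds: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> List[Record]:
    """Collect matching records until a count or duration limit is reached."""
    deadline = None if max_seconds is None else clock() + max_seconds
    captured: List[Record] = []
    for record in records:
        # Duration-based strategy: stop once the time budget is used up.
        if deadline is not None and clock() >= deadline:
            break
        # User-defined filter, for example by host or URL prefix.
        if not matches(record):
            continue
        captured.append(record)
        # Count-based strategy: stop once enough records are collected.
        if max_count is not None and len(captured) >= max_count:
            break
    return captured
```

For example, capture(stream, lambda r: r["host"] == "api.example.com", max_count=1000) would keep the first 1,000 matching requests; the host name here is a placeholder.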
Testing Types: PTS supports four test types: custom request, gRPC, native JMeter (.jmx) files, and traffic replay. Each has its own configuration requirements.
Distributed Load Strategy: One container handles up to 1,000 threads. For larger thread counts, the system calculates the required number of containers, requests them from Swarm, initializes their environments via gRPC, and starts the load once the resources are ready.
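The container calculation is a ceiling division by the per-container limit. The sketch below shows that step; spreading the threads evenly across containers is an assumption for illustration, since the article only states the 1,000-thread cap, and plan_containers is a hypothetical name.

```python
from typing import List

MAX_THREADS_PER_CONTAINER = 1000  # per-container cap stated in the article


def plan_containers(
    total_threads: int, per_container: int = MAX_THREADS_PER_CONTAINER
) -> List[int]:
    """Return the thread count assigned to each container."""
    if total_threads <= 0 or per_container <= 0:
        raise ValueError("thread counts must be positive")
    # Ceiling division: the smallest number of containers that fits the load.
    count = (total_threads + per_container - 1) // per_container
    # Assumed policy: spread threads as evenly as possible.
    base, extra = divmod(total_threads, count)
    return [base + 1 if i < extra else base for i in range(count)]
```

With these assumptions, a 2,500-thread scenario needs three containers, and no container exceeds the cap.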
Reporting: FlinkSQL updates metrics every second, giving a real-time view of the load on each interface and allowing anomalies to be handled immediately.
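The article does not show the FlinkSQL queries, so the following is only a plain-Python stand-in for a one-second tumbling-window aggregation, meant to show what a second-level metric might contain. The Sample fields and the metric names (qps, avg_ms, error_rate) are assumptions, not the platform's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Sample(NamedTuple):
    timestamp_ms: int   # request completion time in milliseconds
    interface: str      # tested interface, e.g. a URL path
    elapsed_ms: float   # response time
    success: bool       # whether the request succeeded


def per_second_metrics(
    samples: Iterable[Sample],
) -> Dict[Tuple[int, str], Dict[str, float]]:
    """Group samples into one-second windows per interface and summarize."""
    buckets: Dict[Tuple[int, str], List[Sample]] = defaultdict(list)
    for s in samples:
        buckets[(s.timestamp_ms // 1000, s.interface)].append(s)

    metrics: Dict[Tuple[int, str], Dict[str, float]] = {}
    for key, group in buckets.items():
        n = len(group)
        metrics[key] = {
            "qps": n,  # requests completed within this one-second window
            "avg_ms": sum(s.elapsed_ms for s in group) / n,
            "error_rate": sum(1 for s in group if not s.success) / n,
        }
    return metrics
```

In a streaming job the same grouping would be expressed as a windowed query and each result written to the store that backs the dashboard.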
Practice Results: Over ten pre-event test rounds, more than 200 scenarios per round were executed, with single-scenario QPS reaching 1 million and a 100% success rate across a 270-node cluster. Full-link monitoring and rapid incident response ensured stability.
Conclusion and Future Work: Since launch, PTS has supported daily testing across multiple business lines and successfully handled the 818 event. Planned improvements include migrating from Swarm to Kubernetes for better resource utilization, eliminating on-disk log persistence by writing logs directly to Kafka, and adding support for multi-region load generation, dynamic host configuration, and other advanced scenarios.
HomeTech