Building a Custom RPC Stress‑Testing Tool: Insights from Meituan

Most of Meituan’s internal RPC services are built on Thrift, and the team needed a simpler way to stress-test them. The resulting in-house tool automates traffic capture, provides an intuitive UI, aggregates metrics in InfluxDB, and supports both Thrift and HTTP workloads, addressing the shortcomings of existing open-source options.

ITPUB
Background

Most of Meituan’s internal RPC services are built on Apache Thrift. During routine development, engineers need to stress-test these services to uncover potential issues. The existing approaches, either writing custom Python or Ruby scripts to replay logs or using generic open-source tools, are time-consuming, error-prone, and lack unified reporting.

Problems with Existing Solutions

Heavy code effort to parse logs and reconstruct requests, especially for complex Thrift payloads.

Setup of scripting environments or third‑party tools consumes significant time.

Inconsistent result formats; many tools output raw terminal data that is hard to interpret.

Difficulty sharing test configurations across teams due to environment and code differences.

Evaluation of Open‑Source Tools

We examined several popular stress‑testing frameworks:

JMeter – excellent for HTTP but lacks native Thrift support, requires complex local installation, and is not user‑friendly for our use case.

Twitter’s iago – supports HTTP and Thrift but forces a project per test, presents non‑intuitive results, depends on an outdated Scala version, and has sparse documentation.

Other tools such as Gatling, Grinder, and Locust were also considered but did not align with Meituan’s specific requirements.

Given these gaps, building a simple, easy‑to‑use internal tool became necessary.

Design Goals

Capture live traffic from production services.

Provide an intuitive UI that enables test setup within an hour.

Display clear charts for key performance metrics.

Support both Thrift and HTTP services.

Architecture Overview

The tool follows a three-stage lifecycle: init (resource preparation such as DB connections and client creation), run (multi-threaded request generation, with timestamps recorded around each request to compute response-time statistics such as the average, the maximum, and TP90, the 90th-percentile response time), and destroy (resource cleanup). The core interface abstracts these stages, so developers implement service-specific runners while the framework handles concurrency and result aggregation.

Figure: Interface diagram
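
The article does not show the interface itself, so the following is only a minimal sketch of how an init/run/destroy lifecycle might be driven from several threads. The names Runner, execute, and SleepRunner are hypothetical and are not taken from Meituan's tool, which is Java-based; Python is used here purely for brevity.

```python
import threading
import time
from abc import ABC, abstractmethod
from typing import List


class Runner(ABC):
    """Hypothetical three-stage interface: init -> run -> destroy."""

    @abstractmethod
    def init(self) -> None:
        """Prepare resources such as connections or clients."""

    @abstractmethod
    def run(self) -> None:
        """Send one request; the framework times this call."""

    @abstractmethod
    def destroy(self) -> None:
        """Release whatever init() acquired."""


def execute(runner: Runner, threads: int, requests_per_thread: int) -> List[float]:
    """Drive a runner from several threads; return latencies in milliseconds."""
    latencies: List[float] = []
    lock = threading.Lock()

    def worker() -> None:
        for _ in range(requests_per_thread):
            start = time.perf_counter()
            runner.run()
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            with lock:  # guard the shared result list
                latencies.append(elapsed_ms)

    runner.init()
    try:
        pool = [threading.Thread(target=worker) for _ in range(threads)]
        for t in pool:
            t.start()
        for t in pool:
            t.join()
    finally:
        runner.destroy()  # cleanup happens once, after all workers finish
    return latencies


class SleepRunner(Runner):
    """Stand-in for a real service call, used only to exercise the sketch."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def init(self) -> None:
        self.events.append("init")

    def run(self) -> None:
        time.sleep(0.001)

    def destroy(self) -> None:
        self.events.append("destroy")
```

Calling execute(SleepRunner(), threads=4, requests_per_thread=5) returns 20 latency samples. A service-specific runner would replace the sleep with a Thrift or HTTP call, leaving threading and timing to the framework.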

Traffic Capture (VCR)

To simplify replaying real production traffic, the tool introduces a VCR (Video Cassette Recorder) component. VCR serializes incoming requests into JSON and stores them in Redis using a single‑threaded asynchronous writer, minimizing impact on the live service.

Figure: Traffic capture flow
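
To illustrate why a single asynchronous writer keeps capture cheap on the request path, here is a rough sketch of the pattern. It assumes a bounded in-memory queue and a pluggable sink function; the Recorder class, its method names, and the drop-when-full policy are illustrative assumptions, not details confirmed by the article. In a real deployment the sink might push each JSON line to a Redis list (for example with a Redis client's RPUSH), but a plain callable keeps the sketch self-contained.

```python
import json
import queue
import threading
from typing import Callable


class Recorder:
    """Capture requests without blocking the caller (hypothetical API)."""

    def __init__(self, sink: Callable[[str], None], capacity: int = 10000) -> None:
        self._sink = sink
        self._queue: queue.Queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # approximate count of records shed under load
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def record(self, method: str, args: dict) -> None:
        """Called on the request path: enqueue and return immediately."""
        try:
            self._queue.put_nowait({"method": method, "args": args})
        except queue.Full:
            self.dropped += 1  # assumption: drop rather than slow the service

    def _drain(self) -> None:
        """Single writer thread: serialize to JSON and hand to the sink."""
        while True:
            item = self._queue.get()
            if item is None:  # sentinel pushed by close()
                return
            self._sink(json.dumps(item, sort_keys=True))

    def close(self) -> None:
        """Flush queued records, then stop the writer."""
        self._queue.put(None)
        self._writer.join()
```

Because only the writer thread touches the sink, records are stored in the order they were captured, and a slow sink costs the live service at most a dropped record rather than added latency.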

Data Aggregation

After a test run, the tool aggregates metrics such as maximum response time, average response time, QPS, TP90, and TP50. InfluxDB serves as the time-series backend and keeps queries simple: for example, a one-line InfluxQL statement can retrieve TP90 values.

Figure: InfluxDB query example
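
The original query appears only as an image, so it is not reproduced here. InfluxQL does provide a PERCENTILE() function, so a query of the form SELECT PERCENTILE("response_ms", 90) FROM "stress_test" would be one plausible shape, with both names being hypothetical. To make the metrics themselves concrete, the sketch below computes the same summary locally from a list of latencies. It assumes the nearest-rank definition of a percentile, which may differ from what the tool or InfluxDB uses.

```python
import math
from typing import Dict, List


def percentile(sorted_values: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100.0))
    return sorted_values[rank - 1]


def summarize(latencies_ms: List[float], duration_s: float) -> Dict[str, float]:
    """Aggregate one test run into max, average, QPS, TP50, and TP90."""
    if not latencies_ms or duration_s <= 0:
        raise ValueError("need at least one sample and a positive duration")
    ordered = sorted(latencies_ms)
    return {
        "max_ms": ordered[-1],
        "avg_ms": sum(ordered) / len(ordered),
        "qps": len(ordered) / duration_s,  # completed requests per second
        "tp50_ms": percentile(ordered, 50),
        "tp90_ms": percentile(ordered, 90),
    }
```

For ten samples of 1 ms through 10 ms collected over two seconds, this gives a maximum of 10 ms, an average of 5.5 ms, 5 QPS, a TP50 of 5 ms, and a TP90 of 9 ms.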

Implementation Details

The tool is packaged as a Maven artifact, making it easy for Java‑based services to consume. Users only need two lines of code to start traffic capture. For isolation, a dedicated machine performs the capture to avoid affecting production latency.

Figure: Maven usage example

After capture, a web UI lets users inspect collected logs and view details of individual requests.

Figure: Web UI for log inspection

Performance Metrics Visualization

Results are presented through intuitive charts, allowing users to quickly assess application stability and performance under load.

Figure: Performance dashboard

Conclusion

Since its launch, the custom stress‑testing tool has been adopted by over 20 services, executing hundreds of test runs. Setup time has been reduced to 15–30 minutes per application, improving service stability and freeing developers from cumbersome manual testing processes.
