
Analyzing and Optimizing RPC Framework Latency: A Case Study of Long‑Tail Effects and Elastic Timeout Solution

This article investigates why an RPC interface with an average execution time of 1.5 ms still experiences numerous 100 ms+ timeouts, analyzes the root causes such as GC and I/O jitter, and proposes an elastic timeout optimization to improve service reliability.

Zhuanzhuan Tech

1 Background

On a pleasant spring afternoon, a colleague from the Customer Service Technology Department (P) reported an issue to us:

P: Yun Jie, we are currently improving service quality, but one of our interfaces cannot meet the company’s 5‑nine success rate standard.

Me: Great, tell me more.

P: Our lookupWarehouseIdRandom interface in sccis first checks the cache, then queries the database on a miss and writes the result back to the cache. The average execution time is only 1.5 ms. However, scoms calls it with a timeout of 100 ms, and we still see more than 500 timeouts per day, failing the 5-nine standard. Is there a problem with the framework?

Me: That seems unlikely – the average is 1.5 ms, and the timeout is set to 100 ms (over 60× the average)!

P: It’s true!! Look at the data yourself!!!

Me: Let’s see.

The investigation begins here.

2 Verification and Analysis

2.1 Preparation

Before verification, we briefly introduce the call process of the Zhuanzhuan RPC framework SCF, as shown in the diagram below:

1. Serialization: SCF receives the request, performs load balancing and serializes the request.

2. Send: SCF sends the serialized binary stream over the network to the service node.

3. Deserialization: The service node receives the data, deserializes it and queues the request.

4. Execution: SCF forwards the request to the service implementation for processing.

5. Serialization: The service serializes the result into a binary stream.

6. Return: The data is sent back to the caller.

7. Deserialization: The caller's SCF deserializes the binary data into an object, making the remote call appear like a local method call.

The above describes a complete RPC call chain.
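The seven steps above can be sketched as a toy round trip. This is an illustrative sketch only: the class, the string codec, and the in-process "transfer" below are stand-ins for SCF's real serialization and network layer, which the article does not show.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Illustrative sketch of the seven-step SCF call chain (not the real SCF API).
public class RpcChainSketch {
    // Steps 1 and 5: (de)serialization, here a toy UTF-8 string codec.
    static byte[] serialize(String obj) { return obj.getBytes(StandardCharsets.UTF_8); }
    static String deserialize(byte[] data) { return new String(data, StandardCharsets.UTF_8); }

    // Steps 2 and 6: the "network" hop, simulated as a direct handoff.
    static byte[] transfer(byte[] payload) { return payload; }

    // One full round trip: serialize -> send -> deserialize -> execute
    // -> serialize -> return -> deserialize.
    static String call(String request, Function<String, String> serviceImpl) {
        byte[] wire = transfer(serialize(request));          // caller side
        String resp = serviceImpl.apply(deserialize(wire));  // step 4: execution
        return deserialize(transfer(serialize(resp)));       // response path
    }

    public static void main(String[] args) {
        System.out.println(call("warehouse-42", req -> "id:" + req)); // prints id:warehouse-42
    }
}
```

Any of these hops, not just step 4, contributes to the latency the caller observes against its timeout.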

2.2 Verification

Monitoring shows that the interface’s average execution time is indeed around 1.5 ms:

However, with a caller timeout of 100 ms, many requests still time out:

It is shocking!

2.3 Problem Analysis

From the RPC call chain we can see that any sub-process may jitter and cause timeouts. We split the chain into two parts: the framework (network and SCF overhead; steps 1, 2, 3, 5, 6 and 7) and the business logic (step 4).

Framework : Objective causes such as network latency and SCF processing.

Business : Subjective causes originating from the actual service implementation.

Because framework latency is complex and hard to measure, we monitor the distribution of business execution times to determine where the problem lies.

If business execution times are uniformly low, the timeout is likely caused by the framework.

If many business calls have high latency, the issue resides in the business logic.
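This decision rule amounts to counting how many business-side executions cross the timeout threshold. A minimal sketch with made-up sample data (the real numbers come from monitoring charts):

```java
import java.util.Arrays;

// Sketch of the decision rule: inspect business-side execution times and
// count how many exceed the 100 ms timeout. Sample data is illustrative.
public class LatencyBuckets {
    // Counts samples at or above thresholdMs.
    static long countSlow(long[] latenciesMs, long thresholdMs) {
        return Arrays.stream(latenciesMs).filter(t -> t >= thresholdMs).count();
    }

    public static void main(String[] args) {
        long[] businessMs = {1, 2, 1, 3, 150, 2, 1, 120, 2};
        long slow = countSlow(businessMs, 100);
        // A nonzero count means the business logic itself produces 100 ms+
        // calls, so the long tail is not purely a framework problem.
        System.out.println("slow business calls: " + slow); // 2
    }
}
```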

Monitoring the service shows that most requests finish within 5 ms, but 314 requests exceed 100 ms:

The average execution time is 1.5 ms, yet many requests take over 100 ms. Where does this extra time go?

2.4 Investigation

Current monitoring only reports overall latency, so we instrument the interface to break it into several stages:
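One way to do such staged instrumentation is to wrap each phase (cache lookup, DB query on a miss, cache write-back) in a timer. A hedged sketch; the stage bodies below are placeholders, not the real sccis code:

```java
// Hedged sketch of staged instrumentation for lookupWarehouseIdRandom.
// The stage bodies are placeholders for the cache lookup, the DB query
// on a miss, and the cache write-back.
public class StageTimer {
    // Runs one stage and returns its wall-clock duration in milliseconds.
    static long timeMs(Runnable stage) {
        long start = System.nanoTime();
        stage.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long cacheMs = timeMs(() -> { /* cache.get(key) */ });
        long dbMs    = timeMs(() -> { /* db.query(key) on cache miss */ });
        long writeMs = timeMs(() -> { /* cache.put(key, value) */ });
        // Reporting each stage separately is what lets the monitoring
        // attribute latency to I/O versus pure CPU work.
        System.out.printf("cache=%dms db=%dms writeback=%dms total=%dms%n",
                cacheMs, dbMs, writeMs, cacheMs + dbMs + writeMs);
    }
}
```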

The monitoring results are shown below:

From the data we observe:

I/O operations are jittery, often exceeding 100 ms.

Simple CPU operations rarely exceed 100 ms but frequently spike to around 20 ms, jumping abruptly from 1 ms.

2.5 Root Cause

The average of 1.5 ms masks a long‑tail distribution. Possible reasons include GC pauses, CPU time‑slice allocation, etc. The following GC monitoring chart for sccis illustrates this:
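The long-tail effect is also easy to observe directly, even without the chart: time a trivial operation many times and compare the worst case against the average. A small sketch (results vary by machine, JVM warm-up, and background load):

```java
// Sketch showing how a long tail appears even for a trivial operation:
// sample the cost of i++ many times and compare average vs. worst case.
public class TailSampler {
    static volatile int sink; // defeat dead-code elimination

    // Returns {averageNs, maxNs} over n timed iterations of a trivial op.
    static long[] sample(int n) {
        long total = 0, max = 0;
        int i = 0;
        for (int k = 0; k < n; k++) {
            long t0 = System.nanoTime();
            i++; // the "trivial" operation
            long dt = System.nanoTime() - t0;
            total += dt;
            max = Math.max(max, dt);
        }
        sink = i;
        return new long[]{total / n, max};
    }

    public static void main(String[] args) {
        long[] r = sample(100_000);
        // On a loaded JVM, max is often orders of magnitude above the
        // average, driven by GC pauses and OS scheduling.
        System.out.println("avg=" + r[0] + "ns max=" + r[1] + "ns");
    }
}
```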

We also compared the zzproduct service’s getProductById() interface and observed a similar latency distribution:

3 Solution

In summary, although the average execution time of the business interface is only 1.5 ms, a significant number of requests exceed 100 ms due to long‑tail effects caused by GC, CPU time‑slice allocation, network jitter, etc.

To meet the company’s 5‑nine reliability requirement, we can either increase the timeout to match the 99.999th percentile (e.g., 123 ms) or optimize the business logic/JVM. Adjusting the timeout directly is the simplest approach.
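The gap between the mean and a high percentile is what makes a figure like 123 ms plausible. A sketch of a naive nearest-rank percentile calculation over synthetic data (the samples are illustrative, not the real distribution):

```java
import java.util.Arrays;

// Sketch of sizing a timeout from a high percentile instead of the mean.
// The synthetic samples are illustrative only.
public class PercentileTimeout {
    // Nearest-rank percentile over an already-sorted array.
    static long percentile(long[] sortedMs, double p) {
        int idx = (int) Math.ceil(p * sortedMs.length) - 1;
        return sortedMs[Math.max(0, Math.min(idx, sortedMs.length - 1))];
    }

    public static void main(String[] args) {
        long[] latencies = new long[100_000];
        Arrays.fill(latencies, 1);   // the vast majority: ~1 ms
        latencies[99_998] = 150;     // a handful of long-tail outliers
        latencies[99_999] = 150;
        Arrays.sort(latencies);
        System.out.println("p50     = " + percentile(latencies, 0.50) + " ms");    // 1 ms
        System.out.println("p99.999 = " + percentile(latencies, 0.99999) + " ms"); // 150 ms
        // A timeout near the 1 ms mean would fail the tail; a 5-nine
        // budget has to cover the p99.999 value instead.
    }
}
```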

3.1 Framework Optimization – Elastic Timeout

Based on the analysis, the RPC framework can implement an elastic timeout mechanism: without changing the configured 100 ms timeout, allow a configurable number of requests within a configurable time window to extend to a higher latency (e.g., 200 ms). This improves service quality while minimally impacting user experience.
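One plausible way to implement such a quota, assuming a simple fixed window rather than whatever SCF actually uses, is a counter that lets the first N requests per window use the extended budget. In a real framework the extension would kick in only when a request reaches its normal deadline; this sketch simplifies that away:

```java
// Hedged sketch of an elastic-timeout quota (fixed window; not SCF's
// actual implementation). Within each window of windowMs, up to `quota`
// requests may use the extended timeout; the rest keep the normal one.
public class ElasticTimeout {
    private final long normalTimeoutMs, extendedTimeoutMs, windowMs;
    private final int quota;
    private long windowStart;
    private int used;

    ElasticTimeout(long normalMs, long extendedMs, long windowMs, int quota) {
        this.normalTimeoutMs = normalMs;
        this.extendedTimeoutMs = extendedMs;
        this.windowMs = windowMs;
        this.quota = quota;
    }

    // Returns the timeout budget for one request arriving at time nowMs.
    synchronized long timeoutFor(long nowMs) {
        if (nowMs - windowStart >= windowMs) { // roll to a new window
            windowStart = nowMs;
            used = 0;
        }
        if (used < quota) {
            used++;
            return extendedTimeoutMs; // long-tail request gets extra budget
        }
        return normalTimeoutMs;       // quota exhausted: fail fast as before
    }

    public static void main(String[] args) {
        // Mirrors the console example below: 15 requests per 40 s window
        // may extend from 100 ms to 1300 ms.
        ElasticTimeout et = new ElasticTimeout(100, 1300, 40_000, 15);
        int extended = 0;
        for (int i = 0; i < 20; i++) {
            if (et.timeoutFor(0) == 1300) extended++;
        }
        System.out.println(extended + " of 20 requests got the extended timeout"); // 15
    }
}
```

Capping the quota is what keeps the impact on user experience bounded: a widespread slowdown exhausts the window quickly and the remaining requests still fail fast at 100 ms.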

3.1.1 Effect

In the service management console we configure elastic timeout per service and function. For the call IInventoryWrapCacheFacade.lookupWarehouseIdRandom(List) we allow 15 requests every 40 seconds to have their timeout extended to 1300 ms:

After enabling elastic timeout, the sporadic timeouts are essentially eliminated:

3.1.2 Applicable Scenarios

Elastic timeout is suitable for occasional timeout scenarios such as network jitter, GC pauses, CPU spikes, or cold starts. For widespread timeout issues, deeper analysis and remediation are still required.

4 Conclusion

This article deeply analyzes why an interface with an average latency of 1.5 ms can still produce many 100 ms+ outliers, and proposes an elastic timeout solution at the framework level. The findings highlight that even seemingly trivial operations (e.g., i++) can suffer from occasional long latencies due to GC, CPU scheduling, and other factors.

About the Author

Du Yunjie, Senior Architect, Head of the Zhuanzhuan Architecture Department, Executive Chair of the Zhuanzhuan Technical Committee, Tencent Cloud TVP. Responsible for service governance, MQ, cloud platform, APM, distributed tracing, monitoring systems, configuration center, distributed task scheduling, ID generation, distributed locks, and other core components. WeChat: waterystone. Open to constructive exchanges.

The road is long; embrace change. Keep moving forward.

Tags: backend, performance, RPC, latency, service governance, elastic timeout
Written by Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
