How 1% Slow Requests Make 63% of Responses Slow: Mastering Tail Latency in Distributed Systems
This article explains why a tiny fraction of slow requests can dominate overall service latency, explores their causes—especially slow nodes on the query path—and shows how proper tracing and analysis can improve response times and cut infrastructure costs.
1. Tail Requests
When discussing response time, people often think of the average, but large internet companies focus on the 99th percentile: the 99%-of-requests metric is central to SLA definitions (see the earlier article on SLI, SLO, and SLA for details).
The remaining 1% of requests are tail requests (slow requests), defined here as responses that take longer than one second.
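To see why the average alone misleads, here is a toy illustration (the 20 ms and 1,500 ms figures are made up): the mean looks healthy while the 99th percentile exposes the tail.

```python
# Hypothetical latencies in ms: 99 fast requests and 1 slow one.
latencies = sorted([20.0] * 99 + [1500.0])

mean = sum(latencies) / len(latencies)
# Simple percentile convention: the value only the top 1% exceed.
p99 = latencies[99 * len(latencies) // 100]

print(f"mean = {mean} ms, p99 = {p99} ms")  # mean = 34.8 ms, p99 = 1500.0 ms
```

A mean of ~35 ms hides the fact that 1% of users waited 1.5 s, which is exactly why SLAs are written against percentiles.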
Common causes in search engines include:
1. Rare query terms that force disk access;
2. Queries matching too many results, lengthening processing;
3. Maliciously complex query combinations;
4. Slow nodes on the query path.
This article focuses on the fourth cause: slow nodes on the query path.
Analyzing tail requests improves overall response time and saves cost.
1) Improving overall response time
In distributed systems, a request may call many other services. Optimizing tail latency can improve the whole service's response time.
Assume only 1% of a service's responses exceed 1 s. If the front end fans a query out to 100 such instances in parallel, a request is slow whenever any one instance is slow, so more than 63% of requests see latency above 1 s (1 − 0.99¹⁰⁰ ≈ 63.4%).
Even if only 0.01% of responses exceed 1 s, with a fan-out of 2,000 instances the proportion still rises above 18% (1 − 0.9999²⁰⁰⁰ ≈ 18.1%).
Source: Jeffrey Dean and Luiz André Barroso, "The Tail at Scale," Communications of the ACM, 2013.
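The arithmetic behind these figures: assuming backends fail to respond quickly independently, a request is slow if any of its fan-out targets is slow. A quick check (the function name is mine):

```python
def slow_fraction(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent
    backends is slow, given a per-backend slow rate `p_slow`."""
    return 1 - (1 - p_slow) ** fanout

print(f"{slow_fraction(0.01, 100):.1%}")     # ~63.4%: 1% slow, fan-out 100
print(f"{slow_fraction(0.0001, 2000):.1%}")  # ~18.1%: 0.01% slow, fan-out 2000
```

The independence assumption is a simplification, but it is enough to show how fan-out amplifies even a tiny per-node tail.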
2) Saving cost
SLAs typically target sub-second response times; tail latency hurts revenue by reducing ad clicks and user conversion. Reducing it often means using existing resources better rather than adding machines.
Improving resource usage by 1% can save millions of dollars in data‑center costs.
2. Analysis Tools
Not all profilers suit tail-latency work. Sampling profilers such as perf aggregate samples into averages, so rare slow requests disappear into the noise.
For example, a typical search request involves hundreds of machines, as shown in the diagram.
The thick green line is the incoming RPC, the thin green lines are the first fan-out, and the blue lines are the second fan-out; RPCs to leaf nodes beyond the second fan-out are omitted. Each leaf node spends 1–2 ms processing before results are aggregated.
Because of massive east‑west traffic, network bandwidth requirements are unprecedented, and performance analysis tools must meet new challenges.
Google’s Dapper provides end-to-end distributed tracing, but because it samples traces it can still miss unpredictable slow requests.
Google engineers therefore built a tracing system that captures timestamps for every request, at a CPU overhead below 1%.
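As a rough sketch of the idea only (not Google's actual implementation), an always-on tracer can append per-RPC timestamps to a bounded ring buffer, so slow outliers remain inspectable after the fact without unbounded memory growth:

```python
import time
from collections import deque
from contextlib import contextmanager

# Toy always-on tracer: record (name, start_ns, end_ns) for every RPC
# into a bounded ring buffer; old entries are evicted automatically.
TRACE = deque(maxlen=100_000)

@contextmanager
def traced(name):
    start = time.monotonic_ns()
    try:
        yield
    finally:
        TRACE.append((name, start, time.monotonic_ns()))

# Usage: wrap each RPC; here we simulate ~2 ms of leaf-node work.
with traced("leaf_rpc"):
    time.sleep(0.002)

name, start, end = TRACE[-1]
```

Appending a tuple to a deque costs far less than the RPC itself, which is what keeps overhead small enough to leave on in production.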
3. Case Study
Slow RPC
In a search request, one slow RPC can dominate total latency.
Only a few RPCs are slow, so aggregated data alone is insufficient to find them.
The trace system runs continuously, using only 1% CPU.
Examining a slow RPC on a 16-CPU machine handling 40 RPCs shows most of its time spent waiting for a lock.
The thread-level view shows the thread blocked on that same lock, a wait caused by CPU-affinity scheduling.
Result: with CPU affinity, threads queue for the CPU they last ran on, gaining cache locality at the cost of increased latency.
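To see how lock-wait time surfaces in such a trace, here is a small illustrative sketch (TimedLock is a made-up helper, not the system from the case study) that records how long each acquire blocks:

```python
import threading
import time

class TimedLock:
    """Wrap a lock to record how long each acquire() blocks —
    the kind of per-thread wait data a thread-level trace exposes."""
    def __init__(self):
        self._lock = threading.Lock()
        self.wait_ns = []  # appended under the lock, so no extra guard needed

    def __enter__(self):
        t0 = time.monotonic_ns()
        self._lock.acquire()
        self.wait_ns.append(time.monotonic_ns() - t0)
        return self

    def __exit__(self, *exc):
        self._lock.release()

lock = TimedLock()

def worker():
    with lock:
        time.sleep(0.005)  # hold the lock for ~5 ms of "work"

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# The last thread to acquire waits roughly three hold times (~15 ms).
```

Sorting `lock.wait_ns` per lock site is exactly the kind of view that separates "the work is slow" from "the thread is waiting".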
Conclusion
Traditional sampling profilers cannot trace such complex performance issues.
The trace framework captures enough information at under 1% CPU overhead, but building it requires deep hardware knowledge and careful software design to avoid perturbing the system being measured.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.