Netflix’s Service‑Level Priority Load Shedding: Protecting User‑Initiated Requests
This article explains how Netflix extended its priority load‑shedding strategy from the API gateway to individual services, detailing the classification of user‑initiated versus pre‑fetch requests, the implementation of partitioned concurrency limiters, CPU‑ and I/O‑based shedding, test results, and real‑world impact on availability.
0 Introduction
In November 2020 Netflix introduced priority load shedding at the Zuul API gateway to keep critical playback traffic ahead of telemetry. This article explores extending that concept to the service layer, especially the video streaming control plane, to improve user experience and system resilience.
1 Evolution of Netflix Load Shedding
Initially implemented in Zuul, the mechanism allowed fine‑grained control of traffic priority. Applying the same logic inside services offers benefits: teams control their own priority rules, it works for internal‑to‑internal communication, and a single cluster can handle both request types, reducing cloud resource waste.
2 Service‑Level Priority Load Shedding
PlayAPI, the backend handling playlist and license requests, distinguishes two request classes:
User‑initiated requests (critical): sent when a user clicks play and directly affect playback.
Pre‑fetch requests (non‑critical): issued optimistically while browsing; failures only increase start‑up latency.
2.1 Problem
A single shared concurrency limiter degraded availability for both request types during traffic spikes or backend latency: pre-fetch peaks throttled user-initiated traffic, and when capacity was insufficient, latency rose for every request.
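The failure mode can be illustrated with a minimal sketch of the pre-partition setup, assuming a plain semaphore-style limiter (hypothetical code, not PlayAPI's actual limiter): because both request classes draw from one pool of permits, a pre-fetch burst leaves nothing for user-initiated traffic.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: one limiter shared by both request classes.
// A pre-fetch burst consumes permits that user-initiated requests
// then cannot acquire, so both classes degrade together.
class SharedLimiter {
    private final Semaphore permits;

    SharedLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Returns true if the request may proceed, false if it is shed. */
    boolean tryAcquire() {
        return permits.tryAcquire();
    }

    void release() {
        permits.release();
    }
}
```

With a limit of 2, two in-flight pre-fetch calls cause the next user-initiated `tryAcquire()` to return false, regardless of its importance.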
Sharding into separate clusters solves the issue but adds operational overhead (CI/CD, autoscaling, metrics, alerts).
2.2 Solution
The partition feature of the Netflix/concurrency-limits Java library was used to create two partitions inside a servlet filter:
User‑initiated partition: guarantees 100 % throughput.
Pre‑fetch partition: consumes only surplus capacity.
Filter code example:
```java
Filter filter = new ConcurrencyLimitServletFilter(
    new ServletLimiterBuilder()
        .named("playapi")
        .partitionByHeader("X-Netflix.Request-Name")
        .partition("user-initiated", 1.0)
        .partition("pre-fetch", 0.0)
        .build());
```

The filter determines request priority from an HTTP header, avoiding body parsing and ensuring the limiter itself is not a bottleneck.
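On the caller side, each request must carry the header the filter partitions on. A hypothetical client-side sketch (the helper name and structure are assumptions; only the header name and partition values come from the filter configuration above):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical caller-side sketch: tag each outgoing request with the
// X-Netflix.Request-Name header that the servlet filter partitions on.
// Header values mirror the partition names in the filter configuration.
class RequestTagger {
    static Map<String, String> headersFor(boolean userClickedPlay) {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-Netflix.Request-Name",
                userClickedPlay ? "user-initiated" : "pre-fetch");
        return headers;
    }
}
```

Classifying via a header keeps the decision cheap on the server: the filter never has to deserialize the request body to decide whether to shed.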
2.3 Testing
Fault‑injection tests added a 2 s delay to pre‑fetch calls (baseline p99 < 200 ms). A baseline instance without priority shedding showed equal availability loss for both request types, while a canary with priority shedding kept user‑initiated availability > 99.4 % and reduced pre‑fetch availability to ~20 %.
2.4 Real‑World Results
During an infrastructure outage, Android pre‑fetch RPS spiked 12×. Priority shedding limited pre‑fetch availability to 20 % but kept user‑initiated availability above 99.4 %.
3 Generic Service Work Priority
An internal library provides four priority buckets inspired by Linux tc‑prio:
CRITICAL: core functionality, never shed.
DEGRADED: user-experience impact, shed gradually.
BEST_EFFORT: non-impacting work, shed as needed.
BULK: background tasks, regularly shed.
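The four buckets can be sketched as a Java enum whose declaration order mirrors the tc-prio bands, so a higher ordinal means lower priority and earlier shedding. The `shedAt` helper is an illustrative assumption, not the internal library's API:

```java
// Minimal sketch of the four priority buckets, ordered from most to
// least important; higher ordinal = shed earlier.
enum PriorityBucket {
    CRITICAL,    // core functionality, never shed
    DEGRADED,    // user-experience impact, shed gradually
    BEST_EFFORT, // non-impacting work, shed as needed
    BULK;        // background tasks, regularly shed

    /** True if this bucket is shed once the cutoff drops to the given ordinal. */
    boolean shedAt(int cutoffOrdinal) {
        return this != CRITICAL && ordinal() >= cutoffOrdinal;
    }
}
```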
Services map incoming requests to buckets via headers or request attributes. Example provider code:
```java
ResourceLimiterRequestPriorityProvider requestPriorityProvider() {
    return contextProvider -> {
        if (contextProvider.getRequest().isCritical()) {
            return PriorityBucket.CRITICAL;
        } else if (contextProvider.getRequest().isHighPriority()) {
            return PriorityBucket.DEGRADED;
        } else if (contextProvider.getRequest().isMediumPriority()) {
            return PriorityBucket.BEST_EFFORT;
        } else {
            return PriorityBucket.BULK;
        }
    };
}
```

3.1 CPU‑Based Load Shedding
Most Netflix services auto‑scale on CPU utilization. After mapping requests to buckets, services start shedding from lower‑priority buckets once target CPU (e.g., 60 %) is exceeded, preserving capacity for critical traffic.
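A hedged sketch of this threshold policy follows; the per-bucket cutoffs are illustrative assumptions, not Netflix's published values, and only the 60 % target CPU figure comes from the article:

```java
// Sketch of CPU-based shedding: once measured CPU utilization crosses a
// bucket's threshold, work in that bucket is shed, starting with BULK
// and moving up the priority ladder. CRITICAL is never shed.
class CpuShedder {
    enum Bucket { CRITICAL, DEGRADED, BEST_EFFORT, BULK }

    static boolean shouldShed(Bucket bucket, double cpuUtilization) {
        switch (bucket) {
            case BULK:        return cpuUtilization > 0.60; // shed first, at target CPU
            case BEST_EFFORT: return cpuUtilization > 0.70; // illustrative threshold
            case DEGRADED:    return cpuUtilization > 0.80; // illustrative threshold
            default:          return false;                 // CRITICAL: never shed
        }
    }
}
```

Staggering the thresholds is what produces the "shed gradually" behavior: as utilization climbs, lower-priority work disappears first, preserving headroom for critical traffic.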
3.2 CPU‑Based Experiments
Experiments forced a service to stay at a 45 % CPU target, then injected load to reach 60 % (non‑critical shed) and 80 % (critical shed). Even at 6× autoscaling capacity, latency remained acceptable and successful RPS stayed stable.
3.3 Anti‑Patterns
Two pitfalls were identified:
No shedding: latency grows for all requests, risking a "death spiral."
Over-shedding ("bloody failure"): aggressive shedding drops successful RPS below pre-shedding levels.
The CPU‑based experiments demonstrated avoidance of both anti‑patterns.
4 IO‑Based Load Shedding
For services limited by backend storage latency, Netflix introduced latency‑based metrics. Services can declare target and max latency per endpoint, and a utilization metric reports the percentage of requests exceeding those targets. The limiter then sheds low‑priority work when utilization crosses thresholds.
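One way to compute such a metric, sketched here under the assumption that utilization is the share of recent requests whose latency exceeded the endpoint's declared SLO target (the class name, window mechanism, and sizes are invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a latency-based utilization metric: track recent request
// latencies in a sliding window and report the percentage that exceeded
// the endpoint's declared SLO target.
class LatencyUtilization {
    private final long sloTargetMillis;
    private final int windowSize;
    private final Deque<Long> window = new ArrayDeque<>();

    LatencyUtilization(long sloTargetMillis, int windowSize) {
        this.sloTargetMillis = sloTargetMillis;
        this.windowSize = windowSize;
    }

    void record(long latencyMillis) {
        if (window.size() == windowSize) window.removeFirst();
        window.addLast(latencyMillis);
    }

    /** Percentage of windowed requests over the SLO target. */
    double utilizationPercent() {
        if (window.isEmpty()) return 0.0;
        long over = window.stream().filter(l -> l > sloTargetMillis).count();
        return 100.0 * over / window.size();
    }
}
```

The limiter would then shed low-priority buckets when this percentage crosses a configured threshold, just as the CPU-based variant sheds on utilization.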
Example utilization metric:
```
utilization(namespace) = {
    overall = 12,
    latency = {
        slo_target = 12,
        slo_max = 0
    },
    system = {
        storage = 17,
        compute = 10
    }
}
```

A key‑file source service used this approach to protect critical writes while shedding low‑priority reads when storage utilization approached capacity, verified by a 50 Gbps read‑stress test.
5 Conclusion
Service‑level priority load shedding has proven essential for maintaining high availability and excellent user experience at Netflix, even under unexpected system pressure.
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.