Netflix’s Service‑Level Priority Load Shedding: Protecting User‑Initiated Requests
This article explains how Netflix extended its priority load‑shedding strategy from the API gateway to individual services, detailing the classification of user‑initiated versus pre‑fetch requests, the implementation of partitioned concurrency limiters, CPU‑ and I/O‑based shedding, test results, and real‑world impact on availability.
0 Introduction
In November 2020 Netflix introduced priority load shedding at the Zuul API gateway to keep critical playback traffic ahead of telemetry. This article explores extending that concept to the service layer, especially the video streaming control plane, to improve user experience and system resilience.
1 Evolution of Netflix Load Shedding
Initially implemented in Zuul, the mechanism allowed fine‑grained control of traffic priority. Applying the same logic inside services offers benefits: teams control their own priority rules, it works for internal‑to‑internal communication, and a single cluster can handle both request types, reducing cloud resource waste.
2 Service‑Level Priority Load Shedding
PlayAPI, the backend handling playlist and license requests, distinguishes two request classes:
User‑initiated requests (critical): sent when a user clicks play and directly affect playback.
Pre‑fetch requests (non‑critical): issued optimistically while browsing; failures only increase start‑up latency.
2.1 Problem
A single shared concurrency limiter degraded availability for both request types during traffic spikes or backend latency: pre-fetch peaks throttled user-initiated traffic, and when capacity was insufficient, latency rose for every request.
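The failure mode can be illustrated with a minimal sketch of the pre-partition setup, assuming a plain semaphore-style limiter (hypothetical code, not PlayAPI's actual limiter): because both request classes draw from one pool of permits, a pre-fetch burst leaves nothing for user-initiated traffic.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: one limiter shared by both request classes.
// A pre-fetch burst consumes permits that user-initiated requests
// then cannot acquire, so both classes degrade together.
class SharedLimiter {
    private final Semaphore permits;

    SharedLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Returns true if the request may proceed, false if it is shed. */
    boolean tryAcquire() {
        return permits.tryAcquire();
    }

    void release() {
        permits.release();
    }
}
```

With a limit of 2, two in-flight pre-fetch calls cause the next user-initiated `tryAcquire()` to return false, regardless of its importance.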
Sharding into separate clusters solves the issue but adds operational overhead (CI/CD, autoscaling, metrics, alerts).
2.2 Solution
The partition feature of the Netflix/concurrency-limits Java library was used to create two partitions inside a servlet filter:
User‑initiated partition: guarantees 100 % throughput.
Pre‑fetch partition: consumes only surplus capacity.
Filter code example:
```java
Filter filter = new ConcurrencyLimitServletFilter(
    new ServletLimiterBuilder()
        .named("playapi")
        .partitionByHeader("X-Netflix.Request-Name")
        .partition("user-initiated", 1.0)
        .partition("pre-fetch", 0.0)
        .build());
```

The filter determines request priority from an HTTP header, avoiding body parsing and ensuring the limiter itself is not a bottleneck.
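On the caller side, each request must carry the header the filter partitions on. A hypothetical client-side sketch (the helper name and structure are assumptions; only the header name and partition values come from the filter configuration above):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical caller-side sketch: tag each outgoing request with the
// X-Netflix.Request-Name header that the servlet filter partitions on.
// Header values mirror the partition names in the filter configuration.
class RequestTagger {
    static Map<String, String> headersFor(boolean userClickedPlay) {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-Netflix.Request-Name",
                userClickedPlay ? "user-initiated" : "pre-fetch");
        return headers;
    }
}
```

Classifying via a header keeps the decision cheap on the server: the filter never has to deserialize the request body to decide whether to shed.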
2.3 Testing
Fault‑injection tests added a 2 s delay to pre‑fetch calls (baseline p99 < 200 ms). A baseline instance without priority shedding showed equal availability loss for both request types, while a canary with priority shedding kept user‑initiated availability > 99.4 % and reduced pre‑fetch availability to ~20 %.
2.4 Real‑World Results
During an infrastructure outage, Android pre‑fetch RPS spiked 12×. Priority shedding limited pre‑fetch availability to 20 % but kept user‑initiated availability above 99.4 %.
3 Generic Service Work Priority
An internal library provides four priority buckets inspired by Linux tc‑prio:
CRITICAL: core functionality, never shed.
DEGRADED: user-experience impact, shed gradually.
BEST_EFFORT: non-impacting work, shed as needed.
BULK: background tasks, regularly shed.
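The four buckets can be sketched as a Java enum whose declaration order mirrors the tc-prio bands, so a higher ordinal means lower priority and earlier shedding. The `shedAt` helper is an illustrative assumption, not the internal library's API:

```java
// Minimal sketch of the four priority buckets, ordered from most to
// least important; higher ordinal = shed earlier.
enum PriorityBucket {
    CRITICAL,    // core functionality, never shed
    DEGRADED,    // user-experience impact, shed gradually
    BEST_EFFORT, // non-impacting work, shed as needed
    BULK;        // background tasks, regularly shed

    /** True if this bucket is shed once the cutoff drops to the given ordinal. */
    boolean shedAt(int cutoffOrdinal) {
        return this != CRITICAL && ordinal() >= cutoffOrdinal;
    }
}
```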
Services map incoming requests to buckets via headers or request attributes. Example provider code:
```java
ResourceLimiterRequestPriorityProvider requestPriorityProvider() {
    return contextProvider -> {
        if (contextProvider.getRequest().isCritical()) {
            return PriorityBucket.CRITICAL;
        } else if (contextProvider.getRequest().isHighPriority()) {
            return PriorityBucket.DEGRADED;
        } else if (contextProvider.getRequest().isMediumPriority()) {
            return PriorityBucket.BEST_EFFORT;
        } else {
            return PriorityBucket.BULK;
        }
    };
}
```

3.1 CPU‑Based Load Shedding
Most Netflix services auto‑scale on CPU utilization. After mapping requests to buckets, services start shedding from lower‑priority buckets once target CPU (e.g., 60 %) is exceeded, preserving capacity for critical traffic.
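A hedged sketch of this threshold policy follows; the per-bucket cutoffs are illustrative assumptions, not Netflix's published values, and only the 60 % target CPU figure comes from the article:

```java
// Sketch of CPU-based shedding: once measured CPU utilization crosses a
// bucket's threshold, work in that bucket is shed, starting with BULK
// and moving up the priority ladder. CRITICAL is never shed.
class CpuShedder {
    enum Bucket { CRITICAL, DEGRADED, BEST_EFFORT, BULK }

    static boolean shouldShed(Bucket bucket, double cpuUtilization) {
        switch (bucket) {
            case BULK:        return cpuUtilization > 0.60; // shed first, at target CPU
            case BEST_EFFORT: return cpuUtilization > 0.70; // illustrative threshold
            case DEGRADED:    return cpuUtilization > 0.80; // illustrative threshold
            default:          return false;                 // CRITICAL: never shed
        }
    }
}
```

Staggering the thresholds is what produces the "shed gradually" behavior: as utilization climbs, lower-priority work disappears first, preserving headroom for critical traffic.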
3.2 CPU‑Based Experiments
Experiments forced a service to stay at a 45 % CPU target, then injected load to reach 60 % (non‑critical shed) and 80 % (critical shed). Even at 6× autoscaling capacity, latency remained acceptable and successful RPS stayed stable.
3.3 Anti‑Patterns
Two pitfalls were identified:
No shedding: latency grows for all requests, risking a "death spiral."
Over-shedding ("bloody failure"): aggressive shedding drops successful RPS below pre-shedding levels.
The CPU‑based experiments demonstrated avoidance of both anti‑patterns.
4 IO‑Based Load Shedding
For services limited by backend storage latency, Netflix introduced latency‑based metrics. Services can declare target and max latency per endpoint, and a utilization metric reports the percentage of requests exceeding those targets. The limiter then sheds low‑priority work when utilization crosses thresholds.
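One way to compute such a metric, sketched here under the assumption that utilization is the share of recent requests whose latency exceeded the endpoint's declared SLO target (the class name, window mechanism, and sizes are invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a latency-based utilization metric: track recent request
// latencies in a sliding window and report the percentage that exceeded
// the endpoint's declared SLO target.
class LatencyUtilization {
    private final long sloTargetMillis;
    private final int windowSize;
    private final Deque<Long> window = new ArrayDeque<>();

    LatencyUtilization(long sloTargetMillis, int windowSize) {
        this.sloTargetMillis = sloTargetMillis;
        this.windowSize = windowSize;
    }

    void record(long latencyMillis) {
        if (window.size() == windowSize) window.removeFirst();
        window.addLast(latencyMillis);
    }

    /** Percentage of windowed requests over the SLO target. */
    double utilizationPercent() {
        if (window.isEmpty()) return 0.0;
        long over = window.stream().filter(l -> l > sloTargetMillis).count();
        return 100.0 * over / window.size();
    }
}
```

The limiter would then shed low-priority buckets when this percentage crosses a configured threshold, just as the CPU-based variant sheds on utilization.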
Example utilization metric:
```
utilization(namespace) = {
    overall = 12,
    latency = {
        slo_target = 12,
        slo_max = 0
    },
    system = {
        storage = 17,
        compute = 10
    }
}
```

A key‑file source service used this approach to protect critical writes while shedding low‑priority reads when storage utilization approached capacity, verified by a 50 Gbps read‑stress test.
5 Conclusion
Service‑level priority load shedding has proven essential for maintaining high availability and excellent user experience at Netflix, even under unexpected system pressure.
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.