How Netflix’s New Load‑Balancing Algorithm Cuts Errors by Orders of Magnitude

Netflix’s cloud‑gateway team redesigned Zuul’s load‑balancing using a combination of client latency, server utilization, choice‑of‑2 and Join‑the‑Shortest‑Queue algorithms, adding server‑reported metrics, adaptive thresholds and statistical decay, which dramatically reduced error rates, latency and improved traffic distribution in production.

MaGe Linux Operations
MaGe Linux Operations
MaGe Linux Operations
How Netflix’s New Load‑Balancing Algorithm Cuts Errors by Orders of Magnitude

Goal

Netflix’s cloud‑gateway team aims to reduce errors, increase availability and improve fault‑recovery because the service handles over a million requests per second, so even tiny error rates affect members.

Background

Zuul originally used a round‑robin Ribbon load balancer with a blacklist for high‑failure servers. Recent customizations sent less traffic to newly launched servers, but some original clusters still showed error rates far above expectations, especially when only a subset of servers was overloaded (cold‑start, GC spikes, hardware issues).

Guiding Principles

Work within existing load‑balancer framework constraints.

Learn from other teams.

Avoid distributed state; prefer local decisions.

Avoid client‑side configuration and manual tuning.

Prefer adaptive mechanisms over static thresholds.

Load‑Balancing Approach

The best latency data comes from the client view, while utilization data comes from the server itself. Combining both yields the most effective balancing.

Choice‑of‑2 algorithm for server selection.

Primary balancing using a server‑utilization view.

Secondary balancing using a server‑view based on utilization.

Probation and server‑generation mechanisms to protect newly started servers.

Statistical decay to zero over time.

We adopt Join‑the‑Shortest‑Queue (JSQ) together with server‑reported utilization as a second factor.

JSQ Issues

JSQ works well for a single balancer but causes herd behavior across a balancer cluster, leading to overload cycles. Combining JSQ with choice‑of‑2 mitigates this.

Local JSQ views can be misleading when many balancers exist.

Single‑balancer view may differ from reality; for example, balancer A sees no requests to server Y while other balancers are heavily using Y, causing a wrong selection.

Server Utilization Reporting

Servers actively report utilization, giving every balancer a complete picture.

Two implementation options:

Active health‑check endpoint polling each server’s current utilization.

Passive tracking of responses to infer utilization.

We chose the passive approach for simplicity and low overhead.

X-Netflix.server.utilization: <current-utilization>[, target=<target-utilization>]

Scoring and Selection

Each server receives a score based on three factors: client health (error‑rate rolling percentage), server utilization (latest reported value), and client utilization (active request count from the current balancer). The total score determines the winning server.

Filtering

When randomly picking two servers for comparison, any server exceeding safe utilization or health thresholds is filtered out. The filter runs per request with a best‑effort N‑try loop before falling back to an unfiltered server.

Probation and Generation

Servers that have not yet responded to a balancer receive only one active request (probation) until they report utilization. Server generation is used to gradually ramp traffic to newly started servers during the first 90 seconds.

Statistical Decay

All statistics used for balancing decay linearly to zero over 30 seconds, preventing permanent blacklisting.

Operational Impact

Removing round‑robin in favor of the new algorithm increases load‑distribution variance but greatly reduces error rates and latency. The system adapts to slower servers by sending them less traffic, which can affect deployment strategies such as red‑black or canary releases.

Synthetic Load‑Test Results

New balancer reduces load‑related errors by several orders of magnitude compared with round‑robin.

Average and tail latency improve three‑fold.

Server‑side error rates drop by an order of magnitude.

Impact on Production Traffic

The new balancer efficiently routes traffic to healthy servers, automatically handling intermittent or sustained degradations without manual intervention, reducing the need for engineers to respond to incidents at odd hours.

Conclusion

This work is shared with the broader proxy/service‑mesh/load‑balancing community. The combination of client latency, server‑reported utilization, choice‑of‑2, JSQ, adaptive thresholds and decay proved effective at Netflix scale, dramatically lowering load‑related error rates and improving traffic distribution.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsperformanceNetflix
MaGe Linux Operations
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.