How Netflix’s New Load‑Balancing Algorithm Cuts Errors by Orders of Magnitude
Netflix’s cloud‑gateway team redesigned Zuul’s load‑balancing using a combination of client latency, server utilization, choice‑of‑2 and Join‑the‑Shortest‑Queue algorithms, adding server‑reported metrics, adaptive thresholds and statistical decay, which dramatically reduced error rates, latency and improved traffic distribution in production.
Goal
Netflix’s cloud‑gateway team aims to reduce errors, increase availability and improve fault‑recovery because the service handles over a million requests per second, so even tiny error rates affect members.
Background
Zuul originally used a round‑robin Ribbon load balancer with a blacklist for high‑failure servers. Recent customizations sent less traffic to newly launched servers, but some original clusters still showed error rates far above expectations, especially when only a subset of servers was overloaded (cold‑start, GC spikes, hardware issues).
Guiding Principles
Work within existing load‑balancer framework constraints.
Learn from other teams.
Avoid distributed state; prefer local decisions.
Avoid client‑side configuration and manual tuning.
Prefer adaptive mechanisms over static thresholds.
Load‑Balancing Approach
The best latency data comes from the client view, while utilization data comes from the server itself. Combining both yields the most effective balancing.
Choice‑of‑2 algorithm for server selection.
Primary balancing using a server‑utilization view.
Secondary balancing using a server‑view based on utilization.
Probation and server‑generation mechanisms to protect newly started servers.
Statistical decay to zero over time.
We adopt Join‑the‑Shortest‑Queue (JSQ) together with server‑reported utilization as a second factor.
JSQ Issues
JSQ works well for a single balancer but causes herd behavior across a balancer cluster, leading to overload cycles. Combining JSQ with choice‑of‑2 mitigates this.
Local JSQ views can be misleading when many balancers exist.
Single‑balancer view may differ from reality; for example, balancer A sees no requests to server Y while other balancers are heavily using Y, causing a wrong selection.
Server Utilization Reporting
Servers actively report utilization, giving every balancer a complete picture.
Two implementation options:
Active health‑check endpoint polling each server’s current utilization.
Passive tracking of responses to infer utilization.
We chose the passive approach for simplicity and low overhead.
X-Netflix.server.utilization: <current-utilization>[, target=<target-utilization>]Scoring and Selection
Each server receives a score based on three factors: client health (error‑rate rolling percentage), server utilization (latest reported value), and client utilization (active request count from the current balancer). The total score determines the winning server.
Filtering
When randomly picking two servers for comparison, any server exceeding safe utilization or health thresholds is filtered out. The filter runs per request with a best‑effort N‑try loop before falling back to an unfiltered server.
Probation and Generation
Servers that have not yet responded to a balancer receive only one active request (probation) until they report utilization. Server generation is used to gradually ramp traffic to newly started servers during the first 90 seconds.
Statistical Decay
All statistics used for balancing decay linearly to zero over 30 seconds, preventing permanent blacklisting.
Operational Impact
Removing round‑robin in favor of the new algorithm increases load‑distribution variance but greatly reduces error rates and latency. The system adapts to slower servers by sending them less traffic, which can affect deployment strategies such as red‑black or canary releases.
Synthetic Load‑Test Results
New balancer reduces load‑related errors by several orders of magnitude compared with round‑robin.
Average and tail latency improve three‑fold.
Server‑side error rates drop by an order of magnitude.
Impact on Production Traffic
The new balancer efficiently routes traffic to healthy servers, automatically handling intermittent or sustained degradations without manual intervention, reducing the need for engineers to respond to incidents at odd hours.
Conclusion
This work is shared with the broader proxy/service‑mesh/load‑balancing community. The combination of client latency, server‑reported utilization, choice‑of‑2, JSQ, adaptive thresholds and decay proved effective at Netflix scale, dramatically lowering load‑related error rates and improving traffic distribution.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
