Improving Nginx Load Balancing with the Virtual Node Smooth Weighted Round‑Robin (VNSWRR) Algorithm in Tengine
This article analyzes the shortcomings of Nginx's native Smooth Weighted Round‑Robin (SWRR) algorithm in large‑scale application gateway scenarios, presents a real‑world case where weight adjustments cause traffic spikes, and introduces the VNSWRR algorithm that achieves O(1) selection time, smoother weight handling, and up to 60% higher QPS performance.
Preface
In Alibaba's seven‑layer traffic entry gateway (Application Gateway) scenario, the official Nginx Smooth Weighted Round‑Robin (SWRR) load‑balancing algorithm can no longer meet the requirements. Tengine implements a new algorithm called Virtual Node Smooth Weighted Round‑Robin (VNSWRR), which not only resolves the defects of SWRR but also improves QPS handling capability by roughly 60% compared with Nginx's official SWRR.
Problem
Tengine's access layer uses a self‑developed dynamic upstream module to discover services at runtime, automatically sensing backend instance scaling, weight adjustments, and health checks. While this enables useful operations such as adjusting a single machine's weight for real‑time traffic shaping, under Nginx's native SWRR algorithm these operations can cause irreversible issues.
In the Application Gateway scenario, increasing a machine's weight causes its QPS to surge instantly; the diagram shows App2‑host‑A’s QPS spiking when its weight is set to 2.
SWRR’s time complexity is O(N); with a large number of backends, Nginx’s processing capacity degrades linearly.
Therefore, optimizing Tengine’s load‑balancing strategy and performance is urgent.
Native SWRR Algorithm Analysis
SWRR (Smooth Weighted Round‑Robin) adds a "smooth" property to traditional weighted round‑robin. The following simple example illustrates the algorithm.
Assume three servers A, B, C with weights 5, 1, 1. The array s holds the servers, n is the number of servers, each server’s current weight cw is initialized to 0, effective weight ew equals its configured weight, tw is the sum of all ew in the current round, and best is the selected server. The pseudo‑code is:
best = NULL;
tw = 0;
for(i = random() % n; i != i || flag; i = (i + 1) % n) {
flag = 0;
s[i].cw += s[i].ew;
tw += s[i].ew;
if(best == NULL || s[i].cw > best->cw) {
best = &s[i];
}
}
best->cw -= tw;
return best;The selection order for the example is {A, A, B, A, C, A, A}, whereas a plain WRR would produce {C, B, A, A, A, A, A}. SWRR’s advantages are smoothness and dispersion.
Weight Increase Causing a Disaster
In practice, when a machine’s weight is raised from 1 to 2, the traffic to that machine instantly jumps to about half of the total traffic, then gradually returns to the expected proportion. The cause is that Tengine dynamically perceives weight changes, but SWRR selects the machine with the highest weight on the first request after a change, leading to a burst.
Performance Drop at Large Scale
When configuring 2000 backends in an upstream and stress‑testing Nginx, the function ngx_http_upstream_get_peer consumes 39% of CPU because SWRR’s selection complexity is O(N). Each request may iterate close to 2000 times to find the target server.
Test environment:
CPU: Intel(R) Xeon(R) CPU E5‑2682 v4 @ 2.50GHz
Tool:
./wrk -t25 -d5m -c500 'http://ip/t2000'Increasing the number of upstream servers by 500 reduces Nginx’s QPS by about 10% and increases response time (RT) by ~1 ms.
Improved VNSWRR Algorithm
Classic WRR can achieve O(1) time complexity but may cause uneven traffic in small‑flow scenarios. VNSWRR aims to retain SWRR’s smoothness while achieving O(1) selection.
Example: three servers A, B, C with weights 1, 2, 3. N is the number of backend servers, TW is the total weight sum.
Key Points
Virtual nodes are initialized in the exact order of SWRR selection to ensure sufficient dispersion.
Virtual nodes are created in batches at runtime; each batch initializes min(N, max) nodes to avoid concentrated heavy computation.
Algorithm Description
When Tengine starts or detects backend changes, it builds TW virtual nodes but initially only initializes N nodes.
Each process picks a random start position in the virtual‑node list (e.g., position B in the diagram).
Incoming requests iterate from the start position; when reaching the end of the already‑initialized virtual‑node array, the next batch of virtual nodes is initialized. After all virtual nodes are initialized, no further initialization occurs.
This reduces the algorithm’s time complexity from O(N) to O(1) and eliminates the weight‑increase burst problem. The following chart shows smooth QPS growth when a machine’s weight changes from 1 to 2.
Algorithm Effect Comparison
Under the same stress test (wrk, 500 concurrent long‑connections, 2000 upstream servers), Nginx’s SWRR CPU consumption is 39% while VNSWRR’s is only about 0.27% (function ngx_http_upstream_get_VNSWRR).
QPS of VNSWRR is roughly 60% higher than SWRR.
When varying the number of upstream servers, SWRR’s QPS drops ~10% per additional 500 servers and RT increases ~1 ms, while VNSWRR’s QPS and RT remain stable.
Conclusion
Large‑scale traffic exposes Nginx’s limitations; business demands drive new technology, and VNSWRR inherits SWRR’s smoothness while eliminating its defects. By reducing time complexity to O(1), VNSWRR improves QPS handling by about 60% over the official Nginx SWRR algorithm in massive backend scenarios.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
