Analysis and Solutions for Load‑Balancing Issues in QLB‑4 Based TFServing Service Calls
An investigation of QLB‑4‑based TFServing calls revealed uneven traffic, stale routing after scaling, and idle servers, all traced to layer‑4 connection‑hash routing. The team therefore replaced QLB‑4 with a Consul‑driven client‑side load balancer that maintains a dynamic server pool, removes the need for client restarts, and cuts GPU waste.
Load balancing aims to distribute network requests or other workloads evenly across multiple machines, preventing some servers from being overloaded while others remain idle. It can be implemented in software (e.g., Nginx) or hardware.
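The even-distribution goal can be illustrated with the simplest scheduling algorithm, round robin. This is a minimal sketch (not the article's production code); the server names are placeholders:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Distribute requests evenly across a fixed set of servers."""

    def __init__(self, servers):
        self._cycle = cycle(servers)

    def pick(self):
        # Each call returns the next server in turn, so over time
        # every server receives the same share of requests.
        return next(self._cycle)

lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
picks = [lb.pick() for _ in range(6)]
# Six requests land two-per-server: a, b, c, a, b, c.
```

With three servers and six requests, each server handles exactly two; no machine is overloaded while another sits idle.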
In iQIYI's content‑understanding platform, TFServing services are accessed via gRPC and deployed on the QAE platform, using QLB‑4 (a fourth‑layer TCP load balancer) as the load‑balancing solution. During large‑scale online inference, three main problems were observed:
QLB‑4 does not achieve true load balancing; traffic is unevenly distributed, causing some servers to receive significantly more requests than others.
When server instances change (including after a restart), existing clients keep sending traffic to the old instances until the clients themselves are restarted.
If the number of clients is smaller than the number of server instances, some servers never receive any requests.
Experiments were conducted by deploying both gRPC clients and servers on QAE and measuring inbound network traffic on each server. The results (illustrated in the original figures) confirmed the three issues: (a) a 2:1 traffic ratio between two servers, (b) a newly added server C received no traffic, and (c) when servers outnumber clients, server C remained idle.
Root‑cause analysis revealed that QLB‑4 operates at OSI layer 4 using a connection‑hash routing table. The table records the chosen backend for each client connection, so subsequent packets are always forwarded to the same server. This design leads to:
Static server assignment for each client (Problem 1).
Stale routing entries when server topology changes, requiring client restart to rebuild the table (Problem 2).
Fixed one‑to‑one mapping that wastes server resources when there are more servers than clients (Problem 3).
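The mechanism behind all three problems can be reproduced with a small simulation of connection-hash routing. This is an illustrative sketch, not QLB‑4's actual implementation; the hash choice and connection IDs are assumptions:

```python
import hashlib

class ConnectionHashRouter:
    """Layer-4 style routing: the first packet of a connection selects a
    backend by hashing the connection, the choice is recorded in a table,
    and all later packets of that connection reuse the cached entry."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.table = {}  # connection id -> chosen backend

    def route(self, conn_id):
        if conn_id not in self.table:
            digest = hashlib.md5(conn_id.encode()).hexdigest()
            self.table[conn_id] = self.backends[int(digest, 16) % len(self.backends)]
        return self.table[conn_id]

router = ConnectionHashRouter(["A", "B"])
first = router.route("client-1:5000")   # backend pinned for this connection

router.backends.append("C")             # scale out: server C joins the pool
after_scale = router.route("client-1:5000")
# The cached entry still points at the original backend, so C stays idle
# until the client reconnects -- exactly Problems 2 and 3.
```

Because the table entry never expires, a long-lived gRPC connection is welded to one backend for its entire lifetime, regardless of how the server pool changes.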
Two solution paths were proposed:
Solution 1 – Optimizing QLB‑4
Improve the hash routing table by clearing it whenever the server pool changes or by adding an expiration time to entries. Alternatively, remove the hash table entirely and perform real‑time scheduling, though this may reduce throughput under high concurrency.
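The expiration idea can be sketched as a routing table whose entries carry a TTL, so a stale mapping eventually lapses and the next lookup triggers fresh scheduling. This is a hypothetical illustration of the proposal, not QLB‑4 code; the 30-second TTL is an arbitrary example:

```python
import time

class ExpiringRouteTable:
    """Hash routing table whose entries expire, so a flow can be
    re-scheduled after the server pool changes instead of being
    pinned to a stale backend forever."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # conn_id -> (backend, expiry timestamp)

    def put(self, conn_id, backend, now=None):
        now = time.monotonic() if now is None else now
        self._entries[conn_id] = (backend, now + self.ttl)

    def get(self, conn_id, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(conn_id)
        if entry is None or entry[1] < now:
            return None  # missing or expired: caller must re-schedule
        return entry[0]

table = ExpiringRouteTable(ttl_seconds=30)
table.put("client-1", "A", now=0)
cached = table.get("client-1", now=10)   # within TTL -> "A"
expired = table.get("client-1", now=40)  # past TTL -> None, re-route
```

The trade-off noted above still applies: dropping the table entirely and scheduling every packet in real time maximizes freshness but adds per-packet work, which can reduce throughput under high concurrency.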
Solution 2 – Client‑Side Load Balancing
Implement client‑side load balancing using the company’s service registry (Consul) and the Skywalker micro‑service framework. The client periodically fetches the list of healthy TFServing instances from Consul, builds a channel pool, and selects a server for each RPC using a chosen algorithm (e.g., round‑robin). This approach eliminates the extra hop introduced by QLB‑4, reduces latency, and automatically adapts to server changes without requiring client restarts.
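The client-side approach can be sketched as follows. This is a simplified model, not the Skywalker framework's API: `fetch_healthy` is a hypothetical callable standing in for a Consul health query (e.g. `GET /v1/health/service/<name>?passing=true`), the addresses are placeholders, and real code would hold a gRPC channel per address rather than a bare string:

```python
import itertools
import threading

class ClientSideBalancer:
    """Periodically refresh the healthy server list from a registry
    (e.g. Consul) and round-robin each RPC over the current pool."""

    def __init__(self, fetch_healthy):
        self._fetch = fetch_healthy
        self._lock = threading.Lock()
        self._servers = []
        self._rr = None
        self.refresh()

    def refresh(self):
        # Invoked on a timer in a real client; rebuilding the pool here
        # is what lets scale-ups take effect without a client restart.
        servers = sorted(self._fetch())
        with self._lock:
            if servers != self._servers:
                self._servers = servers
                self._rr = itertools.cycle(servers)

    def pick(self):
        # Choose a server for one RPC (round-robin as an example policy).
        with self._lock:
            return next(self._rr)

registry = {"10.0.0.1:8500", "10.0.0.2:8500"}
lb = ClientSideBalancer(lambda: registry)
before = {lb.pick() for _ in range(4)}   # both servers receive traffic

registry.add("10.0.0.3:8500")            # server C registers in Consul
lb.refresh()                             # next poll picks it up
after = {lb.pick() for _ in range(6)}    # all three servers receive traffic
```

Because each RPC is scheduled on the client, there is no intermediate hop and no per-connection routing table to go stale: a newly registered server starts receiving traffic on the next refresh.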
After evaluating both options, the team adopted the client‑side load‑balancing solution. The final deployment stabilized TFServing calls, eliminated uneven traffic distribution, removed the need for client restarts during server scaling, and reduced GPU resource waste, thereby lowering operational costs.
iQIYI Technical Product Team