How Higress AI Gateway Optimizes LLM Load Balancing with Global, Prefix, and GPU‑Aware Algorithms
This article explains why traditional load‑balancing methods fall short for large language model services and introduces Higress AI Gateway's three specialized algorithms—global minimum‑request, prefix‑matching, and GPU‑aware load balancing—detailing their design, Redis‑based implementation, deployment steps, and performance gains.
Background
Traditional load‑balancing algorithms (round‑robin, random, least‑request, consistent hashing) assume homogeneous request cost and focus on generic web‑service metrics. LLM inference workloads break these assumptions because request compute cost varies widely, GPU memory is the primary bottleneck, and KV‑cache can be reused across requests that share the same prompt prefix.
Problems for LLM Services
Task‑cost variance: a long generation may require many more GPU cycles than a short classification.
GPU‑level resource awareness missing: conventional balancers cannot see per‑GPU memory or compute usage, leading to out‑of‑memory rejections or idle GPUs.
KV‑cache reuse ignored: identical or overlapping prefixes are not routed to the same pod, missing cache‑reuse opportunities.
Higress AI Gateway LLM‑aware Load‑Balancing
Three WASM plugins implement LLM‑specific scheduling:
Global minimum‑request load balancing
Prefix‑matching load balancing
GPU‑aware load balancing
Global Minimum‑Request Load Balancing
The plugin stores a counter per LLM pod in Redis. When a request arrives, the pod with the smallest active‑request count is selected. Counters are updated in the HttpStreamDone callback to handle aborted or failed streams, guaranteeing accurate accounting.
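To make the mechanism concrete, here is a minimal sketch of the selection and release logic, written as plain Go against the go-redis client rather than the plugin's actual WASM SDK; the key names, the client, and the function signatures are illustrative assumptions, not the plugin's source.

// Illustrative sketch: pick the pod with the fewest in-flight requests,
// tracked as per-pod counters in Redis. Key names are assumptions.
package main

import (
	"context"
	"errors"
	"math"

	"github.com/redis/go-redis/v9"
)

// pickLeastLoaded returns the pod whose in-flight counter in Redis is
// smallest and increments that counter for the chosen pod.
func pickLeastLoaded(ctx context.Context, rdb *redis.Client, pods []string) (string, error) {
	if len(pods) == 0 {
		return "", errors.New("no pods available")
	}
	best, bestCount := "", int64(math.MaxInt64)
	for _, pod := range pods {
		n, err := rdb.Get(ctx, "active:"+pod).Int64()
		if err == redis.Nil {
			n = 0 // no requests recorded yet for this pod
		} else if err != nil {
			return "", err
		}
		if n < bestCount {
			best, bestCount = pod, n
		}
	}
	if err := rdb.Incr(ctx, "active:"+best).Err(); err != nil {
		return "", err
	}
	return best, nil
}

// releasePod mirrors the role of the HttpStreamDone hook: it decrements the
// counter when a stream completes, is aborted, or fails, so counts stay accurate.
func releasePod(ctx context.Context, rdb *redis.Client, pod string) error {
	return rdb.Decr(ctx, "active:"+pod).Err()
}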
Prefix‑Matching Load Balancing
For multi‑turn conversations the plugin builds a prefix tree in Redis:
Split the OpenAI‑style messages array into blocks delimited by user entries.
Compute a SHA‑1 hash for each block.
Query Redis for the hash of the first block; if absent, fall back to global minimum‑request selection and store the new prefix.
If present, iteratively XOR successive block hashes and query Redis until a matching prefix is found or all blocks are processed.
A matching prefix routes the request to the same pod, enabling KV‑cache reuse, which lowers latency and raises token throughput.
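The sketch below illustrates the hashing and lookup flow just described: SHA‑1 block hashes are combined by XOR and Redis is probed for the deepest known prefix. It is not the plugin's source code; the go-redis client, the "prefix:" key naming, and the fallback contract are assumptions for illustration.

// Illustrative sketch: hash conversation blocks, combine the hashes, and look
// up the longest known prefix in Redis to find the pod holding the KV-cache.
package main

import (
	"context"
	"crypto/sha1"
	"encoding/hex"

	"github.com/redis/go-redis/v9"
)

// xorHash combines two SHA-1 digests byte by byte.
func xorHash(a, b [sha1.Size]byte) [sha1.Size]byte {
	var out [sha1.Size]byte
	for i := range a {
		out[i] = a[i] ^ b[i]
	}
	return out
}

// findPrefixPod probes Redis with cumulative block hashes and returns the pod
// stored for the deepest prefix already known. It returns "" when even the
// first block is unseen; the caller then falls back to global minimum-request
// selection and records the new prefix.
func findPrefixPod(ctx context.Context, rdb *redis.Client, blocks [][]byte) (string, error) {
	if len(blocks) == 0 {
		return "", nil
	}
	cum := sha1.Sum(blocks[0])
	pod, err := rdb.Get(ctx, "prefix:"+hex.EncodeToString(cum[:])).Result()
	if err == redis.Nil {
		return "", nil // first block unseen: no prefix match
	} else if err != nil {
		return "", err
	}
	for _, blk := range blocks[1:] {
		cum = xorHash(cum, sha1.Sum(blk))
		p, err := rdb.Get(ctx, "prefix:"+hex.EncodeToString(cum[:])).Result()
		if err == redis.Nil {
			break // deeper prefix unknown: keep the last hit
		} else if err != nil {
			return "", err
		}
		pod = p
	}
	return pod, nil
}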
GPU‑Aware Load Balancing
The plugin periodically pulls metrics exposed by LLM servers (currently vLLM). Metrics include:
GPU memory usage / KV‑cache occupancy
Queue length of pending requests
LoRA adapter affinity (which adapter a pod is specialized for)
The scheduler selects the pod with the most favorable combination of these metrics. Because the logic runs inside a WASM extension, it works both in Kubernetes and in plain VM deployments without an external sidecar.
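As an illustration of how such a scheduler might weigh the scraped metrics, the sketch below scores each pod from queue length, KV‑cache usage, and LoRA‑adapter residency and picks the lowest score. The struct fields, weights, and affinity bonus are assumptions for illustration, not the plugin's actual formula.

// Illustrative sketch: score candidate pods from periodically scraped vLLM
// metrics and pick the most favorable one. Weights are arbitrary examples.
package main

// PodMetrics holds values the gateway would scrape from a vLLM pod, e.g. the
// pending-request queue length and GPU KV-cache occupancy.
type PodMetrics struct {
	Name           string
	WaitingQueue   float64         // requests queued on the pod
	KVCacheUsage   float64         // fraction of GPU KV-cache in use, 0..1
	LoadedAdapters map[string]bool // LoRA adapters already resident on the pod
}

// pickPod returns the pod with the lowest weighted load score; a pod that
// already hosts the requested LoRA adapter gets a discount, so adapter-affine
// requests stick to pods that do not need to load the adapter first.
func pickPod(pods []PodMetrics, loraAdapter string) string {
	best, bestScore := "", 0.0
	for i, p := range pods {
		score := 1.0*p.WaitingQueue + 2.0*p.KVCacheUsage
		if loraAdapter != "" && p.LoadedAdapters[loraAdapter] {
			score -= 1.0 // affinity bonus for pods with the adapter loaded
		}
		if i == 0 || score < bestScore {
			best, bestScore = p.Name, score
		}
	}
	return best
}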
Deployment Procedure
Provision a Redis instance (e.g., Alibaba Cloud Redis) and record its host, port, and password.
Deploy one or more LLM inference services (e.g., three vLLM nodes serving Llama‑3) on ECS or Kubernetes.
Add the Redis service (DNS type) and the LLM service (fixed address) to the Higress console.
Create an LLM API object that points to the LLM service.
Install the ai-load-balancer plugin and configure it. Example configuration:
lb_policy: prefix_cache
lb_config:
  serviceFQDN: redis.dns
  servicePort: 6379
  username: default
  password: xxxxxxxxxxxx
  redisKeyTTL: 60
Performance Evaluation
Using NVIDIA GenAI‑Perf (average 200 input tokens, 800 output tokens, 20 concurrent sessions, 5 rounds per session), the three plugins were benchmarked against a baseline without load balancing. Both latency and token throughput improved markedly; the prefix‑matching plugin delivered the largest latency reduction thanks to KV‑cache reuse.
