How Higress AI Gateway Optimizes LLM Load Balancing with Global, Prefix, and GPU‑Aware Algorithms
This article explains why traditional load‑balancing methods fall short for large language model services and introduces Higress AI Gateway's three specialized algorithms—global minimum‑request, prefix‑matching, and GPU‑aware load balancing—detailing their design, Redis‑based implementation, deployment steps, and performance gains.
Background
Traditional load‑balancing algorithms (round‑robin, random, least‑request, consistent hashing) assume homogeneous request cost and focus on generic web‑service metrics. LLM inference workloads break these assumptions because request compute cost varies widely, GPU memory is the primary bottleneck, and KV‑cache can be reused across requests that share the same prompt prefix.
Problems for LLM Services
Task‑cost variance: a long generation may require many more GPU cycles than a short classification.
GPU‑level resource awareness missing: conventional balancers cannot see per‑GPU memory or compute usage, leading to out‑of‑memory rejections or idle GPUs.
KV‑cache reuse ignored: identical or overlapping prefixes are not routed to the same pod, missing cache‑reuse opportunities.
Higress AI Gateway LLM‑aware Load‑Balancing
Three WASM plugins implement LLM‑specific scheduling:
Global minimum‑request load balancing
Prefix‑matching load balancing
GPU‑aware load balancing
Global Minimum‑Request Load Balancing
The plugin stores a counter per LLM pod in Redis. When a request arrives, the pod with the smallest active‑request count is selected. Counters are updated in the HttpStreamDone callback to handle aborted or failed streams, guaranteeing accurate accounting.
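To make the mechanism concrete, here is a minimal sketch of the selection and release logic, written as plain Go against the go-redis client rather than the plugin's actual WASM SDK; the key names, the client, and the function signatures are illustrative assumptions, not the plugin's source.

// Illustrative sketch: pick the pod with the fewest in-flight requests,
// tracked as per-pod counters in Redis. Key names are assumptions.
package main

import (
	"context"
	"errors"
	"math"

	"github.com/redis/go-redis/v9"
)

// pickLeastLoaded returns the pod whose in-flight counter in Redis is
// smallest and increments that counter for the chosen pod.
func pickLeastLoaded(ctx context.Context, rdb *redis.Client, pods []string) (string, error) {
	if len(pods) == 0 {
		return "", errors.New("no pods available")
	}
	best, bestCount := "", int64(math.MaxInt64)
	for _, pod := range pods {
		n, err := rdb.Get(ctx, "active:"+pod).Int64()
		if err == redis.Nil {
			n = 0 // no requests recorded yet for this pod
		} else if err != nil {
			return "", err
		}
		if n < bestCount {
			best, bestCount = pod, n
		}
	}
	if err := rdb.Incr(ctx, "active:"+best).Err(); err != nil {
		return "", err
	}
	return best, nil
}

// releasePod mirrors the role of the HttpStreamDone hook: it decrements the
// counter when a stream completes, is aborted, or fails, so counts stay accurate.
func releasePod(ctx context.Context, rdb *redis.Client, pod string) error {
	return rdb.Decr(ctx, "active:"+pod).Err()
}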
Prefix‑Matching Load Balancing
For multi‑turn conversations the plugin builds a prefix tree in Redis:
Split the OpenAI‑style messages array into blocks delimited by user entries.
Compute a SHA‑1 hash for each block.
Query Redis for the hash of the first block; if absent, fall back to global minimum‑request selection and store the new prefix.
If present, iteratively XOR successive block hashes and query Redis until a matching prefix is found or all blocks are processed.
A matching prefix routes the request to the same pod, enabling KV‑cache reuse, which lowers latency and raises token throughput.
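The sketch below illustrates the hashing and lookup flow just described: SHA‑1 block hashes are combined by XOR and Redis is probed for the deepest known prefix. It is not the plugin's source code; the go-redis client, the "prefix:" key naming, and the fallback contract are assumptions for illustration.

// Illustrative sketch: hash conversation blocks, combine the hashes, and look
// up the longest known prefix in Redis to find the pod holding the KV-cache.
package main

import (
	"context"
	"crypto/sha1"
	"encoding/hex"

	"github.com/redis/go-redis/v9"
)

// xorHash combines two SHA-1 digests byte by byte.
func xorHash(a, b [sha1.Size]byte) [sha1.Size]byte {
	var out [sha1.Size]byte
	for i := range a {
		out[i] = a[i] ^ b[i]
	}
	return out
}

// findPrefixPod probes Redis with cumulative block hashes and returns the pod
// stored for the deepest prefix already known. It returns "" when even the
// first block is unseen; the caller then falls back to global minimum-request
// selection and records the new prefix.
func findPrefixPod(ctx context.Context, rdb *redis.Client, blocks [][]byte) (string, error) {
	if len(blocks) == 0 {
		return "", nil
	}
	cum := sha1.Sum(blocks[0])
	pod, err := rdb.Get(ctx, "prefix:"+hex.EncodeToString(cum[:])).Result()
	if err == redis.Nil {
		return "", nil // first block unseen: no prefix match
	} else if err != nil {
		return "", err
	}
	for _, blk := range blocks[1:] {
		cum = xorHash(cum, sha1.Sum(blk))
		p, err := rdb.Get(ctx, "prefix:"+hex.EncodeToString(cum[:])).Result()
		if err == redis.Nil {
			break // deeper prefix unknown: keep the last hit
		} else if err != nil {
			return "", err
		}
		pod = p
	}
	return pod, nil
}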
GPU‑Aware Load Balancing
The plugin periodically pulls metrics exposed by LLM servers (currently vLLM). Metrics include:
GPU memory usage / KV‑cache occupancy
Queue length of pending requests
LoRA adapter affinity (which adapter a pod is specialized for)
The scheduler selects the pod with the most favorable combination of these metrics. Because the logic runs inside a WASM extension, it works both in Kubernetes and in plain VM deployments without an external sidecar.
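As an illustration of how such a scheduler might weigh the scraped metrics, the sketch below scores each pod from queue length, KV‑cache usage, and LoRA‑adapter residency and picks the lowest score. The struct fields, weights, and affinity bonus are assumptions for illustration, not the plugin's actual formula.

// Illustrative sketch: score candidate pods from periodically scraped vLLM
// metrics and pick the most favorable one. Weights are arbitrary examples.
package main

// PodMetrics holds values the gateway would scrape from a vLLM pod, e.g. the
// pending-request queue length and GPU KV-cache occupancy.
type PodMetrics struct {
	Name           string
	WaitingQueue   float64         // requests queued on the pod
	KVCacheUsage   float64         // fraction of GPU KV-cache in use, 0..1
	LoadedAdapters map[string]bool // LoRA adapters already resident on the pod
}

// pickPod returns the pod with the lowest weighted load score; a pod that
// already hosts the requested LoRA adapter gets a discount, so adapter-affine
// requests stick to pods that do not need to load the adapter first.
func pickPod(pods []PodMetrics, loraAdapter string) string {
	best, bestScore := "", 0.0
	for i, p := range pods {
		score := 1.0*p.WaitingQueue + 2.0*p.KVCacheUsage
		if loraAdapter != "" && p.LoadedAdapters[loraAdapter] {
			score -= 1.0 // affinity bonus for pods with the adapter loaded
		}
		if i == 0 || score < bestScore {
			best, bestScore = p.Name, score
		}
	}
	return best
}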
Deployment Procedure
Provision a Redis instance (e.g., Alibaba Cloud Redis) and record its host, port, and password.
Deploy one or more LLM inference services (e.g., three vLLM nodes serving Llama‑3) on ECS or Kubernetes.
Add the Redis service (DNS type) and the LLM service (fixed address) to the Higress console.
Create an LLM API object that points to the LLM service.
Install the ai-load-balancer plugin and configure it. Example configuration:
lb_policy: prefix_cache
lb_config:
  serviceFQDN: redis.dns
  servicePort: 6379
  username: default
  password: xxxxxxxxxxxx
  redisKeyTTL: 60
Performance Evaluation
Using NVIDIA GenAI‑Perf (average 200 input tokens, 800 output tokens, 20 concurrent sessions, 5 rounds per session), the three plugins were benchmarked against a baseline without load balancing. Both latency and token throughput improved markedly; the prefix‑matching plugin delivered the largest latency reduction thanks to KV‑cache reuse.
