How AI-Driven Gateways Are Evolving to Meet LLM Demands
This article examines how large language model (LLM) applications impose new traffic, security, and scalability requirements on gateways. It then explains how Higress, an open-source gateway built on Envoy, addresses these challenges with hot configuration updates, token-based rate limiting, streaming support, and multi-tenant capabilities.
Gateway Role and New AI‑Era Requirements
Traditional gateways provide data forwarding, protocol conversion, load balancing, access control, authentication, security, content moderation, and API management. In the AI era, user interaction shifts from user-generated content (UGC) to AI-generated content (AIGC), and the infrastructure must now handle long-lived connections, high-latency inference, and large bidirectional bandwidth.
Traffic Characteristics of LLM Applications
Long-lived connections: WebSocket, Server-Sent Events (SSE), and gRPC keep streams open, so configuration changes must not disrupt active streams.
High latency: LLM inference adds seconds of delay, making services vulnerable to slow-request attacks that consume server resources.
Large bandwidth: Prompt-completion streaming consumes far more bandwidth than typical web traffic and requires efficient streaming and memory reclamation.
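To make the streaming pattern concrete, here is a minimal Go sketch of an SSE-style response that flushes each LLM token as soon as it is produced, so the server never buffers the whole completion. It uses only the Go standard library; the helper names (streamTokens, renderSSE, countEvents) are hypothetical and are not part of Higress or Envoy.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// streamTokens writes each token as one Server-Sent Events message and
// flushes right away, so the full completion is never held in memory.
func streamTokens(w http.ResponseWriter, tokens []string) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	flusher, canFlush := w.(http.Flusher)
	for _, tok := range tokens {
		fmt.Fprintf(w, "data: %s\n\n", tok)
		if canFlush {
			flusher.Flush()
		}
	}
}

// renderSSE runs streamTokens against an in-memory recorder and returns
// the raw response body, which is handy for demos and tests.
func renderSSE(tokens []string) string {
	rec := httptest.NewRecorder()
	streamTokens(rec, tokens)
	return rec.Body.String()
}

// countEvents returns how many SSE messages a response body contains.
func countEvents(body string) int {
	return strings.Count(body, "data: ")
}

func main() {
	body := renderSSE([]string{"Hello", ",", " world"})
	fmt.Printf("%d events\n%s", countEvents(body), body)
}
```

A real gateway sits between this kind of producer and the client, which is why it has to forward chunks as they arrive and keep the connection open for the whole generation.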
Limitations of Classic Nginx‑Based Gateways
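Nginx applies a configuration change by reloading the entire configuration, which replaces its worker processes; long-lived downstream and upstream connections held by the old workers are interrupted when those workers shut down. It also lacks built-in security features required for AI workloads, prompting the adoption of Envoy-based open-source gateways.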
Higress: Envoy‑Based Cloud‑Native AI Gateway
Higress (https://github.com/alibaba/higress) uses Envoy as the data plane and provides hot‑reloadable configuration via the xDS APIs:
LDS (Listener Discovery Service) – downstream listeners.
CDS (Cluster Discovery Service) – upstream service clusters.
RDS (Route Discovery Service) – routing rules.
SDS (Secret Discovery Service) – TLS certificates.
Each discovery service can be updated independently, allowing configuration changes without terminating existing connections.
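The idea behind updating configuration without terminating connections can be illustrated with a small Go sketch. This is not Envoy's implementation or the xDS API; it is a simplified model, under the assumption that routing rules are published as immutable snapshots, and the RouteTable and Gateway types are hypothetical. A stream that is already open keeps the snapshot it started with, while new streams see the updated rules.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// RouteTable is a hypothetical, immutable snapshot of routing rules
// (path prefix -> upstream cluster name).
type RouteTable struct {
	Version string
	Routes  map[string]string
}

// Gateway holds the current snapshot behind an atomic pointer.
type Gateway struct {
	routes atomic.Pointer[RouteTable]
}

// UpdateRoutes publishes a new snapshot without locking or restarting.
func (g *Gateway) UpdateRoutes(rt *RouteTable) { g.routes.Store(rt) }

// OpenStream captures the snapshot that is current when a stream starts.
func (g *Gateway) OpenStream() *RouteTable { return g.routes.Load() }

func main() {
	g := &Gateway{}
	g.UpdateRoutes(&RouteTable{Version: "v1", Routes: map[string]string{"/chat": "model-a"}})

	stream := g.OpenStream() // a long-lived stream begins under v1

	// A route update arrives while the stream is still open.
	g.UpdateRoutes(&RouteTable{Version: "v2", Routes: map[string]string{"/chat": "model-b"}})

	fmt.Println("in-flight stream uses:", stream.Version)
	fmt.Println("new streams use:", g.OpenStream().Version)
}
```

Splitting configuration into separately updatable resource types applies the same principle at a coarser grain: a change to routes does not need to touch listeners, clusters, or certificates.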
Security and Rate Limiting for AI Workloads
Beyond traditional request-rate limits (queries per second or per minute, QPS/QPM), Higress introduces token-based throttling (tokens per minute, hour, or day: TPM/TPH/TPD) that measures usage by LLM token count. Limits can be keyed on API, IP, cookie, header, URL parameter, or bearer token, and protective limits (for example, per-IP) mitigate free-tier abuse and scraping.
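As a rough illustration of tokens-per-minute (TPM) throttling, the Go sketch below keeps a fixed one-minute budget per key, where the key could be an IP, a header value, or a bearer token. It is a minimal sketch under stated assumptions, not the Higress plugin: the TokenLimiter type is hypothetical, state lives in process memory, and time is passed in as Unix seconds to keep the example deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenLimiter enforces a tokens-per-minute budget per key using
// fixed one-minute windows. All names here are hypothetical.
type TokenLimiter struct {
	mu     sync.Mutex
	limit  int
	window map[string]int64 // minute index of the key's current window
	used   map[string]int   // LLM tokens spent in that window
}

func NewTokenLimiter(tpm int) *TokenLimiter {
	return &TokenLimiter{
		limit:  tpm,
		window: make(map[string]int64),
		used:   make(map[string]int),
	}
}

// Allow reports whether key may spend the given number of LLM tokens at
// unixSeconds, and records the spend when it is allowed.
func (l *TokenLimiter) Allow(key string, tokens int, unixSeconds int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	minute := unixSeconds / 60
	if l.window[key] != minute { // a new minute started: reset the budget
		l.window[key] = minute
		l.used[key] = 0
	}
	if l.used[key]+tokens > l.limit {
		return false
	}
	l.used[key] += tokens
	return true
}

func main() {
	limiter := NewTokenLimiter(1000) // 1000 tokens per minute per key
	now := time.Now().Unix()
	fmt.Println(limiter.Allow("tenant-a", 800, now)) // within budget
	fmt.Println(limiter.Allow("tenant-a", 300, now)) // would exceed budget
}
```

One design point worth noting: the token count of a streamed completion is typically known only once the response has finished, so a gateway may need to charge usage after the fact and reject the next request rather than the current one. A shared store would also be needed when several gateway instances enforce the same limit.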
Content Safety Integration
Higress can plug into Alibaba Cloud Content Safety to filter harmful, misleading, or illegal LLM outputs, helping keep AI-generated responses compliant.
Streaming Support via Wasm Plugins
Envoy’s Wasm extension model enables true streaming of request and response bodies, reducing memory pressure compared with Nginx’s buffered approach. Example request‑body handler:
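// Invoked for each request-body chunk as it arrives; isLastChunk marks the final chunk.
func onHttpRequestBody(ctx wrapper.HttpContext, config Config, chunk []byte, isLastChunk bool, log wrapper.Log) []byte {
	log.Infof("receive request body chunk:%s, isLastChunk:%v", chunk, isLastChunk)
	// Pass the chunk through unmodified.
	return chunk
}
Massive Multi-Tenant and Domain Management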
Higress shards routing rules at the domain level, supporting tens of thousands of domains. New routes become active within seconds, and on‑demand loading keeps only active domain configurations in memory, dramatically lowering resource consumption.
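The on-demand loading idea can be sketched in Go as a store that loads a domain's routing shard the first time traffic for that domain arrives and drops it when it is no longer needed. This is only a model of the concept, not how Higress implements it; the DomainConfig and LazyStore types and the loader callback are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// DomainConfig is a hypothetical per-domain routing shard.
type DomainConfig struct {
	Domain string
	Routes []string
}

// LazyStore keeps only the shards that have actually been requested.
type LazyStore struct {
	mu     sync.Mutex
	loaded map[string]*DomainConfig
	load   func(domain string) *DomainConfig // fetches a shard on demand
	Loads  int                               // number of loader calls so far
}

func NewLazyStore(load func(string) *DomainConfig) *LazyStore {
	return &LazyStore{loaded: make(map[string]*DomainConfig), load: load}
}

// Get returns the shard for domain, loading it on first use.
func (s *LazyStore) Get(domain string) *DomainConfig {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cfg, ok := s.loaded[domain]; ok {
		return cfg
	}
	cfg := s.load(domain)
	s.loaded[domain] = cfg
	s.Loads++
	return cfg
}

// Evict drops an idle domain's shard to reclaim memory.
func (s *LazyStore) Evict(domain string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.loaded, domain)
}

// Resident reports how many shards are currently held in memory.
func (s *LazyStore) Resident() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.loaded)
}

func main() {
	store := NewLazyStore(func(d string) *DomainConfig {
		return &DomainConfig{Domain: d, Routes: []string{"/v1/chat"}}
	})
	store.Get("a.example.com")
	store.Get("a.example.com") // served from memory, no second load
	fmt.Println("loads:", store.Loads, "resident:", store.Resident())
}
```

With tens of thousands of configured domains, memory then scales with the number of domains receiving traffic rather than the number configured.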
Supported LLM Providers
Higress unifies access to a wide range of models, including Tongyi Qianwen, OpenAI/Azure OpenAI, Baichuan, Zhipu AI, Anthropic Claude, DeepSeek, Ollama and others, providing consistent API, content moderation and A/B testing across providers.
Enterprise AI Gateway Use Cases
Cost-sharing: Token-level observability enables per-department billing.
Stability: Automatic failover to backup models when the primary model is unavailable.
Cost reduction: Vector-based caching of similar requests lowers API spend.
Access control: Rate limits per employee manage overall consumption.
Content safety: Centralized filtering prevents leakage of sensitive data.
In this architecture the gateway functions much like an enterprise service bus (ESB), governing all AI-driven traffic within an organization.
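The stability use case, failing over to a backup model when the primary is unavailable, can be sketched in Go as an ordered list of backends that are tried in turn. This is an illustrative sketch, not Higress code: the Model type, CompleteWithFallback, and the stubModel helper are hypothetical, and real backends would be HTTP calls to providers rather than in-process functions.

```go
package main

import (
	"errors"
	"fmt"
)

// Model is a hypothetical handle on one LLM backend.
type Model struct {
	Name string
	Call func(prompt string) (string, error)
}

// CompleteWithFallback tries each model in order and returns the name and
// output of the first one that succeeds, or the last error if none do.
func CompleteWithFallback(models []Model, prompt string) (string, string, error) {
	lastErr := errors.New("no models configured")
	for _, m := range models {
		out, err := m.Call(prompt)
		if err == nil {
			return m.Name, out, nil
		}
		lastErr = fmt.Errorf("%s: %w", m.Name, err)
	}
	return "", "", lastErr
}

// stubModel builds a fake backend that either echoes the prompt or fails,
// standing in for a real provider in this demo.
func stubModel(name string, healthy bool) Model {
	return Model{Name: name, Call: func(prompt string) (string, error) {
		if !healthy {
			return "", errors.New("backend unavailable")
		}
		return name + ": " + prompt, nil
	}}
}

func main() {
	models := []Model{stubModel("primary", false), stubModel("backup", true)}
	name, out, err := CompleteWithFallback(models, "hello")
	fmt.Println(name, out, err)
}
```

Placing this logic in the gateway means individual applications do not each need their own retry and provider-selection code.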
Future Outlook
As AI agents become primary consumers of web content, APIs will become first‑class citizens. Clear API design, versioning and declarative “operable capabilities” will be essential for AI to understand and act on web pages, driving a new wave of AI‑centric internet evolution.
Alibaba Cloud Native
