
How AI-Driven Gateways Are Evolving to Meet LLM Demands

The article examines how AI-era large language model (LLM) applications impose new traffic, security, and scalability requirements on gateways, and explains how the Envoy‑based open‑source Higress gateway addresses these challenges with hot configuration updates, token‑based rate limiting, streaming support, and multi‑tenant capabilities.

Alibaba Cloud Native

Jul 15, 2024


Gateway Role and New AI‑Era Requirements

Traditional gateways provide data forwarding, protocol conversion, load balancing, access control, authentication, security, content moderation and API management. In the AI era, user interaction is shifting from user‑generated content (UGC) to AI‑generated content (AIGC). This creates new infrastructure demands: long‑lived connections, high‑latency inference, and high bandwidth in both directions.

Traffic Characteristics of LLM Applications

Long‑lived connections: WebSocket, Server‑Sent Events (SSE) and gRPC keep streams open, so configuration changes must not disrupt active streams.

High latency: LLM inference adds seconds of delay, making services vulnerable to slow‑request attacks that consume server resources.

Large bandwidth: Prompt‑completion streaming consumes far more bandwidth than typical web traffic and requires efficient streaming and memory reclamation.

Limitations of Classic Nginx‑Based Gateways

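Nginx applies configuration changes by reloading its entire configuration and retiring the old worker processes. This can drop long‑lived downstream and upstream connections, such as WebSocket or SSE streams. Nginx also lacks built‑in security features that AI workloads need, which has driven the adoption of Envoy‑based open‑source gateways.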

Higress: Envoy‑Based Cloud‑Native AI Gateway

Higress (https://github.com/alibaba/higress) uses Envoy as the data plane and provides hot‑reloadable configuration via the xDS APIs:

LDS (Listener Discovery Service) – downstream listeners.

CDS (Cluster Discovery Service) – upstream service clusters.

RDS (Route Discovery Service) – routing rules.

SDS (Secret Discovery Service) – TLS certificates.

Each discovery service can be updated independently, allowing configuration changes without terminating existing connections.
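Why independent updates preserve connections can be shown with a toy model. It is not Envoy or go-control-plane code, and the `Listener`, `Conn` and cluster names are hypothetical. The listener (the LDS side) holds its route table behind an atomic pointer. An RDS‑style push swaps that table, while an already‑open connection keeps running and routes its next request with the new rules.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// RouteTable maps an exact request path to an upstream cluster name.
type RouteTable map[string]string

// Listener models a downstream listener. Its routes sit behind an atomic
// pointer, so a route update never rebuilds the listener or its connections.
type Listener struct {
	routes atomic.Pointer[RouteTable]
}

// UpdateRoutes installs a new route table, analogous to an RDS push.
func (l *Listener) UpdateRoutes(rt RouteTable) { l.routes.Store(&rt) }

// Conn is a long-lived downstream connection bound to a listener.
type Conn struct{ l *Listener }

// Route resolves one request against the table that is current right now.
func (c *Conn) Route(path string) string {
	rt := c.l.routes.Load()
	if rt == nil {
		return ""
	}
	return (*rt)[path]
}

func main() {
	l := &Listener{}
	l.UpdateRoutes(RouteTable{"/v1/chat/completions": "qwen-cluster"})
	conn := &Conn{l: l} // opened once and never re-established below

	fmt.Println(conn.Route("/v1/chat/completions")) // prints "qwen-cluster"
	l.UpdateRoutes(RouteTable{"/v1/chat/completions": "openai-cluster"})
	fmt.Println(conn.Route("/v1/chat/completions")) // prints "openai-cluster"
}
```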

Security and Rate Limiting for AI Workloads

Beyond traditional request‑rate limits (QPS/QPM, queries per second or minute), Higress introduces token‑based throttling. TPM/TPH/TPD (tokens per minute, hour or day) measure usage by LLM token count rather than by request count. Limits can be keyed by API, IP address, cookie, header, URL parameter or bearer token. Protective limits (for example, per IP address) mitigate free‑tier abuse and scraping.

Content Safety Integration

Higress can plug into Alibaba Cloud Content Safety to filter harmful, misleading or illegal LLM outputs, ensuring compliance of AI‑generated responses.
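When responses are streamed, moderation has to work on partial output, and a problematic phrase may be split across two chunks. The sketch below shows one way to handle chunk boundaries. A plain keyword list stands in for the external moderation service, so this is a swapped‑in technique, not the Alibaba Cloud Content Safety API. `StreamChecker` and the example term are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// StreamChecker scans streamed output for blocked phrases. The keyword list
// is a stand-in for a call to an external moderation service.
type StreamChecker struct {
	terms  []string
	maxLen int    // length in bytes of the longest term
	carry  string // tail of earlier chunks, kept for cross-chunk matches
}

func NewStreamChecker(terms []string) *StreamChecker {
	maxLen := 0
	for _, t := range terms {
		if len(t) > maxLen {
			maxLen = len(t)
		}
	}
	return &StreamChecker{terms: terms, maxLen: maxLen}
}

// Check returns false as soon as the text seen so far contains a blocked term.
func (c *StreamChecker) Check(chunk string) bool {
	window := c.carry + chunk
	for _, t := range c.terms {
		if strings.Contains(window, t) {
			return false
		}
	}
	// A term that straddles this chunk and the next starts within the last
	// maxLen-1 bytes, so that is all the state we need to keep.
	keep := c.maxLen - 1
	if keep < 0 {
		keep = 0
	}
	if len(window) > keep {
		window = window[len(window)-keep:]
	}
	c.carry = window
	return true
}

func main() {
	c := NewStreamChecker([]string{"badword"})
	fmt.Println(c.Check("this is a bad")) // prints "true"
	fmt.Println(c.Check("word here"))     // prints "false"
}
```

In this sketch the first half of a split term may already have reached the client. A gateway that must never leak a match would hold back the last `maxLen-1` bytes of each chunk until the next chunk has been checked.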

Streaming Support via Wasm Plugins

Envoy's Wasm extension model lets plugins process request and response bodies as a stream of chunks, reducing memory pressure compared with Nginx's buffered approach. Example streaming request‑body handler from a Higress Go Wasm plugin:

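```go
// Streaming callback: invoked once per request-body chunk; the returned bytes are forwarded.
func onHttpRequestBody(ctx wrapper.HttpContext, config Config, chunk []byte, isLastChunk bool, log wrapper.Log) []byte {
	// Log each chunk as it arrives instead of buffering the whole body.
	log.Infof("receive request body chunk:%s, isLastChunk:%v", chunk, isLastChunk)
	return chunk // pass the chunk through unchanged
}
```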

Massive Multi‑Tenant and Domain Management

Higress shards routing rules at the domain level, supporting tens of thousands of domains. New routes become active within seconds, and on‑demand loading keeps only active domain configurations in memory, dramatically lowering resource consumption.
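On‑demand loading can be pictured as a bounded cache of per‑domain configurations. Each configuration is fetched on first use, and the least recently used domain is evicted when the cache is full. The sketch below uses an LRU policy with a hypothetical loader and capacity, and it does not describe Higress internals. It is not safe for concurrent use without a mutex.

```go
package main

import (
	"container/list"
	"fmt"
)

// DomainCache keeps route configs only for recently requested domains.
type DomainCache struct {
	capacity int
	load     func(domain string) string // hypothetical config fetcher
	order    *list.List                 // front = most recently used
	items    map[string]*list.Element   // domain -> element holding *entry
}

type entry struct {
	domain string
	config string
}

func NewDomainCache(capacity int, load func(string) string) *DomainCache {
	return &DomainCache{capacity: capacity, load: load, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns the config for domain, loading it on demand.
func (c *DomainCache) Get(domain string) string {
	if el, ok := c.items[domain]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).config
	}
	e := &entry{domain: domain, config: c.load(domain)}
	c.items[domain] = c.order.PushFront(e)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).domain)
	}
	return e.config
}

// Len reports how many domain configs are currently held in memory.
func (c *DomainCache) Len() int { return c.order.Len() }

func main() {
	loads := 0
	cache := NewDomainCache(2, func(d string) string {
		loads++
		return "routes-for-" + d // stand-in for fetching the domain's routes
	})
	cache.Get("a.example.com")
	cache.Get("b.example.com")
	cache.Get("a.example.com")      // hit, no load
	cache.Get("c.example.com")      // evicts b.example.com
	fmt.Println(cache.Len(), loads) // prints "2 3"
}
```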

Supported LLM Providers

Higress unifies access to a wide range of models, including Tongyi Qianwen, OpenAI/Azure OpenAI, Baichuan, Zhipu AI, Anthropic Claude, DeepSeek, Ollama and others, providing consistent API, content moderation and A/B testing across providers.

Enterprise AI Gateway Use Cases

Cost‑sharing: Token‑level observability enables per‑department billing.

Stability: Automatic failover to backup models when the primary model is unavailable.

Cost reduction: Vector‑based caching of similar requests lowers API spend.

Access control: Rate limits per employee manage overall consumption.

Content safety: Centralized filtering prevents leakage of sensitive data.

In this architecture the gateway functions as an enterprise‑service‑bus (ESB)‑like layer governing all AI‑driven traffic within an organization.

Future Outlook

As AI agents become primary consumers of web content, APIs will become first‑class citizens. Clear API design, versioning and declarative “operable capabilities” will be essential for AI to understand and act on web pages, driving a new wave of AI‑centric internet evolution.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
