How an AI Gateway Scales LLM Services: Architecture, Auth, Quotas, and Load Balancing
This article explains the design of an AI gateway that centralizes LLM access, detailing its background, overall architecture, authentication, quota management, multi‑model routing, load‑balancing strategies, multi‑tenant isolation, observability features, and the supported API protocols for enterprise integration.
01 Background
With the rapid growth of AI technology, businesses increasingly demand AI capabilities. When AI services must handle massive request volumes and high concurrency, an AI gateway becomes essential for coordinating those requests, ensuring system stability and efficiency. By consolidating functions such as traffic control, authentication, quota billing, load balancing, and API routing at the gateway layer, overall system complexity is reduced and maintainability is improved.
02 AI Gateway Overview
The AI gateway provides a unified entry point for large language model (LLM) services, supporting multiple vendors, multiple models, and load‑balanced dispatch. It also offers unified authentication, token‑quota management, security auditing, and observability to guarantee safe and stable API calls. The load‑balancing module can route requests based on provider, model, and API key, making it suitable for multi‑model and multi‑tenant scenarios.
Overall Architecture
The AI gateway’s architecture mirrors a traditional API gateway, with almost identical designs on the data plane and control plane.
Derived from an existing API gateway, the AI gateway adds AI‑specific optimizations such as buffer‑less request proxying, mixed domain‑based and service‑discovery scheduling, and graceful termination of long‑running AI requests.
The control plane pushes updated gateway configurations to data‑plane nodes in near real‑time. When a data‑plane node detects a configuration change, it dynamically switches its proxy engine while allowing existing requests to finish under the old logic.
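The hot-swap described above can be sketched as an atomic config-snapshot swap: new requests pick up the freshly pushed configuration while in-flight requests keep the snapshot they started with. All names here are illustrative, not the gateway's actual interfaces.

```python
import threading

class ConfigHolder:
    """Holds the current proxy configuration; supports hot swapping."""

    def __init__(self, config: dict):
        self._config = config
        self._lock = threading.Lock()

    def swap(self, new_config: dict):
        with self._lock:
            self._config = new_config   # requests started after this see the new config

    def snapshot(self) -> dict:
        with self._lock:
            return self._config         # a request holds its snapshot for its lifetime

holder = ConfigHolder({"version": 1, "routes": ["old"]})
in_flight = holder.snapshot()           # a long-running request started under v1
holder.swap({"version": 2, "routes": ["new"]})  # control-plane push arrives
```

Because the in-flight request keeps its own snapshot, it finishes under the old routing logic exactly as the text describes, while `holder.snapshot()` now returns the new configuration for fresh requests.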
Two filter abstractions exist on the data plane:
Request filter: operates on the raw user request, handling authentication, rate limiting, and similar concerns.
Model filter: operates after the request has been routed to a specific model, handling model-specific compatibility logic (e.g., processing <think> tags or custom inference parameters).
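The two-stage filter pipeline can be sketched as follows; function names, the routing step, and the request shape are illustrative assumptions, not the gateway's real API.

```python
from typing import Callable, Dict, List

# A filter takes a request dict and returns it (possibly modified),
# or raises to reject the request.
Filter = Callable[[Dict], Dict]

def auth_filter(req: Dict) -> Dict:
    # Request filter: runs on the raw user request, before routing.
    if not req.get("api_key"):
        raise PermissionError("missing API key")
    return req

def think_tag_filter(req: Dict) -> Dict:
    # Model filter: runs after a model is chosen; here it strips
    # model-specific <think> markers as a compatibility fix.
    req["prompt"] = req["prompt"].replace("<think>", "").replace("</think>", "")
    return req

def run_pipeline(req: Dict, request_filters: List[Filter],
                 model_filters: List[Filter]) -> Dict:
    for f in request_filters:          # stage 1: auth, rate limiting, ...
        req = f(req)
    req["model_node"] = "upstream-a"   # routing decision happens between stages
    for f in model_filters:            # stage 2: model-specific compatibility
        req = f(req)
    return req

out = run_pipeline(
    {"api_key": "k1", "prompt": "<think>plan</think>translate this"},
    [auth_filter],
    [think_tag_filter],
)
```

The key design point is ordering: request filters see only the user's raw request, while model filters run with knowledge of which upstream model was selected.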
The control plane also exposes an OpenAPI for model providers to register models, set RPM/TPM limits, and authorize API keys for specific models.
Authentication
The gateway adopts the widely used OpenAI-compatible API-key scheme: each request carries its key in the Authorization header.

Authorization: Bearer <YOUR_API_KEY>

Fine-grained permission control allows each API key to be restricted to a specific set of models and given its own quotas. API-key lifetimes can be configured to expire or remain permanent.
Quota Management
The quota system follows OpenAI’s RPM (requests per minute) and TPM (tokens per minute) concepts. Quotas can be assigned per user or per model, and monthly token budgets can be set to prevent cost overruns.
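An RPM/TPM check can be illustrated with a fixed-window counter; this is a simplification (production gateways typically use sliding windows backed by a shared store such as Redis), and all limits here are example values.

```python
import time

class MinuteQuota:
    """Fixed-window requests-per-minute and tokens-per-minute counter."""

    def __init__(self, rpm: int, tpm: int):
        self.rpm, self.tpm = rpm, tpm
        self.window = None      # current minute bucket
        self.requests = 0
        self.tokens = 0

    def allow(self, tokens: int, now: float = None) -> bool:
        now = time.time() if now is None else now
        window = int(now // 60)
        if window != self.window:          # new minute: reset both counters
            self.window, self.requests, self.tokens = window, 0, 0
        if self.requests + 1 > self.rpm or self.tokens + tokens > self.tpm:
            return False                   # over quota: reject
        self.requests += 1
        self.tokens += tokens
        return True

q = MinuteQuota(rpm=2, tpm=100)
```

A monthly token budget would sit on top of this as a second, longer-horizon counter consulted before the per-minute check.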
Multi‑Model Access
Currently the gateway forwards only OpenAI‑compatible API calls. Based on the requested model name, the gateway selects the optimal upstream provider (internal IDC or public cloud), substitutes the appropriate API key and upstream domain, and performs load balancing.
For internally hosted models, service‑discovery mechanisms locate inference nodes and route requests directly.
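The model-name-based selection described above amounts to a routing table that substitutes the upstream address and API key per model; providers, domains, and keys below are invented for illustration.

```python
# Illustrative routing table: model name -> provider, upstream, credential.
ROUTES = {
    "index":  {"provider": "internal-idc",
               "upstream": "discovery://infra.llm.index",   # service discovery
               "api_key": "internal-key"},
    "gpt-4o": {"provider": "public-cloud",
               "upstream": "https://api.openai.com",
               "api_key": "cloud-key"},
}

def route(model_name: str) -> dict:
    """Pick the upstream for a model and substitute its key and domain."""
    try:
        r = ROUTES[model_name]
    except KeyError:
        raise ValueError(f"unknown model: {model_name}")
    return {"url": r["upstream"], "authorization": f"Bearer {r['api_key']}"}
```

Internally hosted models resolve through a discovery scheme rather than a fixed domain, which is why the table can mix `discovery://` and `https://` upstreams.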
Model Load Balancing
LLM inference differs from traditional APIs: each request incurs unpredictable compute time and GPU usage, making classic RPS‑ or latency‑based balancing ineffective. The gateway’s default strategy evaluates each model node’s token throughput and latency in a black‑box manner to estimate saturation. Additional metrics from the inference engine and GPU queues are also considered. Prefix‑cache‑based node selection and external load‑balancing plugins (via RPC) are supported.
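One way to sketch the black-box saturation estimate: score each node by how close its observed token throughput and latency are to their known ceilings, and pick the node with the most headroom. The metrics and formula here are illustrative; the gateway's actual scoring also folds in inference-engine and GPU-queue signals.

```python
def saturation(node: dict) -> float:
    """Estimate saturation from observed throughput and latency (0 = idle)."""
    throughput_load = node["tokens_per_sec"] / node["peak_tokens_per_sec"]
    latency_load = node["latency_ms"] / node["baseline_latency_ms"]
    # Take the worse of the two signals: either one rising means
    # the node is closer to saturated.
    return max(throughput_load, latency_load)

def pick_node(nodes: list) -> dict:
    """Route to the least-saturated node."""
    return min(nodes, key=saturation)

nodes = [
    {"name": "a", "tokens_per_sec": 900, "peak_tokens_per_sec": 1000,
     "latency_ms": 300, "baseline_latency_ms": 250},
    {"name": "b", "tokens_per_sec": 400, "peak_tokens_per_sec": 1000,
     "latency_ms": 260, "baseline_latency_ms": 250},
]
```

This is why classic RPS-based balancing fails for LLMs: node "a" might serve fewer requests than "b" yet be far more loaded, because each request's token count and compute time vary wildly.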
Multi‑Tenant Isolation
Tenants access the gateway using domain + API key . Different domains can be configured to route to distinct model providers, achieving business‑level isolation.
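A minimal sketch of tenant resolution, assuming the (domain, API key) pair is the isolation unit; the domains, keys, and provider names are hypothetical.

```python
# Illustrative tenant routing: the (domain, api_key) pair decides which
# model provider serves the request, isolating business lines.
TENANT_ROUTES = {
    ("team-a.gw.example.com", "key-a"): "provider-internal",
    ("team-b.gw.example.com", "key-b"): "provider-cloud",
}

def resolve_tenant(domain: str, api_key: str) -> str:
    provider = TENANT_ROUTES.get((domain, api_key))
    if provider is None:
        raise PermissionError("unknown tenant")
    return provider
```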
Observability
Key dimensions—Gateway, Domain, Consumer, Provider, UserModel, UpstreamModel—are monitored for availability, QPS, latency, 5xx errors, and quota usage.
03 API Business Scenarios and Integration
The gateway adopts the OpenAI protocol as its base, exposing four main API families:
Chat Completion (CHAT_COMPLETION): Supports multi-turn, context-aware conversations.
Embedding (EMBEDDING): Converts text into high-dimensional vectors for retrieval and knowledge-management tasks.
Chat Template (CHAT_TEMPLATE): Allows predefined prompt templates with optional built-in functions (e.g., len(v), jsonify(v), make_json_object(...), slice_to_index_map(v, startBy)) to generate structured responses.
Model Context Protocol (MCP): An open protocol introduced by Anthropic in 2024 that standardizes how LLMs connect to external data sources and tools.
Chat Template Example
- path: /v1/reply-to-en
  protocol: HTTP
  timeout: 300s
  middlewares:
    - name: v1_chat_template
      options:
        '@type': type.googleapis.com/infra.gateway.middleware.llm.v1.contrib.ChatTemplateConfig
        provider: bilibili
        model_name: index
        prompt_template: |
          Your task: translate each comment from a B-site video into English. Input is a JSON map where keys are indices and values are comments. Return a JSON map with the same keys and the translated values. Preserve image placeholders {dyn:xxx} and emoji placeholders [xxx] unchanged.
          Input: {{ jsonify (slice_to_index_map .reply_list 1) }}
          Output:

Prompt templates enable efficient LLM integration for tasks such as translation, QA, or arithmetic reasoning without requiring users to master prompt engineering.
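To make the template above concrete, here are plausible Python equivalents of two of its built-in helpers. The semantics are inferred from the example, not taken from the gateway's source: slice_to_index_map is assumed to number the items of a list starting at startBy, and jsonify to serialize the result as JSON.

```python
import json

def slice_to_index_map(items, start_by=0):
    """Map a list to {index: item}, numbering from start_by (assumed behavior)."""
    return {str(i): v for i, v in enumerate(items, start=start_by)}

def jsonify(value):
    """Serialize to JSON, keeping non-ASCII characters readable."""
    return json.dumps(value, ensure_ascii=False)

# What {{ jsonify (slice_to_index_map .reply_list 1) }} would render
# for a two-comment reply list:
payload = jsonify(slice_to_index_map(["great video", "lol"], 1))
```

The indexed-map shape matters because the model is asked to return a JSON object with the same keys, which lets the gateway match each translation back to its original comment.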
Model Context Protocol (MCP)
MCP standardizes resources, prompts, and tools so that LLM clients can uniformly access files, databases, or external APIs.
- path: /example-mcp/*
  protocol: HTTP
  timeout: 300s
  middlewares:
    - name: v1_mcp_server
      options:
        '@type': type.googleapis.com/infra.gateway.middleware.llm.v1.contrib.MCPServerConfig
        proxy:
          name: example-mcp
          upstreams:
            - url: 'discovery://infra.example.example-mcp'

Another example shows a logging service with JSON-RPC arguments and request/response templates.
- path: /logging-mcp/*
  protocol: HTTP
  timeout: 300s
  middlewares:
    - name: v1_mcp_server
      options:
        '@type': type.googleapis.com/infra.gateway.middleware.llm.v1.contrib.MCPServerConfig
        apiOrchestrator:
          server:
            name: logging-mcp
            tools:
              - name: query-logs
                description: Retrieve logs for a given environment and app ID
                args:
                  - name: env
                    description: Deployment environment
                    type: string
                    default_value: "uat"
                    position: query
                  - name: appid
                    description: Application ID
                    type: string
                    required: true
                    position: query
                  - name: level
                    description: Log level
                    enum_values: [DEBUG, INFO, WARN, ERROR]
                    type: string
                    required: true
                    position: query
                  - name: keyword
                    description: Search keyword
                    type: string
                    required: true
                    position: query
                request_template:
                  upstream:
                    url: http://api.example.com/logging/query?env={{ .env }}&appid={{ .appid }}&level={{ .level }}&keyword={{ .keyword }}
                  method: GET
                response_template:
                  body: '{{ . }}'

04 Enterprise MCP Marketplace and API Integration
The MCP marketplace acts as an internal “App Store” where teams publish MCP services. Once a service is registered in the MCP gateway, other business units can consume it via a unified domain (e.g., https://mcp.example.com/logging-mcp).
Two primary MCP endpoints are provided:
/sse: a Server-Sent Events long-lived connection for real-time resource-change notifications.
/message: a JSON-RPC endpoint for request/response interactions.
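A client invoking the query-logs tool over the /message endpoint would send a JSON-RPC 2.0 body like the one built below; the method and params shape follow MCP's tools/call convention, while the tool arguments mirror the logging-mcp config above.

```python
import json

def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 body for an MCP tools/call request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

body = make_tool_call(1, "query-logs", {"env": "uat", "appid": "demo",
                                        "level": "ERROR", "keyword": "timeout"})
```

On receipt, the gateway's apiOrchestrator maps these arguments into the configured request_template's query string and returns the upstream response through the response_template.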
05 Summary
The AI gateway unifies access, authentication, quota management, and model dispatch, delivering efficient, secure, and customizable connectivity for large models. By supporting OpenAI‑compatible protocols, chat‑template interfaces, and the MCP marketplace, it greatly simplifies enterprise AI integration and resource sharing.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.