How an AI Gateway Scales LLM Services: Architecture, Auth, Quotas, and Load Balancing

This article explains the design of an AI gateway that centralizes LLM access, detailing its background, overall architecture, authentication, quota management, multi‑model routing, load‑balancing strategies, multi‑tenant isolation, observability features, and the supported API protocols for enterprise integration.


01 Background

With the rapid growth of AI technology, businesses increasingly demand AI capabilities. When AI services must handle massive request volumes and high concurrency, an AI gateway becomes essential for coordinating those requests, ensuring system stability and efficiency. By consolidating functions such as traffic control, authentication, quota billing, load balancing, and API routing at the gateway layer, overall system complexity is reduced and maintainability is improved.

02 AI Gateway Overview

The AI gateway provides a unified entry point for large language model (LLM) services, supporting multiple vendors, multiple models, and load‑balanced dispatch. It also offers unified authentication, token‑quota management, security auditing, and observability to guarantee safe and stable API calls. The load‑balancing module can route requests based on provider, model, and API key, making it suitable for multi‑model and multi‑tenant scenarios.

Overall Architecture

The AI gateway’s architecture mirrors a traditional API gateway, with almost identical designs on the data plane and control plane.

Derived from an existing API gateway, the AI gateway adds AI‑specific optimizations such as buffer‑less request proxying, mixed domain‑based and service‑discovery scheduling, and graceful termination of long‑running AI requests.

The control plane pushes updated gateway configurations to data‑plane nodes in near real‑time. When a data‑plane node detects a configuration change, it dynamically switches its proxy engine while allowing existing requests to finish under the old logic.
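A minimal sketch of that switch-and-drain behavior, assuming each request binds to one engine at its start; the class and function names here are illustrative, not the gateway's actual implementation:

import threading

class ProxyEngine:
    """Holds one immutable configuration snapshot plus its in-flight count."""
    def __init__(self, config):
        self.config = config
        self.inflight = 0
        self.lock = threading.Lock()

current_engine = ProxyEngine(config={"version": 1})

def on_config_push(new_config):
    """Called when the control plane pushes a change: new requests see the
    new engine; the old one drains as its remaining requests finish."""
    global current_engine
    current_engine = ProxyEngine(new_config)

def handle(request):
    engine = current_engine                    # bind the engine once, at request start
    with engine.lock:
        engine.inflight += 1
    try:
        return proxy(request, engine.config)   # proxy under the bound config
    finally:
        with engine.lock:
            engine.inflight -= 1

def proxy(request, config):
    ...                                        # placeholder for the forwarding logic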

Two filter abstractions exist on the data plane, as sketched in the code after this list:

Request filter: operates on the raw user request, handling authentication, rate limiting, etc.

Model filter: operates after the request is forwarded to a specific model, handling model-specific compatibility logic (e.g., processing <think> tags or custom inference parameters).
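A minimal sketch of the two stages, assuming a simple filter interface; the class names and request shape are illustrative, not the gateway's actual API:

from abc import ABC, abstractmethod

class RequestFilter(ABC):
    """Runs on the raw user request, before an upstream model is chosen."""
    @abstractmethod
    def apply(self, request: dict) -> dict: ...

class ModelFilter(ABC):
    """Runs after the request is bound to a specific upstream model."""
    @abstractmethod
    def apply(self, request: dict, model: str) -> dict: ...

class AuthFilter(RequestFilter):
    def apply(self, request):
        auth = request.get("headers", {}).get("Authorization", "")
        if not auth.startswith("Bearer "):
            raise PermissionError("missing or malformed API key")
        return request

class StripThinkTags(ModelFilter):
    def apply(self, request, model):
        # Model-specific compatibility logic, e.g. removing <think> markers
        # for upstream models that do not understand them.
        for msg in request.get("messages", []):
            msg["content"] = msg["content"].replace("<think>", "").replace("</think>", "")
        return request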

The control plane also exposes an OpenAPI for model providers to register models, set RPM/TPM limits, and authorize API keys for specific models.
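The source does not document this OpenAPI's shape; a hypothetical registration call might look like the following, where the endpoint and every field name are invented for illustration:

import requests

resp = requests.post(
    "https://gateway-control.example.com/openapi/v1/models",  # hypothetical endpoint
    headers={"Authorization": "Bearer PROVIDER_ADMIN_TOKEN"},
    json={
        "provider": "bilibili",
        "model_name": "index",
        "limits": {"rpm": 600, "tpm": 200_000},             # requests / tokens per minute
        "authorized_api_keys": ["sk-team-a", "sk-team-b"],  # keys allowed to call this model
    },
)
resp.raise_for_status()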

Authentication

The gateway adopts the widely used OpenAI-compatible API-key scheme, with the key passed in the standard header:

Authorization: Bearer <YOUR_API_KEY>

Fine-grained permission control allows each API key to be restricted to a specific set of models and to carry its own quotas. API keys can be configured to expire or to remain valid indefinitely.
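For example, a request authenticated with this scheme (the base URL and model name below are illustrative):

import requests

resp = requests.post(
    "https://ai-gateway.example.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "index", "messages": [{"role": "user", "content": "ping"}]},
    timeout=300,
)
resp.raise_for_status()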

Quota Management

The quota system follows OpenAI’s RPM (requests per minute) and TPM (tokens per minute) concepts. Quotas can be assigned per user or per model, and monthly token budgets can be set to prevent cost overruns.
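A minimal sketch of per-key RPM/TPM accounting with a fixed one-minute window; the gateway's actual bookkeeping (sliding windows, distributed counters, monthly budgets) is not described in the source:

import time
from collections import defaultdict

class MinuteQuota:
    """Fixed one-minute window tracking both request and token counts."""
    def __init__(self, rpm: int, tpm: int):
        self.rpm, self.tpm = rpm, tpm
        self.window = None       # current minute bucket
        self.requests = 0
        self.tokens = 0

    def admit(self, estimated_tokens: int) -> bool:
        minute = int(time.time()) // 60
        if minute != self.window:                 # new minute: reset the counters
            self.window, self.requests, self.tokens = minute, 0, 0
        if self.requests + 1 > self.rpm or self.tokens + estimated_tokens > self.tpm:
            return False                          # over quota: reject with HTTP 429
        self.requests += 1
        self.tokens += estimated_tokens
        return True

quotas = defaultdict(lambda: MinuteQuota(rpm=60, tpm=100_000))
assert quotas["sk-team-a"].admit(estimated_tokens=512)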

Multi‑Model Access

Currently the gateway forwards only OpenAI‑compatible API calls. Based on the requested model name, the gateway selects the optimal upstream provider (internal IDC or public cloud), substitutes the appropriate API key and upstream domain, and performs load balancing.

For internally hosted models, service‑discovery mechanisms locate inference nodes and route requests directly.
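A sketch of the name-based routing described above; the providers, upstream addresses, and keys are fabricated for illustration:

ROUTES = {
    # Internally hosted model, located through service discovery.
    "index": {"provider": "internal-idc",
              "upstream": "discovery://infra.llm.index",
              "api_key": None},
    # Public-cloud model, reached through the vendor domain with a substituted key.
    "gpt-4o": {"provider": "public-cloud",
               "upstream": "https://api.openai.com",
               "api_key": "sk-upstream-key"},
}

def route(model: str) -> dict:
    try:
        return ROUTES[model]
    except KeyError:
        raise ValueError(f"unknown model: {model}")  # surfaced to the caller as a 4xx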

Model Load Balancing

LLM inference differs from traditional APIs: each request incurs unpredictable compute time and GPU usage, making classic RPS‑ or latency‑based balancing ineffective. The gateway’s default strategy evaluates each model node’s token throughput and latency in a black‑box manner to estimate saturation. Additional metrics from the inference engine and GPU queues are also considered. Prefix‑cache‑based node selection and external load‑balancing plugins (via RPC) are supported.
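A sketch of such a black-box saturation estimate; the scoring formula and weights below are assumptions, not the gateway's documented strategy:

from dataclasses import dataclass

@dataclass
class NodeStats:
    tokens_per_sec: float       # recently observed decode throughput
    max_tokens_per_sec: float   # estimated ceiling for this node
    p95_latency_s: float        # recent p95 latency

def saturation(n: NodeStats) -> float:
    """Blend throughput utilization and latency into one saturation score."""
    utilization = n.tokens_per_sec / n.max_tokens_per_sec
    latency_penalty = min(n.p95_latency_s / 10.0, 1.0)   # normalize against a 10 s budget
    return 0.7 * utilization + 0.3 * latency_penalty     # weights are arbitrary

def pick(nodes: list[NodeStats]) -> NodeStats:
    return min(nodes, key=saturation)   # route to the least-saturated node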

Multi‑Tenant Isolation

Tenants access the gateway using domain + API key. Different domains can be configured to route to distinct model providers, achieving business-level isolation.
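A sketch of domain-based tenant routing under that scheme; the domains and provider names are made up:

TENANT_ROUTES = {
    "ai-team-a.example.com": ["internal-idc"],
    "ai-team-b.example.com": ["public-cloud"],
}

def providers_for(domain: str) -> list[str]:
    """Resolve the provider pool by domain; the API key is then checked
    against the models exposed by those providers."""
    return TENANT_ROUTES.get(domain, [])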

Observability

Key dimensions—Gateway, Domain, Consumer, Provider, UserModel, UpstreamModel—are monitored for availability, QPS, latency, 5xx errors, and quota usage.
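As an illustration, those dimensions map naturally onto metric labels; the metric and label names below are assumptions, shown with the Prometheus Python client:

from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "gateway_requests_total",
    "LLM requests through the gateway",
    ["gateway", "domain", "consumer", "provider", "user_model", "upstream_model", "code"],
)
LATENCY = Histogram(
    "gateway_request_seconds",
    "End-to-end request latency",
    ["provider", "upstream_model"],
)

# Record one successful request on one edge node.
REQUESTS.labels("edge-1", "ai-team-a.example.com", "sk-team-a",
                "internal-idc", "index", "index", "200").inc()
LATENCY.labels("internal-idc", "index").observe(1.8)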

03 API Business Scenarios and Integration

The gateway adopts the OpenAI protocol as its base, exposing four main API families:

Chat Completion (CHAT_COMPLETION): Supports multi-turn, context-aware conversations (see the call sketch after this list).

Embedding (EMBEDDING): Converts text into high-dimensional vectors for retrieval and knowledge-management tasks.

Chat Template (CHAT_TEMPLATE): Allows predefined prompt templates with optional built-in functions (e.g., len(v), jsonify(v), make_json_object(...), slice_to_index_map(v, startBy)) to generate structured responses.

Model Context Protocol (MCP): An open protocol introduced by Anthropic in 2024 that standardizes how LLMs connect to external data sources and tools.
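As a reference point for the Chat Completion family, a call through the gateway is an ordinary OpenAI-compatible client call pointed at the gateway's base URL (the URL and model name are illustrative):

from openai import OpenAI

client = OpenAI(
    base_url="https://ai-gateway.example.com/v1",  # the gateway, not the vendor
    api_key="YOUR_API_KEY",
)
resp = client.chat.completions.create(
    model="index",
    messages=[{"role": "user", "content": "Summarize what an AI gateway does."}],
)
print(resp.choices[0].message.content)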

Chat Template Example

- path: /v1/reply-to-en
  protocol: HTTP
  timeout: 300s
  middlewares:
  - name: v1_chat_template
    options:
      '@type': type.googleapis.com/infra.gateway.middleware.llm.v1.contrib.ChatTemplateConfig
      provider: bilibili
      model_name: index
      prompt_template: |
        Your task: translate each comment from a B‑site video into English. Input is a JSON map where keys are indices and values are comments. Return a JSON map with the same keys and translated values. Preserve image placeholders {dyn:xxx} and emoji placeholders [xxx] unchanged.
        Input: {{ jsonify (slice_to_index_map .reply_list 1) }}
        Output:
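A hypothetical call to the templated endpoint above; the request body shape (a reply_list field feeding the template variables) is inferred from the template, not documented in the source:

import requests

resp = requests.post(
    "https://ai-gateway.example.com/v1/reply-to-en",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"reply_list": ["这个视频太棒了 [doge]", "前排打卡 {dyn:pic}"]},
    timeout=300,
)
print(resp.json())   # a JSON map of index -> translated comment, per the template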

Prompt templates enable efficient LLM integration for tasks such as translation, QA, or arithmetic reasoning without requiring users to master prompt engineering.

Model Context Protocol (MCP)

MCP standardizes resources, prompts, and tools so that LLM clients can uniformly access files, databases, or external APIs.

- path: /example-mcp/*
  protocol: HTTP
  timeout: 300s
  middlewares:
  - name: v1_mcp_server
    options:
      '@type': type.googleapis.com/infra.gateway.middleware.llm.v1.contrib.MCPServerConfig
      proxy:
        name: example-mcp
        upstreams:
        - url: 'discovery://infra.example.example-mcp'

Another example wraps an existing HTTP logging API as an MCP tool, declaring typed arguments plus request and response templates.

- path: /logging-mcp/*
  protocol: HTTP
  timeout: 300s
  middlewares:
  - name: v1_mcp_server
    options:
      '@type': type.googleapis.com/infra.gateway.middleware.llm.v1.contrib.MCPServerConfig
      apiOrchestrator:
        server:
          name: logging-mcp
        tools:
        - name: query-logs
          description: Retrieve logs for a given environment and app ID
          args:
          - name: env
            description: Deployment environment
            type: string
            default_value: "uat"
            position: query
          - name: appid
            description: Application ID
            type: string
            required: true
            position: query
          - name: level
            description: Log level
            enum_values: [DEBUG, INFO, WARN, ERROR]
            type: string
            required: true
            position: query
          - name: keyword
            description: Search keyword
            type: string
            required: true
            position: query
          request_template:
            upstream:
              url: http://api.example.com/logging/query?env={{ .env }}&appid={{ .appid }}&level={{ .level }}&keyword={{ .keyword }}
            method: GET
          response_template:
            body: '{{ . }}'
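Once registered, the tool above could be invoked with MCP's JSON-RPC tools/call method; the URL, argument values, and session handling below are simplified for illustration:

import requests

resp = requests.post(
    "https://mcp.example.com/logging-mcp/message",
    json={
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "query-logs",
            "arguments": {"appid": "main.web", "level": "ERROR", "keyword": "timeout"},
        },
    },
    timeout=300,
)
print(resp.json())   # JSON-RPC result, rendered through response_template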

04 Enterprise MCP Marketplace and API Integration

The MCP marketplace acts as an internal “App Store” where teams publish MCP services. Once a service is registered in the MCP gateway, other business units can consume it via a unified domain (e.g., https://mcp.example.com/logging-mcp).

Two primary MCP endpoints are provided:

/sse: a Server-Sent Events long-lived connection for real-time resource change notifications.

/message: a JSON-RPC endpoint for request/response interactions.
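A sketch of consuming the event stream (the JSON-RPC side is shown in the tools/call example above); parsing follows standard SSE data: framing, and the assumption that each event carries a JSON payload:

import json
import requests

with requests.get("https://mcp.example.com/logging-mcp/sse", stream=True) as events:
    for raw in events.iter_lines():
        if raw.startswith(b"data:"):
            payload = raw[len(b"data:"):].strip()
            print(json.loads(payload))   # e.g. a resource change notification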

05 Summary

The AI gateway unifies access, authentication, quota management, and model dispatch, delivering efficient, secure, and customizable connectivity for large models. By supporting OpenAI‑compatible protocols, chat‑template interfaces, and the MCP marketplace, it greatly simplifies enterprise AI integration and resource sharing.
