Cloud Native 11 min read

How to Build an AI‑Native API Gateway with Higress: ChatGPT‑Next‑Web, RAG, Token Limits & More

This guide walks through creating a full‑featured AI‑native API gateway using Higress, covering architecture setup, AI agent integration, observability, content security, token rate limiting, caching, retrieval‑augmented generation, prompt templates, and intelligent request/response transformation with concrete configuration examples.

Alibaba Cloud Native

Aug 2, 2024

How to Build an AI‑Native API Gateway with Higress: ChatGPT‑Next‑Web, RAG, Token Limits & More

Overview

The article shows how to use Higress, an open‑source API gateway, to create an AI‑native gateway that routes requests to large language model (LLM) providers such as ChatGPT‑Next‑Web and Alibaba Cloud Qwen. The gateway can expose an OpenAI‑compatible endpoint, balance traffic across providers, and add observability, security, rate‑limiting, caching, retrieval‑augmented generation (RAG), prompt engineering, and request/response transformation capabilities.

AI Agent Plugin

Configuring the AI agent plugin enables multi‑provider load balancing and token‑based rate limiting. Example configuration for the Qwen provider:

provider:
  type: qwen
  apiTokens:
    - sk-xxxxxxxxxxxxxxxxxxxxxx
  timeout: 1200000
  modelMapping:
    'gpt-3.5-turbo': qwen-turbo
    'gpt-4': qwen-max
    '*': qwen-max

The plugin visualises request flow and presents an OpenAI‑compatible endpoint.

AI Observability Plugin

When enabled on the llm route, the observability plugin records token usage per route, service, and model, feeding the data into Higress telemetry for fine‑grained monitoring.

AI Content Security Plugin

This plugin integrates Alibaba Cloud Content Security to filter harmful or non‑compliant model outputs. After enabling the plugin on the llm route, each response is inspected and blocked if it violates policy.

serviceSource: dns
serviceName: green-cip
servicePort: 443
domain: green-cip.cn-hangzhou.aliyuncs.com
ak: xxxxxxxxxxxxxxxxx
sk: xxxxxxxxxxxxxxxxx

AI Token Rate‑Limiting Plugin

The ai-token-ratelimit plugin enforces per‑IP token quotas using a Redis store. The example limits each IP to 100 tokens per minute and returns HTTP 429 when the limit is exceeded.

rule_name: default_rule
rule_items:
  - limit_by_per_ip: from-remote-addr
    limit_keys:
      - key: 0.0.0.0/0
        token_per_minute: 100
redis:
  service_name: redis.static
  service_port: 6379
  username: xxxxxx
  password: xxxxxx
rejected_code: 429
rejected_msg: 您的请求频率过高，请稍后再试。

AI Cache Plugin

The cache plugin stores LLM responses in Redis. Identical subsequent requests are served instantly from cache, reducing latency and cost.

redis:
  serviceName: redis.static
  servicePort: 6379
  timeout: 2000
  username: xxxxxx
  password: xxxxxx

AI Retrieval‑Augmented Generation (RAG) Plugin

RAG combines LLM generation with vector search from Alibaba Cloud Vector Retrieval Service, allowing the gateway to supplement model responses with up‑to‑date knowledge.

dashscope:
  apiKey: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
  serviceName: qwen
  servicePort: 443
  domain: dashscope.aliyuncs.com
dashvector:
  apiKey: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
  serviceName: dashvector
  servicePort: 443
  domain: vrs-cn-xxxxxxxxxxxxxx.dashvector.cn-hangzhou.aliyuncs.com
  collection: xxxxxxxxxxxxxx

Prompt Engineering Plugins

Two plugins support prompt templates and decorators. Templates let users define reusable request bodies; decorators can prepend or append messages to any request.

templates:
- name: "developer-chat"
  template:
    model: gpt-3.5-turbo
    messages:
    - role: system
      content: "你是一个 {{program}} 专家, 你平时使用的编程语言为 {{language}}"
    - role: user
      content: "帮我写一个 {{program}} 程序, 你的返回结果里面应该只包含python代码"

prepend:
- role: system
  content: "请使用英语回答问题."
append:
- role: user
  content: "每次回答完问题，尝试进行反问"

Intelligent Request/Response Transformation Plugin

The plugin can modify inbound requests or outbound responses, e.g., converting XML responses to JSON and adjusting headers.

response:
  enable: true
  prompt: "帮我修改以下HTTP应答信息，要求：1. content-type修改为application/json；2. body由xml转化为json；3. 移除content-length。"
provider:
  serviceName: qwen
  domain: dashscope.aliyuncs.com
  apiKey: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx

When applied to an

httpbin

/xml

endpoint, the plugin returns a JSON representation of the original XML.

References

https://github.com/ChatGPTNextWeb/ChatGPT-Next-Web

https://help.aliyun.com/zh/mse/user-guide/ai-agent

https://help.aliyun.com/zh/mse/user-guide/ai-observable

https://help.aliyun.com/zh/mse/user-guide/ai-content-security

https://help.aliyun.com/zh/mse/user-guide/ai-token-current-limiting

https://help.aliyun.com/zh/mse/user-guide/ai-cache

https://help.aliyun.com/zh/mse/user-guide/ai-rag

https://help.aliyun.com/zh/mse/user-guide/ai-cue-template

https://help.aliyun.com/zh/mse/user-guide/ai-request-response-intelligent-transformation

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

AI LLM api-gateway Token Limiting

Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.