How to Build an AI‑Native API Gateway with Higress: ChatGPT‑Next‑Web, RAG, Token Limits & More
This guide walks through creating a full‑featured AI‑native API gateway using Higress, covering architecture setup, AI agent integration, observability, content security, token rate limiting, caching, retrieval‑augmented generation, prompt templates, and intelligent request/response transformation with concrete configuration examples.
Overview
The article shows how to use Higress, an open‑source API gateway, to create an AI‑native gateway that routes requests to large language model (LLM) providers such as ChatGPT‑Next‑Web and Alibaba Cloud Qwen. The gateway can expose an OpenAI‑compatible endpoint, balance traffic across providers, and add observability, security, rate‑limiting, caching, retrieval‑augmented generation (RAG), prompt engineering, and request/response transformation capabilities.
AI Agent Plugin
Configuring the AI agent plugin enables multi‑provider load balancing and token‑based rate limiting. Example configuration for the Qwen provider:
provider:
type: qwen
apiTokens:
- sk-xxxxxxxxxxxxxxxxxxxxxx
timeout: 1200000
modelMapping:
'gpt-3.5-turbo': qwen-turbo
'gpt-4': qwen-max
'*': qwen-maxThe plugin visualises request flow and presents an OpenAI‑compatible endpoint.
AI Observability Plugin
When enabled on the llm route, the observability plugin records token usage per route, service, and model, feeding the data into Higress telemetry for fine‑grained monitoring.
AI Content Security Plugin
This plugin integrates Alibaba Cloud Content Security to filter harmful or non‑compliant model outputs. After enabling the plugin on the llm route, each response is inspected and blocked if it violates policy.
serviceSource: dns
serviceName: green-cip
servicePort: 443
domain: green-cip.cn-hangzhou.aliyuncs.com
ak: xxxxxxxxxxxxxxxxx
sk: xxxxxxxxxxxxxxxxxAI Token Rate‑Limiting Plugin
The ai-token-ratelimit plugin enforces per‑IP token quotas using a Redis store. The example limits each IP to 100 tokens per minute and returns HTTP 429 when the limit is exceeded.
rule_name: default_rule
rule_items:
- limit_by_per_ip: from-remote-addr
limit_keys:
- key: 0.0.0.0/0
token_per_minute: 100
redis:
service_name: redis.static
service_port: 6379
username: xxxxxx
password: xxxxxx
rejected_code: 429
rejected_msg: 您的请求频率过高,请稍后再试。AI Cache Plugin
The cache plugin stores LLM responses in Redis. Identical subsequent requests are served instantly from cache, reducing latency and cost.
redis:
serviceName: redis.static
servicePort: 6379
timeout: 2000
username: xxxxxx
password: xxxxxxAI Retrieval‑Augmented Generation (RAG) Plugin
RAG combines LLM generation with vector search from Alibaba Cloud Vector Retrieval Service, allowing the gateway to supplement model responses with up‑to‑date knowledge.
dashscope:
apiKey: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
serviceName: qwen
servicePort: 443
domain: dashscope.aliyuncs.com
dashvector:
apiKey: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
serviceName: dashvector
servicePort: 443
domain: vrs-cn-xxxxxxxxxxxxxx.dashvector.cn-hangzhou.aliyuncs.com
collection: xxxxxxxxxxxxxxPrompt Engineering Plugins
Two plugins support prompt templates and decorators. Templates let users define reusable request bodies; decorators can prepend or append messages to any request.
templates:
- name: "developer-chat"
template:
model: gpt-3.5-turbo
messages:
- role: system
content: "你是一个 {{program}} 专家, 你平时使用的编程语言为 {{language}}"
- role: user
content: "帮我写一个 {{program}} 程序, 你的返回结果里面应该只包含python代码" prepend:
- role: system
content: "请使用英语回答问题."
append:
- role: user
content: "每次回答完问题,尝试进行反问"Intelligent Request/Response Transformation Plugin
The plugin can modify inbound requests or outbound responses, e.g., converting XML responses to JSON and adjusting headers.
response:
enable: true
prompt: "帮我修改以下HTTP应答信息,要求:1. content-type修改为application/json;2. body由xml转化为json;3. 移除content-length。"
provider:
serviceName: qwen
domain: dashscope.aliyuncs.com
apiKey: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxWhen applied to an
httpbin /xmlendpoint, the plugin returns a JSON representation of the original XML.
References
https://github.com/ChatGPTNextWeb/ChatGPT-Next-Web
https://help.aliyun.com/zh/mse/user-guide/ai-agent
https://help.aliyun.com/zh/mse/user-guide/ai-observable
https://help.aliyun.com/zh/mse/user-guide/ai-content-security
https://help.aliyun.com/zh/mse/user-guide/ai-token-current-limiting
https://help.aliyun.com/zh/mse/user-guide/ai-cache
https://help.aliyun.com/zh/mse/user-guide/ai-rag
https://help.aliyun.com/zh/mse/user-guide/ai-cue-template
https://help.aliyun.com/zh/mse/user-guide/ai-request-response-intelligent-transformation
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
