Cloud Native 11 min read

How to Build an AI‑Native API Gateway with Higress: ChatGPT‑Next‑Web, RAG, Token Limits & More

This guide walks through creating a full‑featured AI‑native API gateway using Higress, covering architecture setup, AI agent integration, observability, content security, token rate limiting, caching, retrieval‑augmented generation, prompt templates, and intelligent request/response transformation with concrete configuration examples.

Alibaba Cloud Native
Alibaba Cloud Native
Alibaba Cloud Native
How to Build an AI‑Native API Gateway with Higress: ChatGPT‑Next‑Web, RAG, Token Limits & More

Overview

The article shows how to use Higress, an open‑source API gateway, to create an AI‑native gateway that routes requests to large language model (LLM) providers such as ChatGPT‑Next‑Web and Alibaba Cloud Qwen. The gateway can expose an OpenAI‑compatible endpoint, balance traffic across providers, and add observability, security, rate‑limiting, caching, retrieval‑augmented generation (RAG), prompt engineering, and request/response transformation capabilities.

AI Agent Plugin

Configuring the AI agent plugin enables multi‑provider load balancing and token‑based rate limiting. Example configuration for the Qwen provider:

provider:
  type: qwen
  apiTokens:
    - sk-xxxxxxxxxxxxxxxxxxxxxx
  timeout: 1200000
  modelMapping:
    'gpt-3.5-turbo': qwen-turbo
    'gpt-4': qwen-max
    '*': qwen-max

The plugin visualises request flow and presents an OpenAI‑compatible endpoint.

AI Agent Architecture
AI Agent Architecture

AI Observability Plugin

When enabled on the llm route, the observability plugin records token usage per route, service, and model, feeding the data into Higress telemetry for fine‑grained monitoring.

Observability Effect
Observability Effect

AI Content Security Plugin

This plugin integrates Alibaba Cloud Content Security to filter harmful or non‑compliant model outputs. After enabling the plugin on the llm route, each response is inspected and blocked if it violates policy.

serviceSource: dns
serviceName: green-cip
servicePort: 443
domain: green-cip.cn-hangzhou.aliyuncs.com
ak: xxxxxxxxxxxxxxxxx
sk: xxxxxxxxxxxxxxxxx
Content Security Effect
Content Security Effect

AI Token Rate‑Limiting Plugin

The ai-token-ratelimit plugin enforces per‑IP token quotas using a Redis store. The example limits each IP to 100 tokens per minute and returns HTTP 429 when the limit is exceeded.

rule_name: default_rule
rule_items:
  - limit_by_per_ip: from-remote-addr
    limit_keys:
      - key: 0.0.0.0/0
        token_per_minute: 100
redis:
  service_name: redis.static
  service_port: 6379
  username: xxxxxx
  password: xxxxxx
rejected_code: 429
rejected_msg: 您的请求频率过高,请稍后再试。
Token Limiting Effect
Token Limiting Effect

AI Cache Plugin

The cache plugin stores LLM responses in Redis. Identical subsequent requests are served instantly from cache, reducing latency and cost.

redis:
  serviceName: redis.static
  servicePort: 6379
  timeout: 2000
  username: xxxxxx
  password: xxxxxx
Cache Effect
Cache Effect

AI Retrieval‑Augmented Generation (RAG) Plugin

RAG combines LLM generation with vector search from Alibaba Cloud Vector Retrieval Service, allowing the gateway to supplement model responses with up‑to‑date knowledge.

dashscope:
  apiKey: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
  serviceName: qwen
  servicePort: 443
  domain: dashscope.aliyuncs.com
dashvector:
  apiKey: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
  serviceName: dashvector
  servicePort: 443
  domain: vrs-cn-xxxxxxxxxxxxxx.dashvector.cn-hangzhou.aliyuncs.com
  collection: xxxxxxxxxxxxxx
RAG Effect
RAG Effect

Prompt Engineering Plugins

Two plugins support prompt templates and decorators. Templates let users define reusable request bodies; decorators can prepend or append messages to any request.

templates:
- name: "developer-chat"
  template:
    model: gpt-3.5-turbo
    messages:
    - role: system
      content: "你是一个 {{program}} 专家, 你平时使用的编程语言为 {{language}}"
    - role: user
      content: "帮我写一个 {{program}} 程序, 你的返回结果里面应该只包含python代码"
prepend:
- role: system
  content: "请使用英语回答问题."
append:
- role: user
  content: "每次回答完问题,尝试进行反问"

Intelligent Request/Response Transformation Plugin

The plugin can modify inbound requests or outbound responses, e.g., converting XML responses to JSON and adjusting headers.

response:
  enable: true
  prompt: "帮我修改以下HTTP应答信息,要求:1. content-type修改为application/json;2. body由xml转化为json;3. 移除content-length。"
provider:
  serviceName: qwen
  domain: dashscope.aliyuncs.com
  apiKey: sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx

When applied to an

httpbin
/xml

endpoint, the plugin returns a JSON representation of the original XML.

Transformation Effect
Transformation Effect

References

https://github.com/ChatGPTNextWeb/ChatGPT-Next-Web

https://help.aliyun.com/zh/mse/user-guide/ai-agent

https://help.aliyun.com/zh/mse/user-guide/ai-observable

https://help.aliyun.com/zh/mse/user-guide/ai-content-security

https://help.aliyun.com/zh/mse/user-guide/ai-token-current-limiting

https://help.aliyun.com/zh/mse/user-guide/ai-cache

https://help.aliyun.com/zh/mse/user-guide/ai-rag

https://help.aliyun.com/zh/mse/user-guide/ai-cue-template

https://help.aliyun.com/zh/mse/user-guide/ai-request-response-intelligent-transformation

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

AILLMapi-gatewayToken Limiting
Alibaba Cloud Native
Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.