
How to Connect Qwen LLMs with Higress AI Gateway: A Hands‑On Guide

This tutorial walks through setting up a local k3d cluster, installing Higress, and using its AI plugins—including AI Proxy, AI JSON formatter, AI Agent, and AI Statistics—to integrate and observe Alibaba Cloud's Qwen large language models across various use cases such as weather and flight queries.


Preface

What is AI Gateway

AI Gateway is an AI‑native API Gateway that extends traditional API Gateway capabilities to meet AI‑native requirements, such as token‑based rate limiting, multi‑model routing, and enhanced observability for A/B testing and tracing.

Extend traditional QPS throttling to token throttling.

Extend load‑balancing, retry, and fallback to support multiple large‑model providers, improving stability.

Enhance observability to enable model‑level A/B testing and conversation‑context tracing.

AI Gateway illustration (image omitted)

Higress is Alibaba Cloud’s open‑source AI Gateway that provides a one‑stop AI plugin set and enhanced backend model scheduling, making AI‑gateway integration convenient and efficient. It offers a rich plugin library covering AI, traffic management, and security, and supports hot‑swappable Wasm plugins written in multiple languages.

This article is the first in a series on Higress AI plugins, focusing on connecting the Qwen large language model and using Higress AI Agent, AI JSON formatting, and other plugins for advanced functionality.

Introduction to Qwen Large Language Models

Qwen is Alibaba Cloud’s self‑developed large language model family, offering services across various domains. It includes three main variants:

Qwen‑Max: The most capable model, suitable for complex, multi‑step tasks.

Qwen‑Plus: Balanced performance and speed, positioned between Max and Turbo.

Qwen‑Turbo: The fastest and cheapest model, ideal for simple tasks.

Environment Preparation

For the experiment we use k3d to quickly spin up a local Kubernetes cluster.

Create Cluster

<code>k3d cluster create higress-ai-cluster</code>

Install Higress

Install the latest Higress version with Helm:

<code>helm repo add higress.io https://higress.io/helm-charts
helm install --version 2.0.0-rc.1 \
  higress -n higress-system higress.io/higress \
  --create-namespace --render-subchart-notes</code>

After all Higress pods are running, forward the gateway service to a local port:

<code>kubectl port-forward -n higress-system svc/higress-gateway 10000:80</code>

Get Experiment Code

<code>git clone https://github.com/cr7258/hands-on-lab.git
cd hands-on-lab/gateway/higress/ai-plugins</code>

Set Environment Variables

Provide your Qwen API token and set model variables:

<code>export API_TOKEN=<YOUR_QWEN_API_TOKEN>
export LLM="qwen"
export LLM_DOMAIN="dashscope.aliyuncs.com"</code>

AI Proxy Plugin

The AI Proxy plugin implements an OpenAI‑compatible proxy, converting OpenAI‑style requests to the target LLM’s API. Higress already supports dozens of models, including Qwen, Baidu Wenxin, Claude, etc.

Use <code>envsubst</code> to substitute environment variables into the YAML and apply it:

<code>envsubst < 01-ai-proxy.yaml | kubectl apply -f -</code>

The plugin is written in Go and compiled as a Wasm extension. Only the model type and API token need to be configured:

<code>apiVersion: extensions.higress.io/v1alpha1
kind: WasmPlugin
metadata:
  name: ai-proxy
  namespace: higress-system
spec:
  phase: UNSPECIFIED_PHASE
  priority: 100
  matchRules:
    - config:
        provider:
          type: ${LLM}
          apiTokens:
            - ${API_TOKEN}
        ingress:
          - ${LLM}
  url: oci://higress-registry.cn-hangzhou.cr.aliyuncs.com/plugins/ai-proxy:1.0.0</code>

Because the Qwen service resides outside the cluster, a DNS‑based <code>McpBridge</code> and an Ingress are required:

<code>apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    higress.io/backend-protocol: HTTPS
    higress.io/destination: ${LLM}.dns
    higress.io/proxy-ssl-name: ${LLM_DOMAIN}
    higress.io/proxy-ssl-server-name: "on"
  labels:
    higress.io/resource-definer: higress
  name: ${LLM}
  namespace: higress-system
spec:
  ingressClassName: higress
  rules:
    - http:
        paths:
          - backend:
              resource:
                apiGroup: networking.higress.io
                kind: McpBridge
                name: default
            path: /
            pathType: Prefix
---
apiVersion: networking.higress.io/v1
kind: McpBridge
metadata:
  name: default
  namespace: higress-system
spec:
  registries:
    - domain: ${LLM_DOMAIN}
      name: ${LLM}
      port: 443
      type: dns</code>

Test the proxy with the Qwen‑Max‑0403 model:

<code>curl --location 'http://127.0.0.1:10000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model":"qwen-max-0403",
    "messages":[{"role":"user","content":"你是谁?"}]
}'</code>

Response (truncated):

<code>{
  "id":"930774f8-7fc9-9d97-8d13-fc9201ae66f9",
  "choices":[{"index":0,"message":{"role":"assistant","content":"我是阿里云开发的一款超大规模语言模型,我叫通义千问……"},"finish_reason":"stop"}],
  "created":1726192573,
  "model":"qwen-max-0403",
  "usage":{"prompt_tokens":11,"completion_tokens":111,"total_tokens":122}
}</code>
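Because the proxy is OpenAI‑compatible, any OpenAI‑style client can target the gateway directly. Below is a minimal Python sketch using only the standard library; the endpoint and model name follow the steps above, while the helper names are our own illustration:

```python
import json
from urllib import request

GATEWAY = "http://127.0.0.1:10000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(model: str, prompt: str) -> str:
    """POST the payload through the Higress gateway and return the reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = request.Request(GATEWAY, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the port-forward from the environment preparation step active:
# print(ask("qwen-max-0403", "你是谁?"))
```

Note that the client never handles the Qwen API token: the AI Proxy plugin injects the token configured in the WasmPlugin.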

Clean up the plugin after the test:

<code>envsubst < 01-ai-proxy.yaml | kubectl delete -f -</code>

AI JSON Formatting Plugin

LLM outputs are often informal and unstructured. The AI JSON Formatting plugin converts LLM responses into a structured JSON format based on a user‑provided <code>jsonSchema</code>.

<code>jsonSchema:
  title: ReasoningSchema
  type: object
  properties:
    reasoning_steps:
      type: array
      items:
        type: string
      description: The reasoning steps leading to the final conclusion.
    answer:
      type: string
      description: The final answer, taking the reasoning steps into account.
  required:
    - reasoning_steps
    - answer
  additionalProperties: false</code>
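To illustrate what the plugin guarantees, the shape enforced by this <code>jsonSchema</code> can be checked with a few lines of Python (a client‑side sketch mirroring the schema above, not part of the plugin itself):

```python
import json

REQUIRED = {"reasoning_steps", "answer"}

def matches_reasoning_schema(text: str) -> bool:
    """Check that a response body is valid JSON with exactly the fields
    required by ReasoningSchema (additionalProperties: false)."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != REQUIRED:
        return False
    return (isinstance(obj["reasoning_steps"], list)
            and all(isinstance(s, str) for s in obj["reasoning_steps"])
            and isinstance(obj["answer"], str))

print(matches_reasoning_schema('{"reasoning_steps": ["2x = 10"], "answer": "x = 5"}'))  # True
print(matches_reasoning_schema('not json'))  # False
```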

Apply the plugin:

<code>envsubst < 02-ai-json-resp.yaml | kubectl apply -f -</code>

Query the Qwen‑Max‑0403 model and receive a JSON‑formatted response:

<code>curl --location 'http://127.0.0.1:10000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model":"qwen-max-0403",
    "messages":[{"role":"user","content":"2x + 7 = 17,x 等于多少"}]
}'</code>
<code>{
  "reasoning_steps":[
    "给定方程:2x + 7 = 17",
    "步骤1:首先,从等式的两边减去常数项 7,以消掉加在 x 上的 7:",
    "   2x + 7 - 7 = 17 - 7",
    "得到:2x = 10",
    "步骤2:然后,为了得到 x 的值,我们需要将两边都除以 x 的系数 2:",
    "   2x / 2 = 10 / 2",
    "得到:x = 5"
  ],
  "answer":"因此,x 的值为 5."
}</code>

Qwen‑Max yields correct JSON. Among the cheaper models, Qwen‑Turbo fails to produce valid JSON, while Qwen‑Plus succeeds:

<code>export LLM_MODEL="qwen-turbo"
envsubst < 02-ai-json-resp.yaml | kubectl apply -f -</code>
<code>{"Code":1006,"Msg":"retry count exceeds max retry count: response body does not contain the valid json: invalid character '[' in string escape code"}</code>
<code>export LLM_MODEL="qwen-plus"
envsubst < 02-ai-json-resp.yaml | kubectl apply -f -</code>
<code>{
  "reasoning_steps":["2x + 7 = 17","首先,减去7:2x = 17 - 7","2x = 10","然后,除以2:x = 10 / 2","x = 5"],
  "answer":"x等于5"
}</code>

Clean up:

<code>envsubst < 02-ai-json-resp.yaml | kubectl delete -f -</code>

AI Agent Plugin

The AI Agent plugin, based on the ReAct paradigm, enables zero‑code construction of AI agents that can call external APIs (e.g., weather or flight services) to fulfill complex user requests.

AI Agent diagram (image omitted)
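The ReAct loop can be sketched in a few lines: the model alternates Thought/Action steps, the runtime executes each Action against an external API and feeds an Observation back, and the loop ends when the model emits a Final Answer. The sketch below stubs the model for illustration; names like <code>react_loop</code> are our own, not Higress APIs:

```python
def react_loop(question, call_llm, tools, max_steps=5):
    """Minimal ReAct loop: alternate model calls and tool invocations."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = call_llm(transcript)          # model emits Thought/Action or a final answer
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        if "Action:" in reply:
            tool_name = reply.split("Action:", 1)[1].split("\n", 1)[0].strip()
            observation = tools[tool_name]()  # invoke the external API
            transcript += f"Observation: {observation}\n"
    return None

# Stubbed model: asks for the weather tool once, then answers.
def stub_llm(transcript):
    if "Observation:" in transcript:
        return "Thought: I have the data.\nFinal Answer: 24°C"
    return "Thought: I need the current weather.\nAction: get_weather_now"

tools = {"get_weather_now": lambda: '{"temperature": "24"}'}
print(react_loop("今天北京的温度是多少?", stub_llm, tools))  # 24°C
```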

We will build a weather assistant (using Seniverse) and a flight assistant (using AviationStack). Register for the services and set the tokens:

<code>export LLM_MODEL="qwen-max-0403"
export LLM_PATH="/compatible-mode/v1/chat/completions"
export SENIVERSE_API_TOKEN=<YOUR_SENIVERSE_API_TOKEN>
export AVIATIONSTACK_API_TOKEN=<YOUR_AVIATIONSTACK_API_TOKEN>

envsubst < 03-ai-agent.yaml | kubectl apply -f -</code>

OpenAPI specification for the Seniverse weather API (excerpt):

<code>openapi: 3.1.0
info:
  title: 心知天气
  description: 获取天气信息
  version: v1.0.0
servers:
  - url: https://api.seniverse.com
paths:
  /v3/weather/now.json:
    get:
      description: 获取指定城市的天气实况
      operationId: get_weather_now
      parameters:
        - name: location
          in: query
          description: 所查询的城市
          required: true
          schema:
            type: string
        - name: language
          in: query
          description: 返回语言
          required: true
          schema:
            type: string
            default: zh-Hans
            enum: [zh-Hans, en, ja]
        - name: unit
          in: query
          description: 温度单位
          required: true
          schema:
            type: string
            default: c
            enum: [c, f]
</code>
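For illustration, the request an agent derives from this spec when invoking <code>get_weather_now</code> looks roughly like the following (a hypothetical sketch of the URL construction, not Higress's implementation; the <code>key</code> token parameter comes from the Seniverse account rather than the spec excerpt):

```python
from urllib.parse import urlencode

BASE_URL = "https://api.seniverse.com"  # the spec's `servers` entry

def build_get_weather_now(key: str, location: str,
                          language: str = "zh-Hans", unit: str = "c") -> str:
    """Build the GET URL for get_weather_now, applying the schema defaults."""
    assert language in ("zh-Hans", "en", "ja")  # enum constraint from the spec
    assert unit in ("c", "f")                   # enum constraint from the spec
    query = urlencode({"key": key, "location": location,
                       "language": language, "unit": unit})
    return f"{BASE_URL}/v3/weather/now.json?{query}"

print(build_get_weather_now("TOKEN", "beijing"))
# https://api.seniverse.com/v3/weather/now.json?key=TOKEN&location=beijing&language=zh-Hans&unit=c
```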

Query Beijing temperature with the agent:

<code>curl --location 'http://127.0.0.1:10000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model":"qwen-max-0403",
    "messages":[{"role":"user","content":"今天北京的温度是多少?"}]
}'</code>
<code>{"content":" 北京今天的温度是24摄氏度。"}</code>

Compare Beijing and Urumqi temperatures (requires multiple API calls):

<code>curl --location 'http://127.0.0.1:10000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model":"qwen-max-0403",
    "messages":[{"role":"user","content":"今天北京和乌鲁木齐哪里温度更高?"}]
}'</code>
<code>{"content":" 今天北京的温度(24℃)比乌鲁木齐(13℃)高。"}</code>

Directly verify the weather API responses:

<code>curl -s "http://api.seniverse.com/v3/weather/now.json?key=${SENIVERSE_API_TOKEN}&location=beijing&language=zh-Hans&unit=c" | jq</code>
<code>{"results":[{"location":{"name":"北京","country":"CN"},"now":{"temperature":"24"}}]}</code>
<code>curl -s "http://api.seniverse.com/v3/weather/now.json?key=${SENIVERSE_API_TOKEN}&location=urumqi&language=zh-Hans&unit=c" | jq</code>
<code>{"results":[{"location":{"name":"乌鲁木齐","country":"CN"},"now":{"temperature":"13"}}]}</code>

Flight assistant example (AviationStack):

<code>curl --location 'http://127.0.0.1:10000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model":"qwen-max-0403",
    "messages":[{"role":"user","content":"帮我查一下今天从上海去乌鲁木齐今天最早的还未起飞的航班信息"}]
}'</code>
<code>{"content":" 今天从上海去乌鲁木齐最早的还未起飞的航班信息如下:\n- 航班日期:2024-09-13\n- 航班状态:scheduled(未起飞)\n- 出发机场:上海虹桥国际机场 (SHA)\n- 出发时间:2024-09-13T09:20:00+00:00\n- 到达机场:乌鲁木齐机场 (URC)\n- 预计到达时间:2024-09-13T14:40:00+00:00\n- 承运航空公司:吉祥航空 (HO)\n航班号为HO5594,实际起飞时间待定。"}</code>

Combined task: find the city with lower temperature and the earliest flight from Shanghai to that city:

<code>curl --location 'http://127.0.0.1:10000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model":"qwen-max-0403",
    "messages":[{"role":"user","content":"今天北京和乌鲁木齐哪里温度更高?帮我查一下今天从上海去温度低的那个城市最早的还未起飞的航班信息"}]
}'</code>
<code>{"content":"今天乌鲁木齐的气温(13℃)低于北京(24℃)。 今天从上海出发前往乌鲁木齐的最早未起飞航班是吉祥航空的HO5594航班,计划于2024年9月13日09:20从上海虹桥国际机场起飞。"}</code>

Testing other models shows the weaker variants struggle with the ReAct format: Qwen‑Turbo leaks its entire Thought/Action scratchpad without returning a real temperature, and Qwen‑Plus answers but still exposes its reasoning prefix:

<code>export LLM_MODEL="qwen-turbo"
envsubst < 03-ai-agent.yaml | kubectl apply -f -
curl --location 'http://127.0.0.1:10000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{"model":"qwen-turbo","messages":[{"role":"user","content":"今天北京的温度是多少?"}]}'
</code>
<code>{"content":"Thought: 需要调用获取指定城市的天气实况API来查询北京今天的温度。\nAction: get_weather_now\nAction Input: {\"location\": \"北京\", \"language\": \"zh-Hans\", \"unit\": \"c\"}\nObservation: 查询结果返回了北京今天的实时天气情况...\nFinal Answer: 北京今天的温度为XX℃。"}</code>
<code>export LLM_MODEL="qwen-plus"
envsubst < 03-ai-agent.yaml | kubectl apply -f -
curl --location 'http://127.0.0.1:10000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{"model":"qwen-plus","messages":[{"role":"user","content":"今天北京的温度是多少?"}]}'
</code>
<code>{"content":"Thought: 需要获取北京今天的天气情况...\nFinal Answer: 北京今天的温度是22℃。"}</code>

Clean up the agent resources:

<code>envsubst < 03-ai-agent.yaml | kubectl delete -f -</code>

AI Statistics Plugin

The AI Statistics plugin adds observability by counting input and output tokens and can integrate with tracing systems such as SkyWalking.

Install SkyWalking via Helm:

<code>helm upgrade --version 2.0.0-rc.1 --install \
  higress -n higress-system \
  --set global.onlyPushRouteCluster=false \
  --set higress-core.tracing.enable=true \
  --set higress-core.tracing.skywalking.service=skywalking-oap-server.op-system.svc.cluster.local \
  --set higress-core.tracing.skywalking.port=11800 \
  higress.io/higress</code>

Deploy the SkyWalking components:

<code>kubectl apply -f 04-skywalking.yaml</code>

Apply the AI Statistics plugin:

<code>envsubst < 04-ai-statistics.yaml | kubectl apply -f -</code>

A custom <code>tracing_span</code> configuration adds the user content and model name to the span:

<code>tracing_span:
  - key: user_content
    value_source: request_body
    value: messages.0.content
  - key: llm_model
    value_source: request_body
    value: model
</code>
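The <code>value</code> field is a dot‑separated path into the JSON request body, with numeric segments indexing into arrays. A small sketch of how such a path resolves (an illustration of the configuration semantics, not the plugin's code):

```python
import json

def resolve_path(body, path: str):
    """Resolve a dot-separated path like 'messages.0.content' against a
    parsed JSON request body; numeric segments index into arrays."""
    node = body
    for seg in path.split("."):
        node = node[int(seg)] if seg.isdigit() else node[seg]
    return node

request_body = json.loads(
    '{"model": "qwen-max-0403", '
    '"messages": [{"role": "user", "content": "你是谁?"}]}'
)
print(resolve_path(request_body, "messages.0.content"))  # 你是谁?
print(resolve_path(request_body, "model"))               # qwen-max-0403
```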

Send a request and view the trace in SkyWalking:

<code>curl --location 'http://127.0.0.1:10000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{"model":"qwen-max-0403","messages":[{"role":"user","content":"你是谁?"}]}'
</code>

After adding an entry for <code>skywalking.higress.io</code> (pointing at 127.0.0.1) to <code>/etc/hosts</code>, open <code>http://skywalking.higress.io:10000</code> to see the trace UI (image omitted).

Span tags display input/output token counts, user query, and model name (image omitted).

Token metrics can be queried from Prometheus exposed by the gateway:

<code>export HIGRESS_GATEWAY_POD=$(kubectl get pods -l app=higress-gateway -o jsonpath="{.items[0].metadata.name}" -n higress-system)
kubectl exec "$HIGRESS_GATEWAY_POD" -n higress-system -- curl -sS http://127.0.0.1:15020/stats/prometheus | grep "token"
</code>
<code># TYPE route_upstream_model_input_token counter
route_upstream_model_input_token{ai_route="qwen",ai_cluster="outbound|443||qwen.dns",ai_model="qwen-max-0403"} 26
# TYPE route_upstream_model_output_token counter
route_upstream_model_output_token{ai_route="qwen",ai_cluster="outbound|443||qwen.dns",ai_model="qwen-max-0403"} 856
</code>
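These counters can also be scraped and aggregated programmatically. A minimal sketch that parses the Prometheus text format and sums tokens per model (assuming the metric names shown above; the helper is our own illustration):

```python
import re
from collections import defaultdict

# Matches lines such as:
# route_upstream_model_input_token{ai_route="qwen",...,ai_model="qwen-max-0403"} 26
METRIC_RE = re.compile(
    r'^route_upstream_model_(input|output)_token\{[^}]*ai_model="([^"]+)"[^}]*\} (\d+)$'
)

def token_totals(exposition: str) -> dict:
    """Sum input/output token counters per model from Prometheus text output."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for line in exposition.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            direction, model, value = m.groups()
            totals[model][direction] += int(value)
    return dict(totals)

sample = '''# TYPE route_upstream_model_input_token counter
route_upstream_model_input_token{ai_route="qwen",ai_cluster="outbound|443||qwen.dns",ai_model="qwen-max-0403"} 26
route_upstream_model_output_token{ai_route="qwen",ai_cluster="outbound|443||qwen.dns",ai_model="qwen-max-0403"} 856'''
print(token_totals(sample))  # {'qwen-max-0403': {'input': 26, 'output': 856}}
```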

Clean up statistics resources and SkyWalking:

<code>envsubst < 04-ai-statistics.yaml | kubectl delete -f -
kubectl delete -f 04-skywalking.yaml
</code>

Delete the local k3d cluster:

<code>k3d cluster delete higress-ai-cluster</code>

Summary

This article detailed multiple Higress AI plugins and their use cases, demonstrating how to connect Qwen LLMs via the AI Proxy plugin, transform unstructured outputs into structured JSON, and build zero‑code AI agents for weather and flight queries. It also highlighted the AI Statistics plugin’s role in improving AI observability through token accounting and full‑stack tracing, and compared the performance of Qwen‑Max, Qwen‑Plus, and Qwen‑Turbo across these scenarios.

Tags: Observability, Kubernetes, AI Plugins, Qwen, Higress, AI Gateway
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
