Seamlessly Switch Between DeepSeek‑R1 and QwQ‑32B with Higress AI Gateway
Learn how to deploy the new QwQ‑32B inference model alongside DeepSeek‑R1 using the Higress AI gateway, covering environment setup, model configuration, routing, token‑level rate limiting, content safety, semantic caching, and advanced features like automatic fallback and internet‑search integration.
Overview
Alibaba's Tongyi QwQ‑32B model (32 B parameters) matches the performance of DeepSeek‑R1 (671 B parameters) while offering far lower deployment costs. Personal users can run it locally on smaller devices, and enterprises can reduce API inference costs by over 90%.
Cost Comparison
API pricing:
DeepSeek‑R1: $0.14 per million input tokens, $2.19 per million output tokens
QwQ‑32B: $0.20 per million input tokens, $0.20 per million output tokens
Self‑hosted (Alibaba Cloud PAI) costs:
DeepSeek‑R1: at least 2 × 8‑card H20 servers, costing over ¥1,000,000 per year
QwQ‑32B: 1 × single‑card H20 server, costing over ¥50,000 per year
Tutorial: Switching Models with Higress AI Gateway
1. Environment Preparation
Install Higress (requires Docker) with a single command:
```shell
# One-click install of Higress (requires a Docker environment)
curl -sS https://higress.cn/ai-gateway/install.sh | bash
```

After installation, access the console at http://localhost:8001 and complete the initialization.
2. Model Integration Configuration
In the Higress console, add connections for both DeepSeek‑R1 and QwQ‑32B. Use the vendor name for third‑party models or OpenAI‑compatible mode for self‑hosted models.
For self‑hosted models, provide the baseURL of the OpenAI‑compatible endpoint.
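As an illustration, a self-hosted QwQ‑32B endpoint could be registered in OpenAI‑compatible mode along these lines (the field names and URL below are a sketch, not the exact Higress schema — configure the equivalent values in the console):

```yaml
# Illustrative provider entry for a self-hosted, OpenAI-compatible model.
provider:
  type: openai                       # OpenAI-compatible mode
  apiTokens:
    - "sk-local-placeholder"         # hypothetical key for the local endpoint
  openaiCustomUrl: "http://qwq-32b.internal:8000/v1/chat/completions"  # baseURL (hypothetical host)
```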
3. Create Routing Rules
Define routes that match the model name and forward requests to the corresponding backend.
Example: a route named my-qwq-32b forwards to the QwQ‑32B service; my-deepseek-r1 forwards to DeepSeek‑R1.
Apply AI‑Token rate‑limit plugin to DeepSeek‑R1 and configure fallback to QwQ‑32B when limits are hit.
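A token rate limit on the DeepSeek‑R1 route might be configured roughly as follows (illustrative structure and numbers; consult the ai-token-ratelimit plugin documentation for the exact schema):

```yaml
# Sketch: cap token consumption on the DeepSeek-R1 route per consumer.
rule_name: deepseek-r1-limit
rule_items:
  - limit_by_per_consumer: ""        # meter each consumer independently
    limit_keys:
      - key: "*"
        token_per_minute: 10000      # hypothetical per-minute token budget
redis:
  service_name: redis.static         # token counters are kept in Redis
```

When the budget is exhausted, requests on this route are rejected, and the fallback configuration routes them to QwQ‑32B instead.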
4. Client Call Example (Python)
```python
from openai import OpenAI

# Unified access via the Higress gateway
client = OpenAI(
    api_key="higress-api-key",           # generated in the Higress console
    base_url="http://localhost:8080/v1"  # Higress gateway address
)

# Call the DeepSeek model
response_deepseek = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "解释量子计算"}]  # "Explain quantum computing"
)

# Call the QwQ model
response_qwq = client.chat.completions.create(
    model="qwq-32b",
    messages=[{"role": "user", "content": "写一首七言诗"}]  # "Write a seven-character poem"
)
```

5. QwQ‑32B Performance
QwQ‑32B delivers extremely fast token output on a single H20 card.
Advanced Features of Higress AI Gateway
1. Multi‑Model Service
Supports simultaneous deployment of multiple LLMs (e.g., DeepSeek, Qwen, self‑hosted models) with front‑end selection and fallback capabilities.
2. Consumer Authentication
API‑Key based multi‑tenant isolation, RBAC for fine‑grained permission control, and audit logging for compliance.
3. Model Auto‑Switch (Fallback)
If a model request fails, the gateway automatically falls back to an alternative model to ensure service continuity.
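The gateway handles this fallback transparently, but the idea can be sketched client-side too. Here a stubbed call function stands in for the real chat-completion API; apart from the model names, everything below is illustrative:

```python
def complete_with_fallback(prompt, models, call_model):
    """Try each model in order; return the first successful response.

    `call_model(model, prompt)` stands in for a real chat-completion call
    and is expected to raise on failure (rate limit, timeout, ...).
    """
    last_error = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as err:  # in practice, catch specific API errors
            last_error = err
    raise RuntimeError(f"all models failed: {last_error}")

# Stub that simulates DeepSeek-R1 being rate-limited.
def fake_call(model, prompt):
    if model == "deepseek-r1":
        raise TimeoutError("rate limited")
    return f"{model} answered: {prompt}"

used, answer = complete_with_fallback("hi", ["deepseek-r1", "qwq-32b"], fake_call)
# `used` is "qwq-32b": the first model raised, so the request fell through
```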
4. Token‑Level Rate Limiting
Limits token consumption per consumer or API key, preventing resource exhaustion and reducing costs.
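The mechanism behind this can be sketched as a token bucket. The gateway meters actual LLM tokens per consumer; in this simplified standalone model, `spend` just draws down a per-minute budget that refills over time:

```python
import time

class TokenBudget:
    """Minimal token-bucket sketch of per-consumer, token-level rate limiting."""

    def __init__(self, tokens_per_minute):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.refill_rate = tokens_per_minute / 60.0  # tokens regained per second
        self.last = time.monotonic()

    def spend(self, tokens):
        # Refill the budget in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.available = min(self.capacity,
                             self.available + (now - self.last) * self.refill_rate)
        self.last = now
        if tokens > self.available:
            return False   # over budget: reject, or fall back to another model
        self.available -= tokens
        return True

budget = TokenBudget(tokens_per_minute=1000)
print(budget.spend(800))   # True: within budget
print(budget.spend(800))   # False: budget exhausted for now
```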
5. Content Safety & Compliance
Integrates Alibaba Cloud content‑security services to filter harmful inputs/outputs and enforce industry‑specific compliance (finance, healthcare, etc.).
6. Semantic Caching
Caches LLM responses in Redis to reduce token costs and latency for repetitive queries.
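The core idea — reuse a cached answer when a new query is semantically close to a previous one — can be sketched in a few lines. The toy character-count embedding below is purely illustrative; a real deployment uses an embedding model and stores vectors in Redis:

```python
import math

def embed(text):
    """Toy bag-of-letters embedding; a real cache would call an embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached response when a query is similar enough to a stored one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, response); kept in Redis in the real setup

    def get(self, query):
        qv = embed(query)
        for ev, resp in self.entries:
            if cosine(qv, ev) >= self.threshold:
                return resp
        return None  # cache miss: forward the request to the LLM

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is quantum computing", "cached answer")
print(cache.get("what is quantum computing?"))  # hit: near-identical query
print(cache.get("write a poem"))                # miss: prints None
```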
7. Internet Search + Full‑Text Retrieval
Enhances LLMs with search‑augmented generation, retrieving full web pages rather than just snippets.
8. Model Observability
Provides detailed metrics (token consumption per consumer/model, rate‑limit stats, cache hits, security alerts) and integrates with Alibaba Cloud logging and tracing services.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.