Seamlessly Switch Between DeepSeek‑R1 and QwQ‑32B with Higress AI Gateway
Learn how to deploy the new QwQ‑32B inference model alongside DeepSeek‑R1 using the Higress AI gateway, covering environment setup, model configuration, routing, token‑level rate limiting, content safety, semantic caching, and advanced features like automatic fallback and internet‑search integration.
Overview
Alibaba's Tongyi QwQ‑32B model (32 B parameters) matches the performance of DeepSeek‑R1 (671 B parameters) while offering far lower deployment costs. Personal users can run it locally on smaller devices, and enterprises can reduce API inference costs by over 90%.
Cost Comparison
API pricing:
DeepSeek‑R1: $0.14 per million input tokens, $2.19 per million output tokens
QwQ‑32B: $0.20 per million input tokens, $0.20 per million output tokens
Self‑hosted (Alibaba Cloud PAI) costs:
DeepSeek‑R1: at least 2 × 8‑card H20 servers, costing over ¥1,000,000 per year
QwQ‑32B: 1 × single‑card H20 server, costing over ¥50,000 per year
Tutorial: Switching Models with Higress AI Gateway
1. Environment Preparation
Install Higress (requires Docker) with a single command:
```shell
# One-click install of Higress (requires a Docker environment)
curl -sS https://higress.cn/ai-gateway/install.sh | bash
```

After installation, access the console at http://localhost:8001 and complete the initialization.
2. Model Integration Configuration
In the Higress console, add connections for both DeepSeek‑R1 and QwQ‑32B. Use the vendor name for third‑party models or OpenAI‑compatible mode for self‑hosted models.
For self‑hosted models, provide the baseURL of the OpenAI‑compatible endpoint.
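As an illustration, a self-hosted QwQ‑32B endpoint could be registered in OpenAI‑compatible mode along these lines (the field names and URL below are a sketch, not the exact Higress schema — configure the equivalent values in the console):

```yaml
# Illustrative provider entry for a self-hosted, OpenAI-compatible model.
provider:
  type: openai                       # OpenAI-compatible mode
  apiTokens:
    - "sk-local-placeholder"         # hypothetical key for the local endpoint
  openaiCustomUrl: "http://qwq-32b.internal:8000/v1/chat/completions"  # baseURL (hypothetical host)
```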
3. Create Routing Rules
Define routes that match the model name and forward requests to the corresponding backend.
Example: a route named my-qwq-32b forwards to the QwQ‑32B service; my-deepseek-r1 forwards to DeepSeek‑R1.
Apply AI‑Token rate‑limit plugin to DeepSeek‑R1 and configure fallback to QwQ‑32B when limits are hit.
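A token rate limit on the DeepSeek‑R1 route might be configured roughly as follows (illustrative structure and numbers; consult the ai-token-ratelimit plugin documentation for the exact schema):

```yaml
# Sketch: cap token consumption on the DeepSeek-R1 route per consumer.
rule_name: deepseek-r1-limit
rule_items:
  - limit_by_per_consumer: ""        # meter each consumer independently
    limit_keys:
      - key: "*"
        token_per_minute: 10000      # hypothetical per-minute token budget
redis:
  service_name: redis.static         # token counters are kept in Redis
```

When the budget is exhausted, requests on this route are rejected, and the fallback configuration routes them to QwQ‑32B instead.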
4. Client Call Example (Python)
```python
from openai import OpenAI

# Unified access via the Higress gateway
client = OpenAI(
    api_key="higress-api-key",           # generated in the Higress console
    base_url="http://localhost:8080/v1"  # Higress gateway address
)

# Call the DeepSeek model
response_deepseek = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "解释量子计算"}]  # "Explain quantum computing"
)

# Call the QwQ model
response_qwq = client.chat.completions.create(
    model="qwq-32b",
    messages=[{"role": "user", "content": "写一首七言诗"}]  # "Write a seven-character poem"
)
```

5. QwQ‑32B Performance
QwQ‑32B delivers extremely fast token output on a single H20 card.
Advanced Features of Higress AI Gateway
1. Multi‑Model Service
Supports simultaneous deployment of multiple LLMs (e.g., DeepSeek, Qwen, self‑hosted models) with front‑end selection and fallback capabilities.
2. Consumer Authentication
API‑Key based multi‑tenant isolation, RBAC for fine‑grained permission control, and audit logging for compliance.
3. Model Auto‑Switch (Fallback)
If a model request fails, the gateway automatically falls back to an alternative model to ensure service continuity.
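The gateway handles this fallback transparently, but the idea can be sketched client-side too. Here a stubbed call function stands in for the real chat-completion API; apart from the model names, everything below is illustrative:

```python
def complete_with_fallback(prompt, models, call_model):
    """Try each model in order; return the first successful response.

    `call_model(model, prompt)` stands in for a real chat-completion call
    and is expected to raise on failure (rate limit, timeout, ...).
    """
    last_error = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as err:  # in practice, catch specific API errors
            last_error = err
    raise RuntimeError(f"all models failed: {last_error}")

# Stub that simulates DeepSeek-R1 being rate-limited.
def fake_call(model, prompt):
    if model == "deepseek-r1":
        raise TimeoutError("rate limited")
    return f"{model} answered: {prompt}"

used, answer = complete_with_fallback("hi", ["deepseek-r1", "qwq-32b"], fake_call)
# `used` is "qwq-32b": the first model raised, so the request fell through
```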
4. Token‑Level Rate Limiting
Limits token consumption per consumer or API key, preventing resource exhaustion and reducing costs.
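The mechanism behind this can be sketched as a token bucket. The gateway meters actual LLM tokens per consumer; in this simplified standalone model, `spend` just draws down a per-minute budget that refills over time:

```python
import time

class TokenBudget:
    """Minimal token-bucket sketch of per-consumer, token-level rate limiting."""

    def __init__(self, tokens_per_minute):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.refill_rate = tokens_per_minute / 60.0  # tokens regained per second
        self.last = time.monotonic()

    def spend(self, tokens):
        # Refill the budget in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.available = min(self.capacity,
                             self.available + (now - self.last) * self.refill_rate)
        self.last = now
        if tokens > self.available:
            return False   # over budget: reject, or fall back to another model
        self.available -= tokens
        return True

budget = TokenBudget(tokens_per_minute=1000)
print(budget.spend(800))   # True: within budget
print(budget.spend(800))   # False: budget exhausted for now
```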
5. Content Safety & Compliance
Integrates Alibaba Cloud content‑security services to filter harmful inputs/outputs and enforce industry‑specific compliance (finance, healthcare, etc.).
6. Semantic Caching
Caches LLM responses in Redis to reduce token costs and latency for repetitive queries.
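The core idea — reuse a cached answer when a new query is semantically close to a previous one — can be sketched in a few lines. The toy character-count embedding below is purely illustrative; a real deployment uses an embedding model and stores vectors in Redis:

```python
import math

def embed(text):
    """Toy bag-of-letters embedding; a real cache would call an embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached response when a query is similar enough to a stored one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, response); kept in Redis in the real setup

    def get(self, query):
        qv = embed(query)
        for ev, resp in self.entries:
            if cosine(qv, ev) >= self.threshold:
                return resp
        return None  # cache miss: forward the request to the LLM

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is quantum computing", "cached answer")
print(cache.get("what is quantum computing?"))  # hit: near-identical query
print(cache.get("write a poem"))                # miss: prints None
```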
7. Internet Search + Full‑Text Retrieval
Enhances LLMs with search‑augmented generation, retrieving full web pages rather than just snippets.
8. Model Observability
Provides detailed metrics (token consumption per consumer/model, rate‑limit stats, cache hits, security alerts) and integrates with Alibaba Cloud logging and tracing services.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.