How to Build an AI Cache WASM Plugin for Higress Gateway
This guide explains how to set up a Higress gateway, compile a WebAssembly AI cache plugin, integrate Redis and DashVector for semantic caching of large‑model requests, and provides complete configuration and code examples for end‑to‑end deployment.
Background
The Higress AI Gateway Challenge requires a WebAssembly (Wasm) plugin that intercepts the four HTTP phases (requestHeader, requestBody, responseHeader, responseBody) and implements a semantic cache for large‑model (OpenAI‑style) queries to reduce token usage.
Gateway Environment Setup
Online enterprise Higress setup
Apply for a free trial and create the required resources: Qwen model endpoint, Redis service, and DashScope service.
Create service entries for the model, Redis, and DashVector.
Configure routing to forward requests to the Qwen model.
Enable the ai-proxy plugin from the Higress plugin marketplace and provide the model API key and token.
Local testing environment
Run Higress together with LobeChat using Docker Compose. Example docker‑compose.yml:
version: '3.9'
networks:
  higress-net:
    external: false
services:
  higress:
    image: registry.cn-hangzhou.aliyuncs.com/ztygw/aio-redis:1.4.1-rc.1
    environment:
      - GATEWAY_COMPONENT_LOG_LEVEL=misc:error,wasm:debug
      - CONFIG_TEMPLATE=ai-proxy
      - DEFAULT_AI_SERVICE=qwen
      - DASHSCOPE_API_KEY=[YOUR_KEY]
    ports:
      - "9080:8080"
      - "9001:8001"
    volumes:
      - ./data:/data
      - ./log:/var/log/higress/
    restart: always
  lobechat:
    image: lobehub/lobe-chat
    environment:
      - CODE=123456ed
      - OPENAI_API_KEY=unused
      - OPENAI_PROXY_URL=http://higress:8080/v1
    ports:
      - "3210:3210"
    restart: always
Compiling and Uploading the AI Cache Plugin
Clone the Higress repository and build the Wasm plugin:
git clone https://github.com/alibaba/higress.git
cd higress/plugins/wasm-go
PLUGIN_NAME=ai-cache EXTRA_TAGS=proxy_wasm_version_0_2_100 make build
Push the resulting plugin image to a container registry and update the local test configuration to reference the new image version.
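Once the gateway is running with the new plugin image, one way to check the cache is to send the same prompt twice and compare response times. The following Go client is a minimal sketch under assumptions: the base URL http://localhost:9080/v1 is inferred from the port mapping in the compose file, the model name qwen-turbo is a placeholder, and newChatRequest is a hypothetical helper, not part of Higress.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest builds an OpenAI-style chat completion request.
// (Hypothetical helper for this sketch.)
func newChatRequest(baseURL, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Assumption: host port 9080 maps to the gateway's port 8080.
	const baseURL = "http://localhost:9080/v1"
	client := &http.Client{Timeout: 60 * time.Second}
	for attempt := 1; attempt <= 2; attempt++ {
		req, err := newChatRequest(baseURL, "qwen-turbo", "What is Higress?")
		if err != nil {
			fmt.Println("building request failed:", err)
			return
		}
		start := time.Now()
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		// Drain the body so the timing covers the full response.
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		fmt.Printf("attempt %d: status=%s elapsed=%s\n", attempt, resp.Status, time.Since(start))
	}
}
```

If the cache is working, the second attempt would be expected to return noticeably faster than the first, since it should not need a round trip to the model.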
Text Vector Request and Cache Logic
The plugin processes each query with the following workflow:
1. Match the query string against keys stored in Redis (redisSearchHandler). If an exact match is found, return the cached response.
2. If there is no exact match, call the text‑embedding API to obtain a query_embedding.
3. Run an approximate nearest neighbor (ANN) search for the embedding in DashVector. If a similar key with a distance lower than 0.1 is found, retrieve its result from Redis.
4. If no suitable match is found, store the new query_embedding in DashVector for future use.
5. During the response phase, insert the query and the LLM result into Redis for caching.
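The workflow above can be sketched without any external services. In the following Go example, a map stands in for Redis and a slice with a linear scan stands in for DashVector's ANN search. Cosine distance is an assumption (the source only states a 0.1 distance threshold), and semanticCache, toyEmbed, and the other names are hypothetical illustrations, not the plugin's own types.

```go
package main

import (
	"fmt"
	"math"
)

// entry pairs a stored query with its embedding (stand-in for a DashVector doc).
type entry struct {
	query string
	vec   []float64
}

// semanticCache is an in-memory stand-in: exact plays Redis, index plays DashVector.
type semanticCache struct {
	exact     map[string]string
	index     []entry
	threshold float64
	embed     func(string) []float64
}

// cosineDistance returns 1 - cosine similarity, or 1 for unusable input.
func cosineDistance(a, b []float64) float64 {
	if len(a) != len(b) {
		return 1
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 1
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// lookup tries an exact match, then the nearest stored embedding,
// and registers the new embedding when nothing is close enough.
func (c *semanticCache) lookup(query string) (string, bool) {
	if v, ok := c.exact[query]; ok {
		return v, true
	}
	vec := c.embed(query)
	best, bestDist := "", math.Inf(1)
	for _, e := range c.index {
		if d := cosineDistance(vec, e.vec); d < bestDist {
			best, bestDist = e.query, d
		}
	}
	if bestDist < c.threshold {
		v, ok := c.exact[best]
		return v, ok
	}
	c.index = append(c.index, entry{query: query, vec: vec})
	return "", false
}

// store is the response-phase step: cache the model's answer under the query.
func (c *semanticCache) store(query, answer string) { c.exact[query] = answer }

// toyEmbed returns fixed vectors so the example needs no embedding service.
func toyEmbed(q string) []float64 {
	switch q {
	case "what is higress":
		return []float64{1, 0}
	case "what is higress?":
		return []float64{0.99, 0.05}
	default:
		return []float64{0, 1}
	}
}

func main() {
	c := &semanticCache{exact: map[string]string{}, threshold: 0.1, embed: toyEmbed}
	_, hit := c.lookup("what is higress")
	fmt.Println("first lookup hit:", hit)
	c.store("what is higress", "a cloud-native API gateway")
	answer, hit := c.lookup("what is higress?")
	fmt.Println("similar lookup hit:", hit, "answer:", answer)
}
```

The real plugin performs each of these steps through asynchronous callbacks rather than direct function calls, so this sketch only mirrors the decision logic, not the control flow.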
Key code snippets:
// Redis lookup: exact-match search on the prefixed cache key.
func redisSearchHandler(key string, ctx wrapper.HttpContext, config PluginConfig, log wrapper.Log, stream bool, ifUseEmbedding bool) error {
	err := config.redisClient.Get(config.CacheKeyPrefix+key, func(response resp.Value) {
		if err := response.Error(); err == nil && !response.IsNull() {
			log.Warnf("cache hit, key:%s", key)
			handleCacheHit(key, response, stream, ctx, config, log)
		} else {
			log.Warnf("cache miss, key:%s", key)
			if ifUseEmbedding {
				// Embedding lookup is enabled: hand the miss to handleCacheMiss.
				handleCacheMiss(key, err, response, ctx, config, log, key, stream)
			} else {
				// No embedding lookup requested: resume the paused request.
				proxywasm.ResumeHttpRequest()
			}
		}
	})
	return err
}
// Fetch embeddings
func fetchAndProcessEmbeddings(key string, ctx wrapper.HttpContext, config PluginConfig, log wrapper.Log, queryString string, stream bool) {
	Emb_url, Emb_requestBody, Emb_headers := ConstructTextEmbeddingParameters(&config, log, []string{queryString})
	config.DashVectorInfo.DashScopeClient.Post(Emb_url, Emb_headers, Emb_requestBody,
		func(statusCode int, responseHeaders http.Header, responseBody []byte) {
			if statusCode != 200 {
				// Embedding call failed: clear the stored embedding and resume the request.
				ctx.SetContext(QueryEmbeddingKey, nil)
				proxywasm.ResumeHttpRequest()
				return
			}
			processFetchedEmbeddings(key, responseBody, ctx, config, log, stream)
		}, 10000)
}
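The body of ConstructTextEmbeddingParameters is not shown in this article. As a rough illustration of what such a request and its response might look like, the sketch below assumes the JSON layout of DashScope's public text-embedding API. The model name text-embedding-v1 and the helper names buildEmbeddingBody and parseFirstEmbedding are assumptions for this example, and the repository's actual implementation may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Request layout assumed from DashScope's text-embedding API.
type embeddingInput struct {
	Texts []string `json:"texts"`
}

type embeddingParams struct {
	TextType string `json:"text_type"`
}

type embeddingRequest struct {
	Model      string          `json:"model"`
	Input      embeddingInput  `json:"input"`
	Parameters embeddingParams `json:"parameters"`
}

// Response layout assumed from the same API; only the needed fields are mapped.
type embeddingResponse struct {
	Output struct {
		Embeddings []struct {
			TextIndex int       `json:"text_index"`
			Embedding []float64 `json:"embedding"`
		} `json:"embeddings"`
	} `json:"output"`
}

// buildEmbeddingBody serializes a query-type embedding request. (Hypothetical helper.)
func buildEmbeddingBody(model string, texts []string) ([]byte, error) {
	return json.Marshal(embeddingRequest{
		Model:      model,
		Input:      embeddingInput{Texts: texts},
		Parameters: embeddingParams{TextType: "query"},
	})
}

// parseFirstEmbedding extracts the first embedding vector from a response body.
func parseFirstEmbedding(body []byte) ([]float64, error) {
	var resp embeddingResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if len(resp.Output.Embeddings) == 0 {
		return nil, errors.New("no embeddings in response")
	}
	return resp.Output.Embeddings[0].Embedding, nil
}

func main() {
	body, err := buildEmbeddingBody("text-embedding-v1", []string{"What is Higress?"})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(body))
}
```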
// Vector search and result handling
func performQueryAndRespond(key string, text_embedding []float64, ctx wrapper.HttpContext, config PluginConfig, log wrapper.Log, stream bool) {
	vector_url, vector_request, vector_headers, err := ConstructEmbeddingQueryParameters(config, text_embedding)
	if err != nil {
		log.Errorf("Failed to perform query: %v", err)
		proxywasm.ResumeHttpRequest()
		return
	}
	config.DashVectorInfo.DashVectorClient.Post(vector_url, vector_headers, vector_request,
		func(statusCode int, responseHeaders http.Header, responseBody []byte) {
			query_resp, err := ParseQueryResponse(responseBody)
			if err != nil || len(query_resp.Output) == 0 {
				// Nothing usable came back: register this query's embedding.
				uploadQueryEmbedding(ctx, config, log, key, text_embedding)
				return
			}
			most_similar_key := query_resp.Output[0].Fields["query"].(string)
			most_similar_score := query_resp.Output[0].Score
			if most_similar_score < 0.1 {
				// Close enough: look up the cached answer stored under the similar query.
				redisSearchHandler(most_similar_key, ctx, config, log, stream, false)
			} else {
				uploadQueryEmbedding(ctx, config, log, key, text_embedding)
				proxywasm.ResumeHttpRequest()
			}
		}, 100000)
}
Configuration Files
YAML configuration required by the plugin (replace placeholder values with actual credentials):
Dash:
  dashScopeKey: "YOUR_DASHSCOPE_KEY"
  dashScopeServiceName: "qwen"
  dashVectorCollection: "YOUR_CLUSTER_NAME"
  dashVectorEnd: "YOUR_VECTOR_END"
  dashVectorKey: "YOUR_DASHVECTOR_KEY"
  dashVectorServiceName: "DashVector.dns"
  sessionID: "XXX"
redis:
  serviceName: "redis.static"
  timeout: 2000
External Service Registration
Declare external services in the plugin configuration and instantiate them in ParseConfig:
type PluginConfig struct {
	DashVectorClient wrapper.HttpClient  `yaml:"-" json:"-"`
	DashScopeClient  wrapper.HttpClient  `yaml:"-" json:"-"`
	redisClient      wrapper.RedisClient `yaml:"-" json:"-"`
}
func ParseConfig(c *Config) {
	c.DashVectorInfo.DashVectorClient = wrapper.NewClusterClient(wrapper.DnsCluster{
		ServiceName: c.DashVectorInfo.DashVectorServiceName,
		Port:        443,
		Domain:      c.DashVectorInfo.DashVectorAuthApiEnd,
	})
	c.DashVectorInfo.DashScopeClient = wrapper.NewClusterClient(wrapper.DnsCluster{
		ServiceName: c.DashVectorInfo.DashScopeServiceName,
		Port:        443,
		Domain:      "dashscope.aliyuncs.com",
	})
}
Limitations
The cache logic must run in callbacks that return types.Action. Streaming response handlers (e.g., onHttpResponseBody) cannot block, so additional signalling is required for full asynchronous support.
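One generic way to picture the signalling this requires is a gate that buffers streamed chunks while an asynchronous operation is outstanding and releases them when its callback fires. The Go sketch below is an illustration only: Action, streamGate, and its methods are local stand-ins invented for this example, not proxy-wasm SDK or Higress types, and the plugin itself may handle this differently.

```go
package main

import "fmt"

// Action is a local stand-in for the idea behind types.Action.
type Action int

const (
	ActionContinue Action = iota
	ActionPause
)

// streamGate buffers response chunks while an asynchronous operation
// is still in flight, without ever blocking the caller.
type streamGate struct {
	pending  bool
	buffered [][]byte
	out      [][]byte // chunks released downstream, in order
}

// begin marks an asynchronous operation as started.
func (g *streamGate) begin() { g.pending = true }

// onChunk is called for every streamed chunk and returns immediately.
func (g *streamGate) onChunk(chunk []byte) Action {
	if g.pending {
		g.buffered = append(g.buffered, chunk)
		return ActionPause
	}
	g.out = append(g.out, chunk)
	return ActionContinue
}

// done is the completion signal: it flushes buffered chunks in arrival order.
func (g *streamGate) done() {
	g.pending = false
	g.out = append(g.out, g.buffered...)
	g.buffered = nil
}

func main() {
	g := &streamGate{}
	g.begin()
	fmt.Println("paused:", g.onChunk([]byte("first")) == ActionPause)
	g.done() // the async callback fires
	fmt.Println("continued:", g.onChunk([]byte("second")) == ActionContinue)
	fmt.Println("released chunks:", len(g.out))
}
```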
Repository
Full source code: https://github.com/Suchun-sv/ai-cache-Demo
Alibaba Cloud Native
