
How to Enable LLM Traffic Observability with Alibaba Cloud Service Mesh (ASM)

This guide explains how to use Alibaba Cloud Service Mesh (ASM) to add infrastructure‑level observability for large language model (LLM) traffic, covering custom access‑log fields, new Prometheus metrics for token usage, and adding model dimensions to native Istio metrics, with step‑by‑step commands and configuration examples.

Alibaba Cloud Infrastructure

Jan 3, 2025

Background

Effective observability is essential for building efficient, stable distributed applications, and it becomes even more critical for LLM‑driven services. Over time, observability logic has moved from manual code instrumentation to framework‑level support and finally to the infrastructure layer provided by service meshes.

ASM‑Based LLM Observability

Alibaba Cloud Service Mesh (ASM) now offers infrastructure‑level LLM traffic management and observability without requiring a specific language SDK or changes to application call patterns. By configuring ASM, users can obtain transparent traffic routing and detailed observability data, which is crucial for both service stability and cost optimization.

Observability Features Provided by ASM

ASM’s observability consists of three parts: access logs, monitoring metrics, and tracing. By default, the logs and metrics do not expose LLM‑specific information such as the model name or token counts, so ASM enhances access logs and metrics for LLM traffic.

1. Enhanced Access Logs

ASM allows custom access‑log formats that can include the following fields:

request_model                FILTER_STATE(wasm.asm.llmproxy.request_model:PLAIN)
request_prompt_tokens        FILTER_STATE(wasm.asm.llmproxy.request_prompt_tokens:PLAIN)
request_completion_tokens   FILTER_STATE(wasm.asm.llmproxy.request_completion_tokens:PLAIN)

These fields represent the model used for the request, the number of input (prompt) tokens, and the number of output (completion) tokens. Example log entries after formatting:

{
  "duration": "7640",
  "response_code": "200",
  "authority_for": "dashscope.aliyuncs.com",
  "request_model": "qwen-1.8b-chat",
  "request_prompt_tokens": "3",
  "request_completion_tokens": "55"
}

{
  "duration": "2759",
  "response_code": "200",
  "authority_for": "dashscope.aliyuncs.com",
  "request_model": "qwen-turbo",
  "request_prompt_tokens": "11",
  "request_completion_tokens": "90"
}

These logs can be collected by Alibaba Cloud Log Service for alerting and dashboarding.
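As a rough illustration of the kind of analysis these fields allow, the Python sketch below totals prompt and completion tokens per model from JSON access-log lines. It assumes each log line is a single JSON object using the field names shown above. The function name summarize_tokens is hypothetical and not part of ASM.

```python
import json
from collections import defaultdict


def summarize_tokens(log_lines):
    """Sum prompt and completion tokens per model from JSON access-log lines."""
    totals = defaultdict(lambda: {"prompt": 0, "completion": 0})
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as proxy startup messages
        if not isinstance(entry, dict):
            continue
        model = entry.get("request_model")
        if not model or model == "-":
            continue  # not an LLM request (assumption: "-" marks a missing value)
        # Token counts are logged as strings, so convert before adding.
        try:
            prompt = int(entry.get("request_prompt_tokens", 0))
            completion = int(entry.get("request_completion_tokens", 0))
        except (TypeError, ValueError):
            continue
        totals[model]["prompt"] += prompt
        totals[model]["completion"] += completion
    return dict(totals)


# The two sample entries from this section, one JSON object per line.
sample = [
    '{"response_code": "200", "request_model": "qwen-1.8b-chat", '
    '"request_prompt_tokens": "3", "request_completion_tokens": "55"}',
    '{"response_code": "200", "request_model": "qwen-turbo", '
    '"request_prompt_tokens": "11", "request_completion_tokens": "90"}',
]
print(summarize_tokens(sample))
```

In practice the input could be the output of the kubectl logs command for the istio-proxy container, read line by line.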

2. New Prometheus Metrics for Token Consumption

ASM adds two metrics:

asm_llm_proxy_prompt_tokens: number of input (prompt) tokens.

asm_llm_proxy_completion_tokens: number of output (completion) tokens.

Both metrics carry four default dimensions:

llmproxy_source_workload

llmproxy_source_workload_namespace

llmproxy_destination_service

llmproxy_model

To enable them, create a ConfigMap that defines the tag extraction rules and patch the workload to use the ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: asm-llm-proxy-bootstrap-config
data:
  custom_bootstrap.json: |
    "stats_config": {
      "stats_tags":[
        {"tag_name":"llmproxy_source_workload","regex":"(\\|llmproxy_source_workload=([^|]*))"},
        {"tag_name":"llmproxy_source_workload_namespace","regex":"(\\|llmproxy_source_workload_namespace=([^|]*))"},
        {"tag_name":"llmproxy_destination_service","regex":"(\\|llmproxy_destination_service=([^|]*))"},
        {"tag_name":"llmproxy_model","regex":"(\\|llmproxy_model=([^|]*))"}
      ]
    }
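To make the tag extraction rules more concrete, the Python sketch below applies one of the regexes above to a stat name. It follows Envoy's documented convention for stats_tags: the first capture group is removed from the stat name, and the second becomes the tag value. The raw stat name used here is a hypothetical example; the article does not show the internal name format that ASM emits.

```python
import re

# Tag regex from the ConfigMap, after JSON unescaping ("\\|" becomes "\|").
MODEL_TAG = re.compile(r"(\|llmproxy_model=([^|]*))")


def extract_tag(stat_name, pattern):
    """Return (cleaned_name, tag_value), mimicking the two-group tag rule."""
    match = pattern.search(stat_name)
    if match is None:
        return stat_name, None
    # Group 1 is cut out of the stat name; group 2 becomes the tag value.
    cleaned = stat_name[:match.start(1)] + stat_name[match.end(1):]
    return cleaned, match.group(2)


# Hypothetical raw stat name, for illustration only.
raw = "asm_llm_proxy_prompt_tokens|llmproxy_model=qwen-turbo|llmproxy_source_workload=sleep"
print(extract_tag(raw, MODEL_TAG))
```

Applying the four rules in turn would strip every |key=value segment, leaving the bare metric name plus four labels.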

Apply the ConfigMap and patch the deployment:

kubectl apply -f asm-llm-proxy-bootstrap-config.yaml
kubectl patch deployment sleep -p '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/bootstrapOverride":"asm-llm-proxy-bootstrap-config"}}}}}'

Query the metrics from the sidecar:

kubectl exec deployments/sleep -it -c istio-proxy -- curl localhost:15090/stats/prometheus | grep llmproxy

Sample output shows token counts per model:

asm_llm_proxy_prompt_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-1.8b-chat"} 3
asm_llm_proxy_completion_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-turbo"} 85
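For quick checks without a Prometheus server, output like the above can be aggregated directly. The following Python sketch parses the text exposition lines and sums token counts per metric and model. It is a simplified parser that assumes label values contain no closing braces or escaped quotes. The function name tokens_by_model is illustrative.

```python
import re
from collections import defaultdict

# Matches lines of the form: metric_name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def tokens_by_model(metrics_text):
    """Map (metric name, model) to the summed sample value."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or not m.group("name").startswith("asm_llm_proxy_"):
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        model = labels.get("llmproxy_model", "unknown")
        totals[(m.group("name"), model)] += float(m.group("value"))
    return dict(totals)


# The two sample lines shown above.
text = '''
asm_llm_proxy_prompt_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-1.8b-chat"} 3
asm_llm_proxy_completion_tokens{llmproxy_source_workload="sleep",llmproxy_source_workload_namespace="default",llmproxy_destination_service="dashscope.aliyuncs.com",llmproxy_model="qwen-turbo"} 85
'''
print(tokens_by_model(text))
```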

3. Adding Model Dimension to Native Istio Metrics

Native Istio metrics (e.g., istio_requests_total) lack LLM‑specific dimensions. ASM allows adding a custom dimension “model” to any metric. Using the REQUEST_COUNT metric as an example, the UI steps are:

Open the observability configuration page.

Select the metric (REQUEST_COUNT) and edit its dimensions.

Add a new dimension named model with the value source filter_state["wasm.asm.llmproxy.request_model"].

After applying the change, the model appears in the Istio request metric:

istio_requests_total{... ,model="qwen-1.8b-chat"} 1
istio_requests_total{... ,model="qwen-turbo"} 1

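This enables analysis such as per‑model success rates from the request counter. Per‑model average latency requires adding the same dimension to the request duration metric as well.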
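As one possible way to compute a per-model success rate, the Python sketch below works on istio_requests_total samples that have already been parsed into label dictionaries and values. It relies on the standard Istio response_code label together with the custom model label. The sample values are made up for illustration, and the function name success_rate_by_model is hypothetical.

```python
from collections import defaultdict


def success_rate_by_model(samples):
    """samples: iterable of (labels dict, counter value) for istio_requests_total."""
    ok = defaultdict(float)
    total = defaultdict(float)
    for labels, value in samples:
        model = labels.get("model")
        if not model:
            continue  # request without the custom model dimension
        total[model] += value
        # Treat any 2xx response code as a success.
        if labels.get("response_code", "").startswith("2"):
            ok[model] += value
    return {model: ok[model] / total[model] for model in total if total[model] > 0}


# Illustrative values only, not measurements from the article.
samples = [
    ({"model": "qwen-turbo", "response_code": "200"}, 9),
    ({"model": "qwen-turbo", "response_code": "500"}, 1),
    ({"model": "qwen-1.8b-chat", "response_code": "200"}, 4),
]
print(success_rate_by_model(samples))
```

Because these counters are cumulative, a production dashboard would more likely compute the ratio over a time window in the monitoring backend; this sketch only shows the grouping logic.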

Testing Commands

Send LLM requests via the sidecar:

kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --data '{"messages":[{"role":"user","content":"Please introduce yourself"}]}'

kubectl exec deployment/sleep -it -- curl --location 'http://dashscope.aliyuncs.com' \
  --header 'Content-Type: application/json' \
  --header 'user-type: subscriber' \
  --data '{"messages":[{"role":"user","content":"Please introduce yourself"}]}'

View the last two lines of the access log:

kubectl logs deployments/sleep -c istio-proxy | tail -2

Query Prometheus metrics for token usage or request counts as shown above.

Conclusion

This article demonstrates how ASM extends its existing HTTP/TCP observability stack with LLM‑specific logs and metrics, providing fine‑grained insight into model usage and token consumption. These capabilities form the foundation for more advanced features such as LLM request caching and token‑based rate limiting.


