Mastering Observability in Alibaba Cloud Service Mesh ASM: Logs, Metrics, and Tracing
This guide explains how Alibaba Cloud Service Mesh ASM enables comprehensive observability for cloud‑native applications by configuring telemetry for logs, metrics, and distributed tracing, offering best‑practice recommendations, YAML examples, and integration with Prometheus, ARMS, and external tracing tools.
Observability Overview
As application systems grow more complex, keeping them running steadily becomes harder, and individual parts of a system may degrade. Observability, the ability to understand a system's internal state by examining its external outputs, helps teams detect failures, reduce mean time to recovery, and keep services reliable.
ASM’s Unified Observability Model
Alibaba Cloud Service Mesh (ASM) provides a standardized way to generate and collect converged observability data, supporting cloud‑native applications. It leverages the Service Mesh data‑plane proxy (Envoy) to capture logs, metrics, and traces.
Built‑in Best Practices for Telemetry CRD
Only one Telemetry object named default is allowed in the istio-system namespace.
Each namespace may have a single Telemetry with an empty selector and name default.
Use workload selectors to create workload‑specific overrides.
If two Telemetry objects select the same workload, the resulting behavior is undefined, so each workload should be selected by at most one of them.
When the global Telemetry lacks metric configuration, metrics are disabled by default.
Initially ASM enables only SERVER‑side metrics to limit storage cost; CLIENT‑side metrics must be enabled manually if needed.
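The scoping rules above can be sketched with two resources: one mesh-wide default and one workload-specific override. This is a minimal illustration, not an ASM-generated configuration. The `demo` namespace, the `reviews-tracing` name, the `app: reviews` label, and the sampling values are hypothetical, and the `telemetry.istio.io/v1alpha1` API version is an assumption to verify against the Istio version behind your ASM instance. The provider names are taken from the examples later in this article.

```yaml
# Mesh-wide defaults: the single object named "default" in istio-system.
apiVersion: telemetry.istio.io/v1alpha1   # assumed API version
kind: Telemetry
metadata:
  name: default
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: envoy
  metrics:
  - providers:
    - name: prometheus
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 1.0          # hypothetical mesh-wide rate
---
# Workload-specific override: applies only to pods matching the selector.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: reviews-tracing                    # hypothetical name
  namespace: demo                          # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: reviews                         # hypothetical workload label
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 100.0        # trace every request for this workload
```

Because the second resource carries a selector, it affects only the matching workload, and the rest of the mesh keeps following the `default` object.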
Log Collection
Aggregating service logs to a central system simplifies management and search. ASM offers log filtering and formatting. Services should write logs to stdout / stderr, and a log agent forwards them.
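For proxy access logs, the same Telemetry resource can switch logging on for a namespace and off again for a single noisy workload. The sketch below assumes a hypothetical `demo` namespace, a hypothetical `app: batch-job` label, the `v1alpha1` API version, and a provider named `envoy` (the name used in this article's filtering example).

```yaml
# Namespace-wide: empty selector, named "default", enables access logs.
apiVersion: telemetry.istio.io/v1alpha1   # assumed API version
kind: Telemetry
metadata:
  name: default
  namespace: demo                          # hypothetical namespace
spec:
  accessLogging:
  - providers:
    - name: envoy
---
# Workload-specific: suppress access logs for one chatty workload.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: batch-job-no-access-log            # hypothetical name
  namespace: demo
spec:
  selector:
    matchLabels:
      app: batch-job                       # hypothetical workload label
  accessLogging:
  - providers:
    - name: envoy
    disabled: true
```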
Log format rule (example):
    envoyFileAccessLog:
      logFormat:
        text: '{"bytes_received":"%BYTES_RECEIVED%","bytes_sent":"%BYTES_SENT%","duration":"%DURATION%","method":"%REQ(:METHOD)%","path":"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%","response_code":"%RESPONSE_CODE%","user_agent":"%REQ(USER-AGENT)%","x_forwarded_for":"%REQ(X-FORWARDED-FOR)%","app_service_name":"%UPSTREAM_CLUSTER%"}'
      path: /dev/stdout
Log filtering example (only log responses with status code ≥ 400):
    accessLogging:
    - disabled: false
      filter:
        expression: response.code >= 400
      providers:
      - name: envoy
Control‑Plane Log and Alerting
ASM can collect control‑plane logs (e.g., configuration push failures) and generate alerts. Enabling these alerts helps detect misconfigurations that could render sidecar proxies or gateways unavailable after a restart.
Metrics Collection
Metrics are a core observability dimension. Istio uses Prometheus; each Envoy sidecar emits numerous metrics. ASM provides a UI to define metric generation rules via the Telemetry CRD.
Example metric overrides (enabling SERVER‑side metrics, disabling CLIENT‑side by default):
    metrics:
    - overrides:
      - disabled: true
        match:
          metric: ALL_METRICS
          mode: CLIENT
      - disabled: false
        match:
          metric: ALL_METRICS
          mode: SERVER
      providers:
      - name: prometheus
First‑time activation only enables SERVER‑side metrics to control cost; users can enable CLIENT‑side metrics as needed. Certain SERVER‑side metrics (e.g., REQUEST_COUNT, TCP_SENT_BYTES) are required for mesh topology visualisation.
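When CLIENT-side metrics are needed for only a few services, one option is to enable them per workload instead of mesh-wide. The following is a sketch under assumptions: the `demo` namespace, the `checkout-client-metrics` name, and the `app: checkout` label are hypothetical, and the `v1alpha1` API version should be checked against your ASM instance.

```yaml
apiVersion: telemetry.istio.io/v1alpha1   # assumed API version
kind: Telemetry
metadata:
  name: checkout-client-metrics            # hypothetical name
  namespace: demo                          # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: checkout                        # hypothetical workload label
  metrics:
  - providers:
    - name: prometheus
    overrides:
    # Re-enable client-side (outbound) metrics for this workload only.
    - match:
        metric: ALL_METRICS
        mode: CLIENT
      disabled: false
```

Scoping the override to one workload limits the extra time series that Prometheus has to store.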
Service Level Objectives (SLO) and Indicators (SLI)
ASM supports SLO definition through a UI that auto‑generates Prometheus rules. Typical SLI types include availability (HTTP 2xx/3xx) and latency (custom thresholds). Example SLOs:
Per-minute average QPS > 100k
Latency < 500 ms for 99% of requests
99% of bandwidth > 200 MB/s
Generated alerts appear in Alertmanager when SLO thresholds are breached.
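To make the idea of SLO-driven Prometheus rules concrete, here is a hand-written sketch of an availability SLI and a latency alert. It is not the output ASM generates. The rule names, the `reviews` service, the 5-minute windows, and the 99% availability target are assumptions for illustration. The metric names `istio_requests_total` and `istio_request_duration_milliseconds_bucket` are standard Istio metrics, and the 500 ms threshold follows the example SLO above.

```yaml
groups:
- name: reviews-slo-sketch                 # hypothetical rule group
  rules:
  # SLI: share of server-side requests answered with HTTP 2xx/3xx.
  - record: service:reviews_availability:ratio_rate5m
    expr: |
      sum(rate(istio_requests_total{reporter="destination",destination_service_name="reviews",response_code=~"2..|3.."}[5m]))
      /
      sum(rate(istio_requests_total{reporter="destination",destination_service_name="reviews"}[5m]))
  # Alert when availability drops below the assumed 99% objective.
  - alert: ReviewsAvailabilityBelowSLO
    expr: service:reviews_availability:ratio_rate5m < 0.99
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "reviews availability is below the 99% objective"
  # Alert when the 99th percentile latency exceeds 500 ms.
  - alert: ReviewsLatencyAboveSLO
    expr: |
      histogram_quantile(0.99,
        sum by (le) (rate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_service_name="reviews"}[5m]))
      ) > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "reviews p99 latency is above 500 ms"
```

Both alerts rely on SERVER-side (`reporter="destination"`) metrics, which matches the default that ASM enables.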
Distributed Tracing
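Tracing provides end-to-end visibility into request flows. ASM integrates with Jaeger or Zipkin through the Telemetry CRD. Sidecars cannot correlate a workload's inbound and outbound requests on their own, so applications must copy the tracing headers (for example x-request-id, x-b3-traceid, and x-b3-spanid) from each incoming request onto the outgoing requests it triggers. Without this propagation, spans cannot be joined into a single trace.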
Example tracing configuration:
    tracing:
    - customTags:
        mytag1:
          literal:
            value: fixedvalue
        mytag2:
          header:
            name: myheader1
            defaultValue: value1
        mytag3:
          environment:
            name: myenv1
            defaultValue: value1
      providers:
      - name: zipkin
      randomSamplingPercentage: 90
Collected trace data can be sent to Alibaba Cloud’s managed tracing service or to self‑hosted solutions like Zipkin or Jaeger.
Mesh Topology Visualization
ASM includes a built‑in mesh topology view that visualises services and their configurations, relying on the metrics described above.
Conclusion
Observability is essential for cloud‑native applications. By using ASM’s unified telemetry model—covering logs, metrics, and tracing—teams can achieve non‑intrusive, standardized data collection, reduce operational overhead, and improve reliability and performance of their service mesh deployments.
Alibaba Cloud Native
