How to Quickly Pinpoint Error and Slow Traces with Alibaba Cloud ARMS
This guide explains how Alibaba Cloud's ARMS error/slow trace analysis feature can automatically compare abnormal and normal traces to identify root causes such as host, interface, slow SQL, or message‑queue issues, providing step‑by‑step examples for real‑world e‑commerce scenarios.
Engineer "A" takes over operations for a core e‑commerce system and runs into intermittent errors and occasional long‑latency calls that are hard to reproduce. Traditional debugging relies on expert intuition, which is impractical for a large, long‑lived codebase with hundreds of contributors.
The solution is Alibaba Cloud Application Real‑Time Monitoring Service (ARMS) error/slow trace analysis. The feature automatically compares abnormal (error/slow) traces with normal traces across multiple dimensions, revealing shared characteristics without requiring prior expertise.
Core Principles
"Compare error/slow traces with normal traces in the same system, identify distinguishing features, and guide users to explore until the root cause is found."
ARMS can analyze any trace dimension that the call chain records, enabling detection of host anomalies, interface failures, slow SQL, message‑queue problems, and more.
Typical Workflow
Collect a batch of error/slow traces and a comparable batch of normal traces.
ARMS runs a statistical comparison of the two batches on each dimension (service name, IP, span name, etc.).
Dimensions with significant differences are highlighted as potential root causes.
Example: 1,000 error traces and 1,000 normal traces show that almost all error traces involve serviceName="mall-gateway", IP 10.0.0.47, and span /components/api/v1/mall/product. These components become the focus of further investigation.
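The comparison in the workflow above can be approximated offline. The following is a minimal sketch, not ARMS's actual algorithm: it assumes traces have been exported as plain dictionaries, and it flags any dimension value whose share among abnormal traces exceeds its share among normal traces by a chosen gap. The function name rank_dimensions, the min_gap threshold, and every value other than mall-gateway, 10.0.0.47, and /components/api/v1/mall/product are placeholders invented for illustration.

```python
from collections import Counter


def rank_dimensions(abnormal, normal, dimensions, min_gap=0.5):
    """Return (dimension, value, abnormal_share, normal_share) tuples whose
    share among abnormal traces exceeds the share among normal traces by
    at least min_gap, largest gap first."""
    findings = []
    for dim in dimensions:
        abn_counts = Counter(t.get(dim) for t in abnormal)
        norm_counts = Counter(t.get(dim) for t in normal)
        for value, count in abn_counts.items():
            abn_share = count / len(abnormal)
            norm_share = norm_counts[value] / len(normal)
            if abn_share - norm_share >= min_gap:
                findings.append((dim, value, abn_share, norm_share))
    findings.sort(key=lambda f: f[2] - f[3], reverse=True)
    return findings


# Synthetic data (assumption): 950 of 1,000 abnormal traces share one
# service, host, and span; normal traces are spread evenly over four values.
suspect = {
    "serviceName": "mall-gateway",
    "ip": "10.0.0.47",
    "spanName": "/components/api/v1/mall/product",
}
other = {"serviceName": "mall-gateway", "ip": "10.0.0.12", "spanName": "/placeholder/b"}
abnormal = [suspect] * 950 + [other] * 50

services = ["mall-gateway", "service-b", "service-c", "service-d"]
ips = ["10.0.0.47", "10.0.0.12", "10.0.0.13", "10.0.0.14"]
span_names = ["/components/api/v1/mall/product", "/placeholder/b",
              "/placeholder/c", "/placeholder/d"]
normal = [
    {"serviceName": services[i % 4], "ip": ips[i % 4], "spanName": span_names[i % 4]}
    for i in range(1000)
]

findings = rank_dimensions(abnormal, normal, ["serviceName", "ip", "spanName"])
for dim, value, abn_share, norm_share in findings:
    print(f"{dim}={value}: abnormal {abn_share:.0%}, normal {norm_share:.0%}")
```

A fixed share gap is the simplest possible criterion. With small or unbalanced batches, a proper significance test would be a safer choice.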
Step‑by‑Step Demo
Error Trace Investigation
Step 1: Identify a time window where HTTP errors spike (e.g., 13:21‑13:28 for mall-gateway).
Step 2: Filter the trace list to that window.
Step 3: Open the error‑trace analysis page; ARMS highlights three distinguishing features: one span name and two host IPs, 10.0.0.47 and 10.0.0.37.
Steps 3.1‑3.3: Drill down into each dimension with a query such as:
serviceName="mall-gateway" AND spanName="/components/api/v1/mall/product" AND ip="10.0.0.47"
Every call matching this filter is an error, which confirms that this host‑interface pair is faulty.
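The drill‑down check can be reproduced on exported span data. The sketch below is a local stand‑in for the filter, not the ARMS query engine: it assumes each span is a dictionary with a boolean "error" field, which is a hypothetical layout, and computes the error rate among spans matching a set of equality conditions. The host 10.0.0.12 and the span counts are invented sample data.

```python
def error_rate(spans, **conditions):
    """Return (error_rate, matched_count) for spans whose fields equal
    every given condition; error_rate is None when nothing matches."""
    matched = [
        s for s in spans
        if all(s.get(field) == value for field, value in conditions.items())
    ]
    if not matched:
        return None, 0
    errors = sum(1 for s in matched if s["error"])
    return errors / len(matched), len(matched)


# Synthetic spans (assumption): the suspect host fails every call to the
# product interface, while a placeholder host serves the same interface cleanly.
product = "/components/api/v1/mall/product"
spans = (
    [{"serviceName": "mall-gateway", "spanName": product,
      "ip": "10.0.0.47", "error": True}] * 40
    + [{"serviceName": "mall-gateway", "spanName": product,
        "ip": "10.0.0.12", "error": False}] * 60
)

rate, count = error_rate(
    spans, serviceName="mall-gateway", spanName=product, ip="10.0.0.47"
)
print(f"{count} matching spans, error rate {rate:.0%}")
```

An error rate of 100% under the combined filter, set against a much lower rate for the same span on other hosts, is what points to the host‑interface pair rather than to the interface alone.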
Slow Trace Investigation
Step 1: Detect a period (13:40‑13:49) during which mall-user-server has many calls taking longer than 5 s.
Step 2: Set the latency threshold to 5 s and filter the trace list accordingly.
Step 3: Identify slow spans /components/api/v1/local/success, /components/api/v1/http/success and host IP 10.0.0.44.
Further queries confirm each span consistently exceeds 5 s, pinpointing the slow interfaces and the host they run on.
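The same confirmation step can be sketched for latency. Assuming spans are exported as dictionaries with a "duration_ms" field (a hypothetical layout, not ARMS's export format), the helper below groups spans by any key and reports the fraction that exceed the 5 s threshold. The fast span and the host 10.0.0.12 are placeholder sample data.

```python
from collections import defaultdict


def slow_share_by(spans, key, threshold_ms=5000):
    """Group spans by `key` and return the fraction in each group whose
    duration exceeds threshold_ms."""
    totals = defaultdict(int)
    slow = defaultdict(int)
    for span in spans:
        totals[span[key]] += 1
        if span["duration_ms"] > threshold_ms:
            slow[span[key]] += 1
    return {group: slow[group] / totals[group] for group in totals}


# Synthetic spans (assumption): one slow interface on the suspect host and
# one fast placeholder interface on another host.
spans = (
    [{"spanName": "/components/api/v1/local/success",
      "ip": "10.0.0.44", "duration_ms": 6200}] * 20
    + [{"spanName": "/placeholder/fast",
        "ip": "10.0.0.12", "duration_ms": 120}] * 80
)

by_span = slow_share_by(spans, "spanName")
by_host = slow_share_by(spans, "ip")
for name, share in by_span.items():
    print(f"{name}: {share:.0%} of calls over 5 s")
```

Grouping by span name and then by IP separates "this interface is always slow" from "everything on this host is slow", which is the distinction the drill‑down is meant to settle.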
After locating the problematic host or interface, engineers can review recent code changes, resource usage, or downstream dependencies to resolve the issue.
Best Practices
Iteratively compare abnormal and normal traces until the root cause converges.
Leverage custom filter conditions to focus on specific services, spans, or IPs.
Combine trace analysis with code‑change logs and infrastructure metrics for comprehensive debugging.
Using ARMS error/slow trace analysis reduces reliance on intuition, shortens incident‑response time during high‑traffic events (e.g., 618, Double 11), and allows teams to proactively optimize error‑prone or latency‑heavy components.
Alibaba Cloud Native
