Mastering Error and Latency Diagnosis for Online Applications
This article presents a systematic root‑cause diagnosis framework for online applications, covering how to identify and resolve both error ("wrong") and performance ("slow") problems using trace links, associated data, high‑quality observability, and large‑language‑model‑driven intelligence.
Online Application Risk: “Error” and “Slow” Issues
Online services typically encounter two major risk categories: “error” (program behavior deviates from expectations, e.g., wrong JVM class version, exception branches, mis‑configured environment) and “slow” (performance degradation caused by resource shortages such as CPU spikes, connection‑pool exhaustion, memory leaks, etc.).
From a development‑operations perspective, rapid loss mitigation, root‑cause localization, and hazard elimination are essential, yet complex application dependencies make pinpointing the faulty node challenging.
Based on a decade of APM product development and customer support experience, a practical root‑cause diagnosis approach for error and slow requests has been distilled into three key steps:
Locate the abnormal request object using trace links and associated data: Distributed tracing follows a request across components and correlates logs, method stacks, parameters, and exception traces with it, so the problem can be localized down to a line of code.
Analyze the true root cause via entity data linked to the abnormal object: Errors often stem from untested releases, infrastructure failures, or traffic spikes. Building cross‑domain entity relationships (e.g., linking slow SQL to a saturated connection pool) uncovers these deeper causes.
Leverage high‑quality data, domain knowledge, and large‑model algorithms for intelligent diagnosis: A unified observability platform collects full‑stack multimodal data, constructs a semantic entity‑relationship model, and combines LLM reasoning with a domain knowledge base to automate root‑cause attribution.
Slow‑Request Diagnosis: Trace + Method Stack
Identifying the critical path that dominates total latency and drilling down to method‑level (or line‑level) details is crucial. Traditional instrumentation often lacks complete local method stacks, making it hard for developers to pinpoint slow code inside an interface.
The code hotspot feature of continuous profiling in Alibaba Cloud ARMS automatically captures the full local method stack for slow requests, enabling line‑level code location.
Typical workflow:
Filter calls by application, interface, and latency; examine distribution to spot single‑machine anomalies.
Use waterfall charts to pinpoint the service interface that dominates latency.
Reference the recorded code hotspot to obtain the exact slow code line and guide optimization.
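The waterfall step above amounts to finding the span whose own work, excluding time spent in its children, dominates the request. The following is a minimal sketch of that idea, not the ARMS implementation: the Span fields, the sample trace, and the service names are hypothetical, and real trace data would come from the tracing backend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int


def self_times(spans):
    """Return {span_id: span duration minus time covered by its direct children}."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    result = {}
    for s in spans:
        # Clip child intervals to the parent and sort them so that
        # overlapping (parallel) children are not counted twice.
        intervals = sorted(
            (max(c.start_ms, s.start_ms), min(c.end_ms, s.end_ms))
            for c in children.get(s.span_id, [])
        )
        covered = 0
        cursor = s.start_ms
        for start, end in intervals:
            start = max(start, cursor)
            if end > start:
                covered += end - start
                cursor = end
        result[s.span_id] = (s.end_ms - s.start_ms) - covered
    return result


def slowest_span(spans):
    """Return the span with the largest self time, plus that self time."""
    times = self_times(spans)
    by_id = {s.span_id: s for s in spans}
    worst = max(times, key=times.get)
    return by_id[worst], times[worst]


# Hypothetical trace: one entry request with two downstream calls.
trace = [
    Span("a", None, "GET /order", 0, 1000),
    Span("b", "a", "inventory-service /check", 50, 250),
    Span("c", "a", "coupon-service /list", 300, 950),
    Span("d", "c", "SELECT coupon", 320, 900),
]

span, cost = slowest_span(trace)
print(span.name, cost)
```

In this sample the SQL span accounts for most of the latency, so that is where the method stack or code hotspot would be inspected next.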
Error‑Request Diagnosis: Trace + Logs + Exception Stack / Parameters
Errors split into service‑level exceptions (e.g., HTTP 5xx, RuntimeException) and business‑level failures (e.g., coupon expired). Diagnosis involves:
Bidirectional trace‑log correlation: For service errors, locate the failing call chain and then its related logs; for business errors, search logs for business keywords and trace back to the call chain via TraceId.
Trace‑linked exception stack: Java exceptions contain detailed stack traces that can be associated with the request’s TraceId.
Trace‑linked request parameters: Input parameters influence execution paths, so they are recorded with the trace; for outputs, usually only the size is recorded rather than the full payload.
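The log-to-trace direction of the correlation described above can be sketched in a few lines. This is an illustration under stated assumptions: the log format with a traceId=... field and the sample lines are hypothetical, and a real system would query a log store rather than scan a list.

```python
import re

# Assumed log format: every line carries a "traceId=<id>" field.
TRACE_ID = re.compile(r"traceId=(\w+)")


def trace_ids_for_keyword(log_lines, keyword):
    """Business error -> traces: collect TraceIds of lines containing the keyword."""
    ids = []
    for line in log_lines:
        if keyword in line:
            match = TRACE_ID.search(line)
            if match and match.group(1) not in ids:
                ids.append(match.group(1))
    return ids


def logs_for_trace(log_lines, trace_id):
    """Trace -> logs: return every line that belongs to the given TraceId."""
    matched = []
    for line in log_lines:
        match = TRACE_ID.search(line)
        if match and match.group(1) == trace_id:
            matched.append(line)
    return matched


# Hypothetical log lines for two requests.
logs = [
    "2024-05-01 10:00:01 INFO traceId=abc123 received /coupon/use",
    "2024-05-01 10:00:01 WARN traceId=abc123 coupon expired couponId=77",
    "2024-05-01 10:00:02 INFO traceId=def456 received /coupon/use",
    "2024-05-01 10:00:02 INFO traceId=def456 coupon applied couponId=78",
]

for trace_id in trace_ids_for_keyword(logs, "coupon expired"):
    for line in logs_for_trace(logs, trace_id):
        print(line)
```

Starting from the business keyword, the TraceId recovers the full log context of the failing request, which can then be matched against its call chain.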
Building a Unified Entity Relationship Model
Beyond immediate trace data, broader entity associations (such as host instances, databases, K8s workloads, CI/CD jobs, and Git commits) form a comprehensive end‑to‑end observability graph. A change in any entity can cascade downstream: for example, a database index change can trigger a surge of slow SQL queries, which in turn causes order failures in dependent services.
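One way to picture such an entity graph is as a directed graph whose edges mean "a change here can affect that entity". The sketch below is an assumption-laden illustration, not the platform's data model: the entity names and edges are invented, and the traversal is a plain breadth-first search.

```python
from collections import deque

# Hypothetical entity graph: an edge A -> B means "a change in A can affect B".
edges = {
    "git:commit-1a2b": ["ci:deploy-42"],
    "ci:deploy-42": ["db:orders"],
    "db:orders": ["svc:order-service"],
    "svc:order-service": ["svc:checkout"],
    "host:node-7": ["svc:order-service"],
}


def reachable(graph, start):
    """Breadth-first search: every entity reachable from start, in visit order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def reverse(graph):
    """Flip every edge so the same search walks upstream instead of downstream."""
    flipped = {}
    for src, targets in graph.items():
        for dst in targets:
            flipped.setdefault(dst, []).append(src)
    return flipped


# Blast radius of a database change.
print(reachable(edges, "db:orders"))
# Root-cause candidates for a failing service.
print(reachable(reverse(edges), "svc:order-service"))
```

Walking downstream estimates the impact of a change; walking upstream from a failing service lists the entities whose recent changes are worth checking first.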
Intelligent Root‑Cause Diagnosis with LLMs
Combining high‑quality multimodal data, domain expertise, and large‑language‑model (LLM) algorithms enables automated, accurate diagnosis of error and latency incidents. Recent advances include:
Broader, higher‑quality data collection via OpenTelemetry and unified standards.
LLM‑driven multi‑agent workflows that integrate RAG knowledge bases to reduce hallucinations and improve precision.
Alibaba Cloud ARMS now offers LLM‑based single‑trace intelligent diagnosis, aggregating call chains, method stacks, exception stacks, SQL, and metrics to pinpoint root causes and suggest optimizations.
Example: a request to /coupon/coupon/member/list failed because the generated SQL contained an empty IN clause, causing a syntax error.
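The usual fix for this class of bug is to guard the query builder against an empty list, since databases such as MySQL reject IN (). The sketch below shows the pattern only: the table name, column name, and %s placeholder style are assumptions, and the source does not show the application's real query.

```python
def build_in_query(member_ids):
    """Build a parameterized IN query, or return None when there is nothing to look up.

    The table and column names are hypothetical placeholders.
    """
    if not member_ids:
        # An empty list would render as "IN ()", which is a syntax error
        # in MySQL, so the caller should skip the query entirely.
        return None
    placeholders = ", ".join(["%s"] * len(member_ids))
    sql = f"SELECT * FROM coupon WHERE member_id IN ({placeholders})"
    return sql, list(member_ids)


print(build_in_query([]))
print(build_in_query([101, 102]))
```

Returning None lets the caller short-circuit to an empty result instead of sending a malformed statement to the database.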
Using Copilot‑style assistance can further accelerate diagnosis, though challenges such as latency and output stability remain.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
