
How to Quickly Diagnose Error and Performance Issues in Cloud‑Native Applications

This article outlines a comprehensive approach to identifying and resolving both error‑related and slow‑request problems in online systems by leveraging trace data, log correlation, method‑stack analysis, unified entity models, and large‑language‑model assistance to accelerate root‑cause diagnosis.

Alibaba Cloud Observability

Online Application Risk: “Error” vs “Slow”

From a development and operations perspective, online applications face two major risk categories: “error” problems caused by unexpected program behavior (e.g., wrong JVM class versions, mis‑configured environments) and “slow” problems caused by resource shortages (e.g., CPU spikes, full connection pools, memory leaks, frequent GC).

Both categories require rapid loss‑mitigation, root‑cause location, and risk elimination, yet the complex dependencies between services make pinpointing the faulty node challenging.

Effective Diagnosis Framework

Locate the abnormal request object using trace links and associated data: Trace tracking follows a request’s path across distributed systems, correlating logs, method stacks, parameters, and exception traces to achieve line‑level code location. Example: an app‑side timeout is traced to a specific method that exceeds 3 seconds.
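The timeout example above can be sketched as a simple latency guard: wrap a call, measure wall-clock time, and flag it when it exceeds a threshold (the article's example uses 3 seconds). The class and method names here are hypothetical, not part of any ARMS API.

```java
import java.util.function.Supplier;

// Hypothetical sketch: measure a call's wall-clock duration and report
// whether it exceeded a configured latency threshold.
public class SlowCallDetector {
    private final long thresholdMillis;

    public SlowCallDetector(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    // Returns true when the wrapped call exceeded the threshold.
    public <T> boolean exceeded(Supplier<T> call) {
        long start = System.nanoTime();
        call.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return elapsedMillis > thresholdMillis;
    }
}
```

In a real trace pipeline this measurement would be attached to a span rather than printed or returned locally, but the detection rule is the same comparison.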

Analyze the true root cause via entity data linked to the abnormal object: Errors or slowness often stem from broader changes such as untested releases, infrastructure failures, or traffic spikes. Cross‑domain entity association (e.g., linking a slow SQL to a saturated database connection pool) is required for deep diagnosis.

Combine high‑quality data, domain knowledge, and large‑model algorithms for intelligent root‑cause diagnosis: Building a unified observability platform that collects end‑to‑end stack data, constructs a semantic entity‑relationship model, and leverages LLMs with domain knowledge enables automated attribution for classic ops scenarios.

Slow‑Request Diagnosis

The key is to find the code line that truly consumes time. Traditional instrumentation often misses full local method stacks, leaving developers aware of a slow endpoint but blind to internal bottlenecks.

Alibaba Cloud ARMS’s “code hotspot” feature continuously captures complete local method stacks for slow requests, allowing line‑level identification. Typical steps:

Filter calls by application, interface, and latency to surface slow‑request patterns.

Use the waterfall view to pinpoint the service interface that dominates total latency.

Inspect the recorded method stack to locate the exact code line and guide optimization.

Once the slow method and its upstream/downstream calls are known, developers can modify logic to resolve the issue.
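The idea behind continuous method-stack capture can be illustrated with a toy sampler: periodically record which frame is on top of a thread's stack and count the most frequent one. Production profilers (including ARMS's code hotspot feature) are far more sophisticated; this only sketches the principle.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stack sampler: repeatedly capture a thread's top frame and count
// occurrences. The most-seen frame approximates the code hotspot.
public class StackSampler {
    private final Map<String, Integer> hits = new HashMap<>();

    // Take one sample of the target thread's current top frame.
    public void sample(Thread target) {
        StackTraceElement[] stack = target.getStackTrace();
        if (stack.length > 0) {
            String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
            hits.merge(frame, 1, Integer::sum);
        }
    }

    // The frame observed most often, i.e. the likely hotspot.
    public String hottestFrame() {
        return hits.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("none");
    }
}
```

Called on a timer against a busy thread, the hit counts converge on the line-level bottleneck even when no instrumentation point exists inside the slow method.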

Error‑Request Diagnosis

Errors are divided into service errors (e.g., HTTP 5xx, RuntimeException) and business errors (e.g., coupon validation failure). Diagnosis steps include:

Trace‑log bi‑directional correlation: For service errors, follow the call chain to the failing service and view related logs. For business errors, search logs for business keywords, then use the associated TraceId to trace upstream/downstream context.

Trace‑exception stack linking: In Java, exceptions contain detailed stack traces. By linking the TraceId to the exception stack, developers can quickly locate the problematic code.

Trace‑request parameter association: Input parameters can influence which execution branch a request takes, so correlating recorded parameters helps diagnose cases that are otherwise ambiguous. For large payloads, typically only the size is recorded rather than the full content.

Combining trace, logs, exception stacks, and parameters enables precise identification of each request’s anomaly and improves error‑diagnosis efficiency.
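The three correlations above all reduce to one practice: every log line, exception stack, and recorded parameter carries the TraceId, so each artifact can be joined back to the call chain and vice versa. A minimal sketch, with illustrative field names rather than an actual ARMS log schema:

```java
// Sketch of trace-log correlation: log lines embed the TraceId so a
// keyword hit in the logs can be joined to the full call chain.
public class TraceLogger {
    private static final int MAX_PARAM_LEN = 64; // truncate large payloads

    public static String format(String traceId, String level, String msg, String params) {
        String p = params.length() > MAX_PARAM_LEN
                ? params.substring(0, MAX_PARAM_LEN) + "...(" + params.length() + " chars)"
                : params;
        return String.format("[traceId=%s] [%s] %s params=%s", traceId, level, msg, p);
    }

    // Attach the TraceId when recording an exception stack, so the stack
    // can be located from the trace and the trace from the stack.
    public static String formatError(String traceId, Throwable t) {
        StringBuilder sb = new StringBuilder("[traceId=" + traceId + "] [ERROR] " + t);
        for (StackTraceElement e : t.getStackTrace()) {
            sb.append("\n  at ").append(e);
        }
        return sb.toString();
    }
}
```

In practice this is usually done through a logging framework's context mechanism (e.g., an MDC) rather than hand-built strings, but the correlation key is the same.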

Unified Entity Relationship Model for Real‑Root‑Cause Analysis

Beyond the immediate process, true root causes often involve broader entities such as host hardware, pods, databases, K8s workloads, CI/CD jobs, or even code commit authors. By constructing a cross‑domain entity graph, any change (e.g., a database index alteration) can be traced to downstream impacts (e.g., massive slow SQL, user order failures). This model breaks data silos and lays the foundation for intelligent diagnosis.
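The index-alteration example maps naturally onto a directed graph walk: nodes are entities, edges mean "a change here can impact there", and the downstream closure of a changed entity is its blast radius. A toy sketch with illustrative entity names:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy cross-domain entity graph: a change to one entity is traced to all
// entities it can transitively impact via a breadth-first walk.
public class EntityGraph {
    private final Map<String, List<String>> impacts = new HashMap<>();

    public void addImpact(String from, String to) {
        impacts.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // All entities reachable downstream of a changed entity.
    public Set<String> downstreamOf(String changed) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(changed));
        while (!queue.isEmpty()) {
            for (String next : impacts.getOrDefault(queue.poll(), List.of())) {
                if (seen.add(next)) queue.add(next);
            }
        }
        return seen;
    }
}
```

A real entity model additionally carries typed relations (runs-on, deployed-by, queries) and timestamps so that a change event can be ranked against the anomaly's onset, but the traversal idea is the same.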

Intelligent Diagnosis with Large Language Models

High‑quality, multi‑modal observability data combined with LLMs and domain‑specific knowledge bases enables automated root‑cause identification. Recent advances include:

Broader, higher‑quality data collection via OpenTelemetry and standardized observability pipelines.

LLM‑driven algorithms that surpass rule‑based or statistical methods, especially when augmented with Retrieval‑Augmented Generation (RAG) and workflow orchestration.

Alibaba Cloud ARMS has already deployed an LLM‑powered single‑trace diagnosis that fuses call chains, method stacks, exception stacks, SQL, and metrics to pinpoint error or slow causes and suggest optimizations.
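One way such a fusion step might look is serializing the multi-modal evidence for a single trace into one context handed to the model. This is a hedged sketch of the general pattern; ARMS's actual prompt format is not public, and the section layout here is an assumption.

```java
// Hypothetical sketch: assemble call chain, method stack, exception
// stack, SQL, and metrics into a single diagnosis context for an LLM.
public class DiagnosisContext {
    public static String build(String callChain, String methodStack,
                               String exceptionStack, String sql, String metrics) {
        return String.join("\n",
                "## Call chain", callChain,
                "## Method stack", methodStack,
                "## Exception stack", exceptionStack,
                "## SQL", sql,
                "## Metrics", metrics,
                "## Task",
                "Identify the most likely root cause and suggest a fix.");
    }
}
```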

Example: a request failure was traced to the queryMemberCoupons method executing a SQL statement with an empty IN clause, causing a syntax error. The LLM‑enhanced Copilot surfaced this root cause and a remediation suggestion, demonstrating the potential of AI‑assisted operations despite current latency and stability challenges.
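The underlying bug is common: "IN ()" is invalid SQL in most dialects. A defensive query builder can short-circuit on an empty id list instead of emitting broken SQL. The table and column names below are illustrative, not from the incident:

```java
import java.util.List;
import java.util.stream.Collectors;

// Guard against the empty-IN-clause failure described above.
public class CouponQueryBuilder {
    public static String memberCouponsSql(List<Long> couponIds) {
        if (couponIds == null || couponIds.isEmpty()) {
            // Empty id list: return a query that matches nothing rather
            // than emitting "IN ()", which is a syntax error.
            return "SELECT * FROM member_coupon WHERE 1 = 0";
        }
        String ids = couponIds.stream().map(String::valueOf)
                .collect(Collectors.joining(", "));
        return "SELECT * FROM member_coupon WHERE coupon_id IN (" + ids + ")";
    }
}
```

(In production code a parameterized query or an ORM that handles empty collections would be preferable to string concatenation; the point is the empty-list guard.)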

Tags: APM, LLM, trace, root cause analysis, performance debugging
Written by

Alibaba Cloud Observability

Driving continuous progress in observability technology!
