
How to Quickly Diagnose Error and Latency Issues in Cloud‑Native Applications

This article outlines a practical, end-to-end approach to diagnosing errors and slow requests in online systems. It combines distributed traces (call chains), correlated logs, entity relationships, and LLM-driven analysis to isolate root causes quickly.

Alibaba Cloud Native

Dec 24, 2024


Understanding "Error" and "Slow" Risks in Online Applications

Online services face two major categories of risk. "Error" covers unexpected program behavior, such as the JVM loading the wrong version of a class, a misconfigured environment, or a runtime exception. "Slow" covers resource shortages, such as CPU spikes, an exhausted database connection pool, memory leaks, or long GC pauses. Both call for the same response: limit the damage quickly, locate the root cause, and fix it.

Effective Diagnosis Framework

Locate the abnormal request via the call chain and its associated data: Distributed tracing records the request path, and the logs, stack traces, parameters, and exceptions attached to it allow the problem to be pinpointed down to a line of code. Example: an order timeout seen on the app side is traced to method C of service B taking more than 3 seconds.

Analyze the true root cause through entity-level correlation: Errors often stem from untested releases, infrastructure failures, or traffic spikes, none of which are visible in the failing request itself. Linking the abnormal request to the entities around it uncovers these deeper causes, for example a database connection pool saturated by another service's heavy query.

Combine high-quality data, domain knowledge, and large language models for intelligent diagnosis: A unified observability platform collects full-stack multimodal data, builds a semantic entity-relationship model, and uses LLMs backed by a domain knowledge base to automate root-cause attribution for error and slow scenarios.
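The first step can be sketched in a few lines. This is a minimal illustration, assuming a trace is available as a flat list of spans with parent links; the `Span` fields, the service and method names, and the 3-second threshold are hypothetical and do not reflect any particular tracing product's data model. A slow parent span is often slow only because one of its children is slow, so the sketch keeps the slow spans that have no slow child.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    method: str
    duration_ms: float


def deepest_slow_spans(spans, threshold_ms=3000.0):
    """Return slow spans that have no slow child, i.e. where the delay originates."""
    slow = [s for s in spans if s.duration_ms > threshold_ms]
    # IDs of spans that are the parent of at least one slow span
    slow_parent_ids = {s.parent_id for s in slow}
    return [s for s in slow if s.span_id not in slow_parent_ids]


# Hypothetical trace: an app-side order request that fans out to two services
spans = [
    Span("1", None, "app", "submitOrder", 3400.0),
    Span("2", "1", "service-A", "checkStock", 120.0),
    Span("3", "1", "service-B", "methodC", 3150.0),
]

for s in deepest_slow_spans(spans):
    print(f"{s.service}.{s.method} took {s.duration_ms:.0f} ms")
```

Here the root span is also over the threshold, but it is filtered out because its child in service B accounts for the delay.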

Trace‑Based Slow‑Request Diagnosis

Slow-request analysis focuses on identifying the line of code that consumes the most time. Instrumentation-based tracing usually records only the instrumented call points, so it often lacks the complete local method stack and developers cannot see where time is spent inside a method.

Alibaba Cloud ARMS offers a "continuous profiling – code hotspot" feature that automatically captures full local method stacks for slow requests, enabling line‑level diagnosis.

Filter call chains by application, interface, and latency to find candidate slow requests and examine distribution patterns (e.g., single‑machine concentration).

Use the waterfall view to locate the critical service interface that dominates overall latency.

Inspect the recorded code hotspot to obtain the exact method line responsible for the slowdown and guide code optimization.
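The filtering step above can be approximated outside any console. The sketch below assumes trace summaries have already been exported as simple records; `TraceSummary`, its fields, and the sample values are hypothetical. It selects slow requests for one application and interface, then checks whether they cluster on a single host.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    application: str
    interface: str
    host: str
    duration_ms: float


def slow_traces(traces, application, interface, min_duration_ms):
    """Keep traces for one application/interface at or above a latency floor."""
    return [
        t for t in traces
        if t.application == application
        and t.interface == interface
        and t.duration_ms >= min_duration_ms
    ]


def host_concentration(traces):
    """Return (host, share) for the host holding most of the traces, or None if empty."""
    if not traces:
        return None
    host, count = Counter(t.host for t in traces).most_common(1)[0]
    return host, count / len(traces)


# Hypothetical exported trace summaries
traces = [
    TraceSummary("t1", "order-service", "/order/create", "host-1", 180.0),
    TraceSummary("t2", "order-service", "/order/create", "host-2", 3200.0),
    TraceSummary("t3", "order-service", "/order/create", "host-2", 4100.0),
    TraceSummary("t4", "order-service", "/order/create", "host-1", 3050.0),
    TraceSummary("t5", "order-service", "/order/create", "host-2", 3600.0),
    TraceSummary("t6", "coupon-service", "/coupon/list", "host-2", 5000.0),
]

candidates = slow_traces(traces, "order-service", "/order/create", 3000.0)
print(host_concentration(candidates))
```

A high share on one host suggests a machine-level cause worth checking before reading code; an even spread points back toward the application itself.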

Error‑Request Diagnosis

Error requests are split into service errors (e.g., HTTP 5xx, RuntimeException) and business errors (e.g., coupon validation failure). Diagnosis steps include:

Bidirectional trace‑log correlation: For service errors, start from the failing call chain and jump to related logs; for business errors, search logs for business keywords, then use the TraceId to trace upstream/downstream context.

Trace‑linked exception stack: Java exceptions contain detailed stack traces; associating the TraceId with the exception stack accelerates fault isolation.

Trace-linked request parameters: Input parameters can change the execution path, so capturing them helps determine whether a request is abnormal. Because response payloads can be large, usually only their size is recorded rather than their full content.
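Bidirectional trace-log correlation depends on every log record carrying the TraceId of the request that produced it. The sketch below assumes logs are already parsed into dictionaries with `trace_id` and `message` keys; those key names and the sample records are hypothetical. It shows both directions: from a failing trace to its logs, and from a business keyword to the traces that contain it.

```python
from collections import defaultdict


def index_logs_by_trace(log_records):
    """Group log records by trace ID; records without a trace ID are skipped."""
    index = defaultdict(list)
    for rec in log_records:
        trace_id = rec.get("trace_id")
        if trace_id:
            index[trace_id].append(rec)
    return index


def logs_for_trace(index, trace_id):
    """Trace -> logs: everything logged while serving one request."""
    return index.get(trace_id, [])


def trace_ids_for_keyword(log_records, keyword):
    """Logs -> traces: trace IDs whose logs mention a business keyword, in first-seen order."""
    found = []
    for rec in log_records:
        trace_id = rec.get("trace_id")
        if trace_id and keyword in rec.get("message", "") and trace_id not in found:
            found.append(trace_id)
    return found


# Hypothetical parsed log records
logs = [
    {"trace_id": "t1", "level": "ERROR", "message": "coupon validation failed for member 42"},
    {"trace_id": "t1", "level": "INFO", "message": "order request received"},
    {"trace_id": "t2", "level": "INFO", "message": "order request received"},
    {"trace_id": None, "level": "WARN", "message": "coupon validation failed during startup check"},
]

index = index_logs_by_trace(logs)
print(trace_ids_for_keyword(logs, "coupon validation failed"))
print(len(logs_for_trace(index, "t1")))
```

The trace IDs returned by the keyword search are the starting points for inspecting upstream and downstream context in the call chain.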

Building a Unified Entity Relationship Model

Beyond trace data, real root causes often involve cross‑domain entities such as host hardware, gateways, databases, K8s workloads, CI/CD pipelines, or Git commit authors. By constructing a comprehensive entity graph, any change (e.g., a DB index alteration) can be traced to downstream effects (e.g., slow SQL, failed orders), enabling faster, more accurate diagnosis.
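One simple way to picture such an entity graph is as a set of directed "affects" edges that can be walked from a changed entity to everything downstream of it. The sketch below is only an illustration of that idea; the entity names and the edge list are hypothetical, and a production model would carry typed entities and relationships rather than plain strings.

```python
from collections import deque


def downstream_impact(edges, changed_entity):
    """Breadth-first walk over (source, target) edges; returns affected entities in visit order."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    seen = {changed_entity}
    affected = []
    queue = deque([changed_entity])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                affected.append(nxt)
                queue.append(nxt)
    return affected


# Hypothetical "affects" relationships between entities
edges = [
    ("db-index:orders.idx_user", "sql:select_orders_by_user"),
    ("sql:select_orders_by_user", "service:order-service"),
    ("service:order-service", "api:/order/create"),
    ("host:node-7", "service:coupon-service"),
]

print(downstream_impact(edges, "db-index:orders.idx_user"))
```

Reversing the edge direction gives the opposite query: starting from a failed order and walking upstream to candidate causes.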

Intelligent Root‑Cause Diagnosis with Large Language Models

Combining high‑quality multimodal data, domain expertise, and LLMs (including multi‑agent workflows and Retrieval‑Augmented Generation) transforms traditional rule‑based diagnostics into a more adaptable, accurate system. Alibaba Cloud ARMS has already deployed an LLM‑driven “Copilot” that ingests call chains, method stacks, exception traces, SQL, and metrics to automatically identify error or latency causes and suggest optimizations.

Example: a request failed because the /coupon/coupon/member/list endpoint executed a SQL query with an empty IN clause, which caused a syntax error. The Copilot pinpointed the offending method and suggested how to correct it.
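The source does not show the failing code, so the following is only a sketch of the general pattern behind this class of bug and one common guard against it. The `coupon` table, its columns, and the function name are hypothetical. Several SQL dialects, MySQL among them, reject an empty `IN ()` list, so the query builder returns early when the list of IDs is empty. SQLite is used here purely to keep the example self-contained.

```python
import sqlite3


def find_coupons_by_member_ids(conn, member_ids):
    """Return (id, member_id) rows for the given members; no query is sent for an empty list."""
    ids = list(member_ids)
    if not ids:
        # Guard: never build "IN ()", which is a syntax error in some databases
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT id, member_id FROM coupon "
        f"WHERE member_id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(sql, ids).fetchall()


# Hypothetical schema and data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coupon (id INTEGER PRIMARY KEY, member_id INTEGER)")
conn.executemany(
    "INSERT INTO coupon (id, member_id) VALUES (?, ?)",
    [(1, 10), (2, 11), (3, 10)],
)

print(find_coupons_by_member_ids(conn, [10]))
print(find_coupons_by_member_ids(conn, []))
```

Only the placeholders are interpolated into the SQL text; the values themselves are still passed as bound parameters.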

The current Copilot still has limitations, including inference latency and occasionally unstable output. Continued improvements in data quality, domain knowledge, and algorithms are expected to make intelligent root-cause diagnosis more robust and to extend it to future monitoring and impact-analysis scenarios.


