
How to Use Cloud‑Native Gateway Observability for Rapid Fault Detection and Root‑Cause Analysis

This article explains how observability (logging, metrics, and distributed tracing) helps operators of a cloud‑native gateway detect failures early, pinpoint the failing route and service with ARMS Prometheus dashboards and SLS logs, and trace the root cause with ARMS xtrace, improving system reliability and stability.

Alibaba Cloud Native

Sep 9, 2023

Observability Fundamentals

Observability is the capability to monitor, understand, and debug the runtime state, performance, and behavior of distributed systems. It is built on three pillars:

Logging: Timestamped, ideally structured records of events emitted by applications. Common platforms include the ELK stack (Elasticsearch, Logstash, Kibana) and Splunk.

Metrics: Numeric measurements sampled over time, such as CPU usage, memory consumption, and request latency. Prometheus and InfluxDB are typical metric stores.

Distributed Tracing: Per‑request records that carry a shared trace ID across services, so the full call chain of a request can be reconstructed. Popular implementations include Zipkin and Apache SkyWalking.

Observability Stack for a Cloud‑Native Gateway

The gateway integrates Alibaba Cloud services:

SLS (Log Service) for centralized, structured log collection.

ARMS (Application Real‑Time Monitoring Service) providing Prometheus‑compatible metrics and xtrace‑based distributed tracing.

The following workflow demonstrates fault discovery and root‑cause analysis using these components.

Fault‑Discovery Workflow

An alert fires for the gateway instance (alert rules are configured at the instance level).

The Prometheus dashboard in ARMS is consulted to locate the failing route and upstream service.

The SLS log console is queried for detailed error information.

ARMS xtrace is used to trace the failing request and identify the exact service error.

1. Alert Configuration

Alert rules are defined per gateway instance. Notification channels include email, SMS, and phone. In this example, when the gateway failure rate exceeds the configured threshold, an email is sent to the operator.
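The threshold check behind such a rule can be sketched in a few lines. This is only an illustration of the idea, not how ARMS evaluates alert rules: the WindowStats type, the 5% threshold, and the minimum request count are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one gateway instance in one evaluation window (hypothetical type)."""
    total: int
    failed: int


def failure_rate(stats: WindowStats) -> float:
    """Share of failed requests; 0.0 when the window saw no traffic."""
    if stats.total == 0:
        return 0.0
    return stats.failed / stats.total


def should_alert(stats: WindowStats, threshold: float = 0.05, min_requests: int = 20) -> bool:
    """Fire only when there is enough traffic and the failure rate exceeds the threshold.

    Both default values are assumptions for illustration.
    """
    if stats.total < min_requests:
        return False
    return failure_rate(stats) > threshold


print(should_alert(WindowStats(total=200, failed=30)))
```

The minimum request count is a design choice: it keeps a single failed request in a quiet window from triggering a notification.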

2. Initial Diagnosis with ARMS Prometheus

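In ARMS, under Business Monitoring → Global Dashboard, the gateway failure rate is compared with the failure rate of the upstream services. The dashboard shows:

The overall gateway failure rate is higher than the upstream failure rate, which indicates that some requests fail at the gateway itself in addition to those that fail in an upstream service.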

Only the route named spring has a non‑zero failure rate.

Only the service springboot-svc-1.app-system.svc.cluster.local reports failures.

Both the route and service return 5xx status codes, confirming that the spring route points to a faulty upstream service.
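The per‑route and per‑service breakdown that the dashboard provides can be approximated as follows. This is a minimal sketch over made‑up request counts: the sample layout, the field names route, service, status, and count, and the service name used for httpbin are assumptions for illustration and do not reflect the actual ARMS metric schema.

```python
from collections import defaultdict


def failure_rates(samples, key):
    """Return {label value: share of requests answered with 5xx} for one label."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for sample in samples:
        label = sample[key]
        totals[label] += sample["count"]
        if 500 <= sample["status"] <= 599:
            failures[label] += sample["count"]
    return {label: failures[label] / totals[label] for label in totals if totals[label]}


# Made-up request counts; field names are hypothetical.
SPRING_SVC = "springboot-svc-1.app-system.svc.cluster.local"
samples = [
    {"route": "httpbin", "service": "httpbin", "status": 200, "count": 90},
    {"route": "spring", "service": SPRING_SVC, "status": 200, "count": 30},
    {"route": "spring", "service": SPRING_SVC, "status": 500, "count": 10},
]

failing_routes = {k: v for k, v in failure_rates(samples, "route").items() if v > 0}
failing_services = {k: v for k, v in failure_rates(samples, "service").items() if v > 0}
print(failing_routes)    # {'spring': 0.25}
print(failing_services)
```

Keeping only the labels with a non‑zero rate mirrors the dashboard reading above: one route and one service stand out, and they point at each other.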

3. Detailed Log Analysis with SLS

In the SLS console, clicking the response_code field automatically generates a query that reveals three response codes during the incident: 200, 404, and 500.

Filtering for 404 shows requests that did not match any route.

Filtering for 500 displays the associated route (spring) and service (springboot-svc-1.app-system.svc.cluster.local), confirming that the backend service is the source of the error.
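The same two steps, counting by response code and then filtering on one code, can be sketched offline on exported log entries. Only the response_code field is named in this article; route_name and upstream_service are hypothetical field names, and the entries below are made up.

```python
from collections import Counter


def count_by_response_code(logs):
    """Count log entries per response code."""
    return Counter(entry["response_code"] for entry in logs)


def routes_and_services_for(logs, code):
    """Return the distinct (route, service) pairs seen for one response code.

    Entries without a matched route yield (None, None).
    """
    return {
        (entry.get("route_name"), entry.get("upstream_service"))
        for entry in logs
        if entry["response_code"] == code
    }


# Made-up entries; route_name and upstream_service are assumed field names.
logs = [
    {"response_code": 200, "route_name": "httpbin", "upstream_service": "httpbin"},
    {"response_code": 200, "route_name": "spring",
     "upstream_service": "springboot-svc-1.app-system.svc.cluster.local"},
    {"response_code": 404},
    {"response_code": 500, "route_name": "spring",
     "upstream_service": "springboot-svc-1.app-system.svc.cluster.local"},
]

print(count_by_response_code(logs))
print(routes_and_services_for(logs, 500))
print(routes_and_services_for(logs, 404))
```

In this sketch the 404 entry carries no route or service, which matches the reading above: those requests failed at the gateway before any route was selected.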

4. Root‑Cause Identification via ARMS xtrace

After xtrace tracing is enabled on the gateway, the trace ID of a failing request is taken from the SLS logs. Searching for this trace ID in the xtrace UI reveals a call chain in which the service springboot-svc-4-2 throws an error, pinpointing the exact root cause.
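The reasoning applied when reading such a call chain can be sketched as: among the spans that report an error, the likely origin is the one with no failing child, because errors propagate back up to the callers. The Span type and the three‑hop chain below are assumptions for illustration; this is not the xtrace data model or API, and the article does not state how many hops the real chain has.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One hop in a trace (hypothetical, simplified shape)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    error: bool


def root_cause_spans(spans: List[Span]) -> List[Span]:
    """Return errored spans that have no errored child span."""
    parents_of_errors = {s.parent_id for s in spans if s.error and s.parent_id is not None}
    return [s for s in spans if s.error and s.span_id not in parents_of_errors]


# Assumed chain: gateway -> springboot-svc-1 -> springboot-svc-4-2.
trace = [
    Span("a", None, "gateway", error=True),
    Span("b", "a", "springboot-svc-1", error=True),
    Span("c", "b", "springboot-svc-4-2", error=True),
]

for span in root_cause_spans(trace):
    print(span.service)
```

The gateway and springboot-svc-1 also report errors in this sketch, but only because the failure bubbles up through them; the filter leaves the deepest failing service.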

Test Environment Overview

An ACK Kubernetes cluster hosts several Spring Boot services. The gateway routes traffic to these services, including a deliberately crashed instance springboot-svc-4-2. Three request patterns are exercised:

Normal request routed to httpbin.

Request that fails at the gateway because no route matches (produces 404).

Request that reaches an upstream Spring service and triggers a 5xx error.

The gateway’s error rate spikes when the upstream service fails, providing a clear signal for the observability pipeline.

Summary

By configuring gateway‑level alerts, reading ARMS Prometheus metrics, querying structured SLS logs, and analyzing distributed traces with ARMS xtrace, operators can detect faults automatically, quickly isolate the problematic route and service, and determine the precise root cause (in this case, an error in springboot-svc-4-2). This end‑to‑end observability pipeline improves reliability and reduces mean time to resolution.

Technical reference for installing the ARMS Java agent on ACK: https://help.aliyun.com/zh/arms/application-monitoring/getting-started/install-arms-agent-for-java-applications-deployed-in-ack?spm=a2c4g.11186623.0.i6#arms-cs-k8s-java
