
From Legacy Monitoring to Modern Observability: A Cloud‑Native Journey

This article traces the 30‑year evolution of system monitoring, explains the differences between monitoring, APM and observability, outlines key practices for building an observability platform, and provides a step‑by‑step guide to implementing Prometheus + Grafana in a cloud‑native environment.

Alibaba Cloud Native

Evolution of Observability

Observability has progressed through four major phases:

Late 1990s – client–server monitoring: Simple host and network metrics (CPU, memory, network I/O) were collected using first‑generation APM tools.

2000s – application‑level tracing: Browser–app–database three‑tier architectures and widespread Java adoption introduced code‑level tracing and database tuning, leading to second‑generation APM solutions.

2005–2010 – distributed and virtualized environments: SOA/ESB architectures, virtual machines, and third‑party components required full‑link tracing and monitoring of virtual resources.

2010–present – cloud‑native microservices: Container orchestration and service meshes lengthen call paths, making fault isolation harder. Modern observability covers metrics, logs, traces, and events across the entire application lifecycle.

Observability timeline

Monitoring vs. APM vs. Observability

Using an awareness‑understanding model, the three concepts map to distinct knowledge states:

Monitoring (known & understood): Collect concrete metrics such as CPU utilization.

APM (known but not understood): Add application‑level tracing to explain why a metric spikes.

Observability (unknown & not understood): Correlate logs, traces, metrics, and events to uncover hidden root causes.

Awareness vs Understanding model

Key Pillars of an Observability Stack

The stack consists of three pillars: Logging, Tracing, and Metrics. Successful implementation requires:

Full‑stack coverage: Capture data from infrastructure, containers, cloud services, and end‑user devices.

Unified standards: Use Prometheus for metrics, OpenTelemetry (or OpenTracing) for traces, Fluentd/Loki for logs, and visualize with Grafana.

Data quality: Define schemas, filter noise, and apply sampling strategies (e.g., adaptive trace sampling) to ensure accurate analysis.
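To make the unified‑standards point concrete, the sketch below renders labeled counters in the Prometheus text exposition format using only the standard library. In practice you would use the official client library (prometheus_client); the metric and label names here are hypothetical.

```python
from collections import defaultdict

class Counter:
    """A tiny labeled counter rendered in Prometheus text exposition format."""
    def __init__(self, name: str, help_text: str):
        self.name, self.help_text = name, help_text
        self.values = defaultdict(float)  # label tuple -> value

    def inc(self, **labels) -> None:
        # Sort labels so the same label set always maps to the same series.
        self.values[tuple(sorted(labels.items()))] += 1

    def render(self) -> str:
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for label_items, value in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in label_items)
            lines.append(f"{self.name}{{{label_str}}} {value}")
        return "\n".join(lines)

# Hypothetical service traffic: one success and one failure.
requests_total = Counter("http_requests_total", "Total HTTP requests")
requests_total.inc(service="order", status="200")
requests_total.inc(service="order", status="500")
print(requests_total.render())
```

The point of the exercise is the shared naming convention: once every service exposes the same metric name with the same label keys, Prometheus can aggregate across them with a single query.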

Observability data pillars

Observability Practice with Prometheus + Grafana

A typical open‑source observability platform combines:

Prometheus for metric collection from ECS, VPC, containers, and third‑party middleware.

Grafana for unified dashboards displaying the “golden triangle” (request volume, error rate, latency) and custom panels for user‑experience, application performance, container health, cloud services, and host nodes.

SkyWalking or Jaeger for distributed tracing.

ELK or Loki for log aggregation.

After adding data sources, Grafana automatically generates baseline dashboards (e.g., request volume, error rate). Teams can then create unified dashboards that overlay infrastructure, container, application, and user‑experience metrics for end‑to‑end performance monitoring.
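The golden‑triangle panels above boil down to three PromQL expressions. A minimal sketch of a helper that builds them for a given service — the metric names, label names, and the 5xx status convention are assumptions; adapt them to your own naming scheme:

```python
def golden_triangle_queries(service: str, window: str = "5m") -> dict:
    """Build PromQL for request volume, error rate, and p99 latency.

    Metric and label names are hypothetical examples.
    """
    base = f'http_requests_total{{service="{service}"}}'
    errors = f'http_requests_total{{service="{service}",status=~"5.."}}'
    latency = (
        "histogram_quantile(0.99, sum(rate("
        f'http_request_duration_seconds_bucket{{service="{service}"}}'
        f"[{window}])) by (le))"
    )
    return {
        "request_volume": f"sum(rate({base}[{window}]))",
        "error_rate": f"sum(rate({errors}[{window}])) / sum(rate({base}[{window}]))",
        "latency_p99": latency,
    }

for name, expr in golden_triangle_queries("order").items():
    print(f"{name}: {expr}")
```

Each expression can be pasted directly into a Grafana panel backed by the Prometheus data source; templating the `service` label then turns one dashboard into a per‑service view.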

Alibaba Cloud ARMS – One‑Stop Observability (Open‑Source Equivalent)

ARMS integrates the same open‑source components:

Infrastructure monitoring via Prometheus.

Application monitoring with Java probes and trace collection (compatible with OpenTelemetry SDKs).

User‑experience monitoring for mobile, frontend, and synthetic tests.

Unified alerting and root‑cause analysis presented through Insight.

Grafana‑based visualization across all data sources.

Enterprises can replicate these capabilities by assembling the open‑source stack described above.

Design Guidelines for a Full‑Stack Observability System

1. Data Collection

Full‑stack coverage: Collect logs, traces, and metrics from the OS layer, container runtime, cloud services, and end‑user devices.

Unified standards: Adopt Prometheus for metrics, OpenTelemetry for traces, and Fluentd/Loki for logs.

Data quality: Define a common schema, de‑duplicate events, and configure adaptive sampling (e.g., sample 1% of normal traces and 100% of error traces).
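The sampling rule above (1% of normal traces, 100% of error traces) can be sketched as a head‑based sampling decision; the trace dict here is hypothetical, and real tracers such as OpenTelemetry expose this through their sampler interfaces:

```python
import random

def should_sample(trace: dict, normal_rate: float = 0.01) -> bool:
    """Keep every error trace and a small fraction of normal ones."""
    if trace.get("error"):
        return True  # always keep error traces
    return random.random() < normal_rate  # keep ~1% of normal traffic

# Simulate 10,000 traces where every 100th one carries an error.
traces = [{"id": i, "error": i % 100 == 0} for i in range(10_000)]
kept = [t for t in traces if should_sample(t)]
errors_kept = sum(t["error"] for t in kept)
print(f"kept {len(kept)} of {len(traces)} traces, {errors_kept} errors")
```

The asymmetry is the whole point: storage cost drops by roughly two orders of magnitude while the traces most likely to matter for root‑cause analysis are never discarded.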

2. Data Analysis

Horizontal correlation: Link micro‑service calls, third‑party APIs, and cloud services via trace IDs.

Vertical mapping: Map trace spans to the underlying container and host metrics.

Domain knowledge: Encode common troubleshooting paths (e.g., “high CPU → excessive INFO logging”) to accelerate root‑cause discovery.
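The horizontal and vertical steps above can be sketched as a single join: group spans by trace ID, then attach each span's host‑level metrics. All data structures here are hypothetical simplifications of what a tracing backend and Prometheus would return:

```python
from collections import defaultdict

def correlate(spans, host_metrics):
    """Group spans by trace_id (horizontal) and attach host metrics (vertical).

    spans: list of {"trace_id", "service", "host", "duration_ms"} dicts.
    host_metrics: {host_name: {"cpu_pct": ..., ...}} snapshot.
    """
    traces = defaultdict(list)
    for span in spans:
        enriched = dict(span, host_metrics=host_metrics.get(span["host"], {}))
        traces[span["trace_id"]].append(enriched)
    return dict(traces)

spans = [
    {"trace_id": "t1", "service": "order", "host": "node-1", "duration_ms": 850},
    {"trace_id": "t1", "service": "pay",   "host": "node-2", "duration_ms": 40},
]
hosts = {"node-1": {"cpu_pct": 96.0}, "node-2": {"cpu_pct": 12.0}}
result = correlate(spans, hosts)
# The slow "order" span now sits next to node-1's saturated CPU,
# which is exactly the vertical mapping an operator needs to see.
```

In a real platform this join happens at query time (Grafana's trace‑to‑metrics links, or an exemplar lookup) rather than in batch, but the key shape is the same: trace ID for the horizontal axis, host/pod labels for the vertical one.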

3. Value Output

Unified visualization: Use Grafana to present metrics, traces, and logs on a single dashboard, leveraging tags to filter by service, environment, or business domain.

Collaboration (ChatOps): Forward alerts to chat platforms (e.g., DingTalk, WeChat Work) for coordinated incident response.

Cloud‑service integration: Trigger auto‑scaling or load‑balancing actions directly from alert conditions.
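The collaboration step above amounts to reshaping an alert into the chat platform's webhook payload. A sketch for a DingTalk‑style text message — the payload shape follows DingTalk's custom‑robot webhook convention, and the alert fields are hypothetical:

```python
import json

def alert_to_dingtalk(alert: dict) -> str:
    """Render a (hypothetical) alert dict as a DingTalk custom-robot
    text-message payload, ready to POST to the robot's webhook URL."""
    content = (
        f"[{alert['severity'].upper()}] {alert['service']}: "
        f"{alert['summary']} (env={alert['env']})"
    )
    return json.dumps({"msgtype": "text", "text": {"content": content}})

payload = alert_to_dingtalk({
    "severity": "critical",
    "service": "order",
    "summary": "error rate > 5% for 5m",
    "env": "prod",
})
print(payload)
```

In practice this logic lives in Alertmanager's webhook receiver or a small relay service, so that every on‑call channel sees the same severity, service, and environment fields.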

Building a Unified Full‑Stack Dashboard

When constructing a comprehensive dashboard, organize panels by the following dimensions:

User experience: PV/UV, JavaScript error rate, First Contentful Paint, API success rate, and top‑N page performance.

Application performance: The golden triangle – request volume, error rate, latency – broken out per service.

Container layer: Pod CPU/memory usage, restart count, and deployment version.

Cloud services: For example, Kafka consumer lag and message throughput.

Host nodes: Node‑level CPU, memory, disk I/O, and running pod counts.

Prometheus can scrape cloud‑service metrics together with their tags (e.g., service=order, env=prod). With a global‑view setup (querying multiple Prometheus instances from Grafana at once), all layers can be observed through a single pane of glass.
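Absent a managed global‑view feature, the same idea can be approximated by querying each Prometheus instance separately and merging series that share a label set. A minimal merge sketch, where the sample shape is a simplified version of the /api/v1/query instant‑query response:

```python
def merge_instant_results(results):
    """Merge instant-query results from several Prometheus instances.

    Each element of `results` is a list of simplified samples:
    {"metric": {label: value, ...}, "value": float}. Series with identical
    label sets are summed, mimicking a cross-cluster sum() aggregation.
    """
    merged = {}
    for samples in results:
        for s in samples:
            key = tuple(sorted(s["metric"].items()))  # label set as dict key
            merged[key] = merged.get(key, 0.0) + s["value"]
    return [{"metric": dict(k), "value": v} for k, v in sorted(merged.items())]

# Hypothetical request-rate samples from two clusters' Prometheus instances.
cluster_a = [{"metric": {"service": "order", "env": "prod"}, "value": 120.0}]
cluster_b = [{"metric": {"service": "order", "env": "prod"}, "value": 80.0}]
print(merge_instant_results([cluster_a, cluster_b]))
```

Consistent labels are what make this merge (and the single pane of glass) possible: if one cluster labels the service `order` and another `order-svc`, the series never line up.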

Unified dashboard example
User‑experience metrics
Application performance metrics
Container pod metrics
Kafka consumer metrics
Host node metrics
Written by Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
