From Legacy Monitoring to Modern Observability: A Cloud‑Native Journey
This article traces the 30‑year evolution of system monitoring, explains the differences between monitoring, APM and observability, outlines key practices for building an observability platform, and provides a step‑by‑step guide to implementing Prometheus + Grafana in a cloud‑native environment.
Evolution of Observability
Observability has progressed through four major phases:
Late 1990s – Client–server monitoring: Simple host and network metrics (CPU, memory, network I/O) were collected by first-generation APM tools.
2000s – Application-level tracing: Browser–App–DB three-tier architectures and widespread Java adoption introduced code-level tracing and database tuning, leading to second-generation APM solutions.
2005–2010 – Distributed & virtualized environments: SOA/ESB architectures, virtual machines, and third-party components required full-link tracing and monitoring of virtual resources.
2010–present – Cloud-native microservices: Container orchestration and service meshes lengthen call paths, making fault isolation harder. Modern observability covers metrics, logs, traces, and events across the entire application lifecycle.
Monitoring vs. APM vs. Observability
Using an awareness‑understanding model, the three concepts map to distinct knowledge states:
Monitoring (known & understood): Collect concrete metrics such as CPU utilization.
APM (known but not understood): Add application-level tracing to explain why a metric spikes.
Observability (unknown & not understood): Correlate logs, traces, metrics, and events to uncover hidden root causes.
Key Pillars of an Observability Stack
The stack consists of three pillars: Logging , Tracing , and Metrics . Successful implementation requires:
Full-stack coverage: Capture data from infrastructure, containers, cloud services, and end-user devices.
Unified standards: Use Prometheus for metrics, OpenTelemetry (or OpenTracing) for traces, Fluentd/Loki for logs, and Grafana for visualization.
Data quality: Define schemas, filter noise, and apply sampling strategies (e.g., adaptive trace sampling) to ensure accurate analysis.
Observability Practice with Prometheus + Grafana
A typical open‑source observability platform combines:
Prometheus for metric collection from ECS, VPC, containers, and third‑party middleware.
Grafana for unified dashboards displaying the “golden triangle” (request volume, error rate, latency) and custom panels for user‑experience, application performance, container health, cloud services, and host nodes.
SkyWalking or Jaeger for distributed tracing.
ELK or Loki for log aggregation.
After adding data sources, teams can import prebuilt baseline dashboards (e.g., request volume, error rate) and then build unified dashboards that overlay infrastructure, container, application, and user-experience metrics for end-to-end performance monitoring.
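As a minimal sketch of the metrics side of this stack, the snippet below emits counters in the Prometheus text exposition format. It is illustrative only: the `Registry` class and metric names are assumptions for this example, and a real service would use an official client library (such as `prometheus_client`) and serve the output over HTTP for Prometheus to scrape.

```python
from collections import defaultdict

# Minimal, illustrative metrics registry that renders the Prometheus text
# exposition format. A production service would use an official client library.
class Registry:
    def __init__(self):
        self.counters = defaultdict(float)

    def inc(self, name, labels, value=1.0):
        # Identify a series by metric name plus its sorted label set.
        key = (name, tuple(sorted(labels.items())))
        self.counters[key] += value

    def render(self):
        # One line per series: metric{label="value",...} sample
        lines = []
        for (name, labels), value in sorted(self.counters.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        return "\n".join(lines)

registry = Registry()
registry.inc("http_requests_total", {"service": "order", "status": "200"})
registry.inc("http_requests_total", {"service": "order", "status": "200"})
registry.inc("http_requests_total", {"service": "order", "status": "500"})
print(registry.render())
```

Labeling every series with dimensions like `service` and `status` is what later lets Grafana slice the golden-triangle signals per service.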
Alibaba Cloud ARMS – One‑Stop Observability (Open‑Source Equivalent)
ARMS integrates the same open‑source components:
Infrastructure monitoring via Prometheus.
Application monitoring with Java probes and trace collection (compatible with OpenTelemetry SDKs).
User‑experience monitoring for mobile, frontend, and synthetic tests.
Unified alerting and root‑cause analysis presented through Insight.
Grafana‑based visualization across all data sources.
Enterprises can replicate these capabilities by assembling the open‑source stack described above.
Design Guidelines for a Full‑Stack Observability System
1. Data Collection
Full-stack coverage: Collect logs, traces, and metrics from the OS layer, container runtime, cloud services, and end-user devices.
Unified standards: Adopt Prometheus for metrics, OpenTelemetry for traces, and Fluentd/Loki for logs.
Data quality: Define a common schema, de-duplicate events, and configure adaptive sampling (e.g., sample 1% of normal traces and 100% of error traces).
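The sampling guideline above can be sketched as a simple head-based sampling decision. The function name and trace shape here are assumptions for illustration, not a real SDK API; in practice this decision would live in the tracing SDK or collector:

```python
import random

# Illustrative head-based sampler: keep 100% of error traces and roughly 1%
# of normal traces, mirroring the rates suggested above.
def should_sample(trace, normal_rate=0.01):
    if trace.get("error"):
        return True  # error traces are always retained
    return random.random() < normal_rate

# Simulate 10,000 traces in which every 100th one carries an error.
traces = [{"id": i, "error": (i % 100 == 0)} for i in range(10_000)]
kept = [t for t in traces if should_sample(t)]
errors_kept = sum(t["error"] for t in kept)
```

Every error trace survives sampling, while the normal-trace volume drops by about two orders of magnitude, which is what keeps storage costs bounded without losing root-cause evidence.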
2. Data Analysis
Horizontal correlation: Link microservice calls, third-party APIs, and cloud services via trace IDs.
Vertical mapping: Map trace spans to underlying container and host metrics.
Domain knowledge: Encode common troubleshooting paths (e.g., "high CPU → excessive INFO logs") to accelerate root-cause discovery.
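Horizontal and vertical correlation can be sketched as a join over the shared trace ID: spans are linked to log lines carrying the same `trace_id` (horizontal), then mapped onto the metrics of the host that emitted them (vertical). The data shapes below are invented for illustration:

```python
# Illustrative in-memory join; real systems would query trace, log, and
# metric backends rather than Python lists.
spans = [
    {"trace_id": "t1", "service": "order", "host": "node-1", "latency_ms": 480},
    {"trace_id": "t1", "service": "payment", "host": "node-2", "latency_ms": 35},
]
logs = [
    {"trace_id": "t1", "host": "node-1", "level": "INFO", "msg": "slow SQL query"},
]
host_metrics = {"node-1": {"cpu_pct": 93}, "node-2": {"cpu_pct": 20}}

def correlate(trace_id):
    """Return spans enriched with matching logs (horizontal) and the
    metrics of their host (vertical)."""
    result = []
    for span in spans:
        if span["trace_id"] != trace_id:
            continue
        result.append({
            **span,
            "logs": [l for l in logs
                     if l["trace_id"] == trace_id and l["host"] == span["host"]],
            "host_metrics": host_metrics.get(span["host"], {}),
        })
    return result

view = correlate("t1")
```

Here the slow `order` span lines up with both a suspicious log line and a 93% CPU host, which is exactly the kind of cross-signal evidence that shortens root-cause discovery.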
3. Value Output
Unified visualization: Use Grafana to present metrics, traces, and logs on a single dashboard, using tags to filter by service, environment, or business domain.
Collaboration (ChatOps): Forward alerts to chat platforms (e.g., DingTalk, WeChat Work) for coordinated incident response.
Cloud-service integration: Trigger auto-scaling or load-balancing actions directly from alert conditions.
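As a sketch of the ChatOps step, an alert can be rendered into a DingTalk-style custom-robot markdown payload before being POSTed to the robot's webhook URL. The alert fields here are assumptions for illustration; the actual HTTP call and webhook address are omitted:

```python
import json

# Illustrative ChatOps formatter: turn an alert into a DingTalk-style
# custom-robot markdown message (msgtype/markdown payload shape).
def format_alert(alert):
    text = (f"**{alert['severity'].upper()}**: {alert['summary']}\n\n"
            f"- service: {alert['service']}\n"
            f"- env: {alert['env']}")
    return json.dumps({
        "msgtype": "markdown",
        "markdown": {"title": alert["summary"], "text": text},
    })

payload = format_alert({
    "severity": "critical",
    "summary": "Error rate above 5%",
    "service": "order",
    "env": "prod",
})
```

Routing the same structured alert to chat, ticketing, and auto-scaling hooks is what turns an alert condition into a coordinated response rather than an isolated page.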
Building a Unified Full‑Stack Dashboard
When constructing a comprehensive dashboard, organize panels by the following dimensions:
User experience: PV/UV, JavaScript error rate, First Contentful Paint, API success rate, Top-N page performance.
Application performance: Golden triangle (request volume, error rate, latency), broken out per service.
Container layer: Pod CPU/memory usage, restart count, deployment version.
Cloud services: Example – Kafka consumer lag, message throughput.
Host nodes: Node-level CPU, memory, disk I/O, and running pod counts.
Prometheus can scrape cloud-service metrics together with their tags (e.g., service=order, env=prod). Using a global view that federates multiple Prometheus instances, Grafana can query all of them simultaneously, providing a single pane of glass across every layer.
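A global view can be approximated by querying each Prometheus instance's HTTP API and merging the returned series by label set. The sketch below assumes results already fetched in the `/api/v1/query` instant-query shape and resolves duplicate series by keeping the most recent sample; the data values are invented:

```python
# Illustrative merge of instant-query results from several Prometheus
# instances, deduplicating series by their label set.
def merge_results(*result_sets):
    merged = {}
    for results in result_sets:
        for series in results:
            key = tuple(sorted(series["metric"].items()))
            # For duplicate series, keep the sample with the newest timestamp.
            if key not in merged or series["value"][0] > merged[key]["value"][0]:
                merged[key] = series
    return list(merged.values())

prod = [{"metric": {"service": "order", "env": "prod"},
         "value": [1700000000, "42"]}]
edge = [{"metric": {"service": "order", "env": "prod"},
         "value": [1700000050, "45"]},
        {"metric": {"service": "pay", "env": "prod"},
         "value": [1700000050, "7"]}]
view = merge_results(prod, edge)
```

The overlapping `service=order` series collapses to its newest sample while the `service=pay` series is carried through, so the merged view behaves like one logical Prometheus.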
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
