Master Observability: From Metrics to Tracing with Prometheus, Grafana & OpenTelemetry
This comprehensive guide explains observability concepts, compares monitoring and observability, details metrics, logs, and tracing pillars, walks through Prometheus and Grafana setup, explores OpenTelemetry integration, and provides a Spring Boot example with full configuration and code snippets.
Observability Basics
Observability is the ability to understand a system’s internal state by examining its outputs. Its three pillars are metrics (numeric measurements recorded over time), logs (detailed records of individual events), and traces (the paths requests take across services). Building observability requires log aggregation, metric aggregation, and distributed tracing, complemented by semantic monitoring, alerting, and testing in production.
Monitoring vs. Observability
Monitoring focuses on generating alerts from collected data to detect problems, while observability emphasizes exploration and insight. Over‑alerting leads to alert fatigue; effective alerts must be relevant, non‑duplicate, timely, prioritized, and actionable.
Semantic Monitoring
Semantic monitoring checks whether the system performs the correct business outcomes (e.g., successful user registration or order placement) rather than merely tracking server metrics. It relies on business‑level signals and Service Level Objectives (SLOs).
Log Aggregation
In microservice environments, logs are collected centrally via agents that forward local log files to a central store. A consistent log format and correlation IDs are essential for stitching together request flows across services. Log re‑formatting before forwarding is discouraged due to CPU overhead.
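The role of a correlation ID can be sketched in a few lines of Java. This is a minimal illustration, not a logging framework: the class name CorrelatedLog, the key=value line layout, and the field names are all assumptions made for the example.

```java
import java.util.UUID;

// Hypothetical helper, not a real logging library: it writes every line in the
// same key=value layout so a central store can group lines by correlation ID.
class CorrelatedLog {
    // Reuse the caller's ID when one arrived with the request; otherwise this
    // service is the entry point and starts a new ID.
    static String ensureCorrelationId(String incoming) {
        return (incoming == null || incoming.isBlank())
                ? UUID.randomUUID().toString()
                : incoming;
    }

    static String format(String service, String correlationId, String message) {
        return "service=" + service
                + " correlation_id=" + correlationId
                + " msg=\"" + message + "\"";
    }

    public static void main(String[] args) {
        String id = ensureCorrelationId(null); // generated at the first service
        System.out.println(format("gateway", id, "request received"));
        // A downstream service receives the same ID (for example in a header).
        System.out.println(format("orders", ensureCorrelationId(id), "order stored"));
    }
}
```

Searching the central store for one correlation_id value then returns the lines from every service that handled that request.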
Because timestamps from different machines are not perfectly synchronized, logs alone cannot reliably establish global ordering; distributed tracing is preferred for precise timing and causal analysis.
Metric Aggregation
Metric aggregation collects data from many services and machines, enabling analysis of normal behavior and detection of anomalies. High‑cardinality metrics (e.g., metrics labelled with per‑user IDs) produce a separate time series for every distinct label value and therefore require specialized storage solutions, whereas low‑cardinality metrics (CPU usage, request counts) fit traditional tools.
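A rough way to see why cardinality matters is to multiply the number of distinct values each label can take. The sketch below does only that arithmetic; the label sizes are made-up assumptions, and real systems usually see fewer series than this upper bound because not every combination occurs.

```java
class Cardinality {
    // Upper bound on the number of time series for one metric:
    // the product of the distinct values each label can take.
    static long maxSeries(long... distinctValuesPerLabel) {
        long total = 1;
        for (long n : distinctValuesPerLabel) {
            total = Math.multiplyExact(total, n); // throws on overflow instead of wrapping
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed label sizes: 5 HTTP methods, 4 status classes, 20 endpoints.
        System.out.println(maxSeries(5, 4, 20));
        // The same metric with an assumed 100,000 distinct user IDs added as a label.
        System.out.println(maxSeries(5, 4, 20, 100_000));
    }
}
```

With these assumed numbers the first metric stays at 400 series, while adding the user ID label raises the bound to 40,000,000.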
Distributed Tracing
Distributed tracing records each request’s journey across services as a series of spans, each with start/end timestamps and context metadata. Sampling reduces overhead, and tools like OpenTelemetry provide standard instrumentation and export mechanisms.
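The span structure and the idea of sampling can be illustrated without any tracing library. The following is a simplified sketch, not the OpenTelemetry data model or its sampler: the class and field names are hypothetical, and the hash-based decision only imitates the idea of deciding from the trace ID.

```java
import java.util.ArrayList;
import java.util.List;

class TraceSketch {
    // One unit of work inside a trace; field names are illustrative.
    static final class Span {
        final String traceId;
        final String spanId;
        final String parentSpanId; // null for the root span
        final String name;
        final long startNanos;
        final long endNanos;

        Span(String traceId, String spanId, String parentSpanId,
             String name, long startNanos, long endNanos) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.name = name;
            this.startNanos = startNanos;
            this.endNanos = endNanos;
        }

        long durationNanos() {
            return endNanos - startNanos;
        }
    }

    // Direct children of a span, found through the parent link.
    static List<Span> childrenOf(List<Span> spans, String parentSpanId) {
        List<Span> children = new ArrayList<>();
        for (Span s : spans) {
            if (parentSpanId.equals(s.parentSpanId)) {
                children.add(s);
            }
        }
        return children;
    }

    // Simplified head sampling: the decision depends only on the trace ID, so
    // every service reaches the same verdict for the same request.
    static boolean sampled(String traceId, double ratio) {
        int bucket = Math.floorMod(traceId.hashCode(), 10_000);
        return bucket < ratio * 10_000;
    }

    public static void main(String[] args) {
        List<Span> trace = List.of(
                new Span("t1", "a", null, "POST /persons", 0, 900),
                new Span("t1", "b", "a", "repository.save", 100, 700));
        System.out.println(childrenOf(trace, "a").get(0).name);
        System.out.println(sampled("t1", 0.10));
    }
}
```

Because the sampling decision is a pure function of the trace ID, a trace is either kept in full or dropped in full, which avoids traces with missing spans.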
SLA, SLO, SLI, and Error Budget
Service Level Agreements (SLAs) are external commitments, while Service Level Objectives (SLOs) are internal targets (e.g., 99.9% success). Service Level Indicators (SLIs) are the measured values, such as the fraction of successful requests, that are compared against those targets. An error budget quantifies allowable failure as the gap between the SLO and 100%; staying within the budget permits changes, while exhausting it forces a focus on reliability.
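The error budget arithmetic is simple enough to work through directly. The sketch below uses the 99.9% target from the text; the request counts, the 30-day window, and the class and method names are assumptions chosen for illustration.

```java
class ErrorBudget {
    // Fraction of requests (or time) that may fail without breaking the SLO.
    static double budgetFraction(double slo) {
        return 1.0 - slo;
    }

    static double allowedFailures(double slo, long totalRequests) {
        return budgetFraction(slo) * totalRequests;
    }

    // Share of the budget still unspent; a negative value means it is exhausted.
    static double remainingFraction(double slo, long totalRequests, long failedRequests) {
        double allowed = allowedFailures(slo, totalRequests);
        return (allowed - failedRequests) / allowed;
    }

    static double allowedDowntimeMinutes(double slo, int windowDays) {
        return budgetFraction(slo) * windowDays * 24 * 60;
    }

    public static void main(String[] args) {
        double slo = 0.999; // the 99.9% target used as an example in the text
        System.out.printf("allowed failures: %.0f%n", allowedFailures(slo, 1_000_000));
        System.out.printf("budget left: %.0f%%%n", 100 * remainingFraction(slo, 1_000_000, 250));
        System.out.printf("downtime budget: %.1f minutes%n", allowedDowntimeMinutes(slo, 30));
    }
}
```

For a 99.9% SLO, one million requests allow about 1,000 failures, and a 30-day window allows about 43.2 minutes of downtime. With 250 failures recorded, roughly three quarters of the budget remains.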
Prometheus Overview
Prometheus is a pull‑based, metrics‑focused monitoring system with a built‑in time‑series database (TSDB). It finds targets through static configuration or service discovery, scrapes their metrics over HTTP, and a single server can ingest hundreds of thousands of samples per second, depending on hardware. Configuration lives in prometheus.yml, which defines the scrape jobs and scrape intervals.
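In the pull model, the application only has to expose its current values over HTTP and Prometheus fetches them on its own schedule. The sketch below shows this with the JDK's built-in HTTP server and the Prometheus text exposition format. The metric name demo_requests_total, the port 8000, and the class name are assumptions; a real service would more likely use a Prometheus client library or Micrometer than hand-written output.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

class MetricsEndpoint {
    // Hypothetical application counter, incremented wherever requests are handled.
    static final AtomicLong requests = new AtomicLong();

    // Renders one counter in the Prometheus text exposition format.
    static String render(long count) {
        return "# HELP demo_requests_total Requests handled by the demo app.\n"
                + "# TYPE demo_requests_total counter\n"
                + "demo_requests_total " + count + "\n";
    }

    // Serves the current value on /metrics; Prometheus pulls it at each scrape.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render(requests.get()).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        requests.addAndGet(3);
        start(8000); // assumed port; the scrape target would be localhost:8000
    }
}
```

A scrape job in prometheus.yml that targets localhost:8000 would then collect this counter at every scrape interval.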
PromQL (Prometheus Query Language) enables powerful querying, filtering, and aggregation of metrics for debugging, trend analysis, and alert rule creation.
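A common PromQL pattern is turning a counter into a per-second rate, for example rate(http_requests_total[5m]), where http_requests_total is a hypothetical metric name. The Java sketch below approximates what such a query computes from raw samples, including the handling of counter restarts. It is a simplification under stated assumptions: it ignores the extrapolation to the window boundaries that the real rate() function applies, and the sample values are invented.

```java
class CounterRate {
    // Total increase of a counter over its samples. A drop in value is treated
    // as a restart from zero, so the new value counts as growth.
    static double increase(double[] values) {
        double total = 0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            total += (delta >= 0) ? delta : values[i];
        }
        return total;
    }

    // Average per-second rate between the first and the last sample.
    static double ratePerSecond(double[] values, double[] timestampsSeconds) {
        double elapsed = timestampsSeconds[timestampsSeconds.length - 1] - timestampsSeconds[0];
        return increase(values) / elapsed;
    }

    public static void main(String[] args) {
        double[] values = {10, 15, 3, 8}; // the counter restarted before the third sample
        double[] times = {0, 15, 30, 45}; // assumed 15-second scrape interval
        System.out.println(increase(values));
        System.out.println(ratePerSecond(values, times));
    }
}
```

Treating a decrease as a restart is what makes counters safe to use across process restarts: the increase here is 5 + 3 + 5 = 13 rather than the misleading 8 - 10 = -2.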
Grafana Integration
Grafana visualizes data from Prometheus, Elasticsearch, and other sources via customizable dashboards. It supports alerting based on threshold breaches. After installing Grafana, add Prometheus as a data source (URL http://localhost:9090) and create dashboards using panels such as Time series (called Graph in older Grafana versions).
Node Exporter
Node Exporter exposes host‑level metrics (CPU, memory, disk) for Unix systems. Install it from the Prometheus website and add a job in prometheus.yml targeting localhost:9100.
OpenTelemetry Comparison
OpenTelemetry is a vendor‑neutral observability framework for generating and exporting metrics, traces, and logs, with a plug‑in architecture for various back‑ends. Unlike Prometheus, which focuses on metrics and stores them itself, OpenTelemetry collects all three signals but does not store them; it hands the data to exporters for back‑ends such as Prometheus or Jaeger.
Sample Spring Boot Application with OpenTelemetry
The following example demonstrates a simple CRUD service using Spring Boot, an in‑memory H2 database, and OpenTelemetry instrumentation. The application.yml configures the data source, points the OpenTelemetry exporter at Jaeger, and disables the default metrics and logs collection.
spring:
  application:
    name: openTelemetry
  datasource:
    url: jdbc:h2:mem:testdb
    driver-class-name: org.h2.Driver
    username: luispiquinrey
    password:
  jpa:
    hibernate:
      ddl-auto: create-drop
    show-sql: true
otel:
  exporter:
    jaeger:
      endpoint: http://localhost:14268/api/traces
  metrics:
    enabled: false
  logs:
    enabled: false
  resource:
    attributes:
      service:
        name: openTelemetry
server:
  port: 8085
The controller (Java) uses OpenTelemetry’s Tracer and Meter to create a span for each request and a counter of created persons.
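import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/persons")
public class PersonController {
    private final PersonRepository repository;
    private final Tracer tracer;
    private final LongCounter createCounter;

    public PersonController(PersonRepository repository, @Qualifier("openTelemetry") OpenTelemetry openTelemetry) {
        this.repository = repository;
        this.tracer = openTelemetry.getTracer("person-controller");
        Meter meter = openTelemetry.getMeter("person-controller");
        this.createCounter = meter.counterBuilder("person.create.count")
                .setDescription("Number of persons created")
                .build();
    }

    @PostMapping
    public Person create(@RequestBody Person person) {
        Span span = tracer.spanBuilder("create-person").startSpan();
        // Make the span current so nested instrumentation attaches to it.
        try (Scope scope = span.makeCurrent()) {
            Person saved = repository.save(person);
            createCounter.add(1);
            return saved;
        } catch (RuntimeException e) {
            // Mark the span as failed before rethrowing.
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
    // Additional CRUD methods (GET, PUT, DELETE) similarly instrumented with spans
}
Putting It All Together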
Deploy Prometheus, Node Exporter, and Grafana; configure Prometheus to scrape Node Exporter and any application exporters. Use Grafana dashboards to visualize metrics, and Jaeger (or Elastic Observability) to view traces. Combined with the centralized log aggregation described earlier, this stack covers all three pillars: metrics, logs, and traces.
