
Master Observability: From Metrics to Tracing with Prometheus, Grafana & OpenTelemetry

This comprehensive guide explains observability concepts, compares monitoring and observability, details metrics, logs, and tracing pillars, walks through Prometheus and Grafana setup, explores OpenTelemetry integration, and provides a Spring Boot example with full configuration and code snippets.

DevOps Coach

Observability Basics

Observability is the ability to understand a system’s internal state by examining its outputs. Its three pillars are metrics (numeric measurements recorded over time), logs (detailed records of individual events), and traces (the paths requests take across services). Building observability requires log aggregation, metric aggregation, and distributed tracing, complemented by semantic monitoring, alerting, and testing in production.

Monitoring vs. Observability

Monitoring focuses on generating alerts from collected data to detect problems, while observability emphasizes exploration and insight. Over‑alerting leads to alert fatigue; effective alerts must be relevant, non‑duplicate, timely, prioritized, and actionable.

Semantic Monitoring

Semantic monitoring checks whether the system performs the correct business outcomes (e.g., successful user registration or order placement) rather than merely tracking server metrics. It relies on business‑level signals and Service Level Objectives (SLOs).

Log Aggregation

In microservice environments, logs are collected centrally: an agent on each host forwards local log files to a central store. A consistent log format and correlation IDs are essential for stitching together request flows across services. Re‑formatting logs in the forwarding agent is discouraged because it costs CPU on every host; it is cheaper to have each service emit the agreed format in the first place.
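
To make the correlation ID idea concrete, here is a minimal sketch in plain Java. It assumes the ID travels between services in a request header; the header name X-Correlation-ID, the class name, and the key=value log layout are illustrative choices, not something the original setup prescribes.

```java
import java.time.Instant;
import java.util.UUID;

final class CorrelationLog {
    // Hypothetical header name used to pass the ID between services.
    static final String HEADER = "X-Correlation-ID";

    /** Reuses the caller's ID when present, otherwise starts a new one at the edge. */
    static String resolveCorrelationId(String incomingHeaderValue) {
        if (incomingHeaderValue == null || incomingHeaderValue.isBlank()) {
            return UUID.randomUUID().toString();
        }
        return incomingHeaderValue.trim();
    }

    /** Formats one log line in a fixed key=value layout shared by every service. */
    static String formatLine(Instant timestamp, String service, String correlationId,
                             String level, String message) {
        return String.format("%s service=%s correlation_id=%s level=%s msg=\"%s\"",
                timestamp, service, correlationId, level, message);
    }
}
```

Each service would call resolveCorrelationId once per incoming request, include the result in every log line, and send it on in the same header when calling downstream services. Searching the central store for one correlation_id then returns the whole request flow.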

Because timestamps from different machines are not perfectly synchronized, logs alone cannot reliably establish global ordering; distributed tracing is preferred for precise timing and causal analysis.

Metric Aggregation

Metric aggregation collects data from many services and machines, enabling analysis of normal behavior and detection of anomalies. High‑cardinality metrics (e.g., metrics labeled with per‑user IDs) multiply the number of time series and require specialized storage solutions, whereas low‑cardinality metrics (CPU usage, request counts) fit traditional tools.
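
The cost of cardinality is easiest to see with a little arithmetic: in the worst case, one metric produces a separate time series for every combination of label values. The sketch below computes that upper bound; the label names and counts in the usage note are made-up numbers for illustration.

```java
import java.util.Map;

final class SeriesCardinality {
    /**
     * Worst-case number of time series for one metric: the product of the
     * number of distinct values each label can take.
     */
    static long worstCaseSeries(Map<String, Integer> distinctValuesPerLabel) {
        long series = 1;
        for (int count : distinctValuesPerLabel.values()) {
            // multiplyExact throws instead of silently overflowing
            series = Math.multiplyExact(series, (long) count);
        }
        return series;
    }
}
```

With 5 HTTP methods, 10 status codes, and 20 instances the bound is 1,000 series. Adding a user_id label with 100,000 distinct values raises it to 100,000,000, which is why per-user labels usually do not belong in a traditional metrics store.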

Distributed Tracing

Distributed tracing records each request’s journey across services as a series of spans, each with start/end timestamps and context metadata. Sampling reduces overhead, and tools like OpenTelemetry provide standard instrumentation and export mechanisms.
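
Sampling only helps if every service makes the same keep-or-drop decision for a given trace, otherwise traces end up with missing spans. One common approach is to derive the decision from the trace ID itself. The following is a simplified sketch of that idea in plain Java, assuming 32-character hexadecimal trace IDs; it is written for illustration and is not OpenTelemetry's own sampler implementation.

```java
final class RatioSampler {
    private final double ratio;
    private final long upperBound;

    RatioSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]");
        }
        this.ratio = ratio;
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /** Deterministic decision: the same trace ID always yields the same answer. */
    boolean shouldSample(String traceIdHex) {
        if (traceIdHex == null || traceIdHex.length() != 32) {
            throw new IllegalArgumentException("expected a 32-character hex trace ID");
        }
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // Use the lower 64 bits of the ID and clear the sign bit to get a non-negative value.
        long value = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        return value < upperBound;
    }
}
```

Because the decision depends only on the trace ID and the configured ratio, services that share the same ratio keep or drop the same traces without coordinating with each other.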

SLA, SLO, SLI, and Error Budget

Service Level Agreements (SLAs) are external commitments to customers, while Service Level Objectives (SLOs) are internal targets (e.g., 99.9% of requests succeed). Service Level Indicators (SLIs) are the measurements those targets are evaluated against, such as the observed success rate. An error budget is the amount of failure the SLO allows: a 99.9% SLO leaves a budget of 0.1%. While the budget lasts, teams can keep shipping changes; once it is exhausted, the focus shifts to reliability.
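
The error budget arithmetic can be sketched in a few lines of Java. The class and method names are illustrative, and the sketch assumes a request-based SLO (good requests divided by total requests) measured over a fixed window.

```java
final class ErrorBudget {
    /** Fraction of requests allowed to fail, e.g. an SLO of 0.999 leaves about 0.001. */
    static double budgetFraction(double sloTarget) {
        return 1.0 - sloTarget;
    }

    /** Number of failed requests the SLO tolerates for the given traffic volume. */
    static long allowedFailures(long totalRequests, double sloTarget) {
        return Math.round(totalRequests * budgetFraction(sloTarget));
    }

    /**
     * Share of the budget still unspent: 1.0 means untouched, 0.0 means exhausted,
     * and a negative value means the SLO has been violated.
     */
    static double remainingBudget(long totalRequests, long failedRequests, double sloTarget) {
        long allowed = allowedFailures(totalRequests, sloTarget);
        if (allowed == 0) {
            return failedRequests == 0 ? 1.0 : Double.NEGATIVE_INFINITY;
        }
        return 1.0 - (double) failedRequests / allowed;
    }
}
```

For one million requests under a 99.9% SLO, allowedFailures returns 1,000. After 250 failures, remainingBudget reports 0.75, meaning three quarters of the budget is still available for risky changes.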

Prometheus Overview

Prometheus is a pull‑based, metrics‑focused monitoring system with a built‑in time‑series database (TSDB). It finds targets through service discovery or static configuration, scrapes their metrics over HTTP at a regular interval, and can ingest a high volume of samples on a single server. Configuration resides in prometheus.yml, where scrape jobs and scrape intervals are defined.
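
What Prometheus pulls on each scrape is plain text. The sketch below renders a single counter in the Prometheus text exposition format to show what a target serves on its metrics endpoint. It is for illustration only: in a real service a client library generates this output, and the metric and label names in the usage note are hypothetical.

```java
import java.util.Map;
import java.util.StringJoiner;

final class ExpositionSketch {
    /** Renders one counter sample with HELP and TYPE lines in the text exposition format. */
    static String renderCounter(String name, String help, Map<String, String> labels, long value) {
        StringBuilder out = new StringBuilder();
        out.append("# HELP ").append(name).append(' ').append(help).append('\n');
        out.append("# TYPE ").append(name).append(" counter\n");
        out.append(name);
        if (!labels.isEmpty()) {
            StringJoiner joined = new StringJoiner(",", "{", "}");
            labels.forEach((key, val) -> joined.add(key + "=\"" + escapeLabelValue(val) + "\""));
            out.append(joined);
        }
        out.append(' ').append(value).append('\n');
        return out.toString();
    }

    /** Label values escape backslash, double quote, and newline. */
    static String escapeLabelValue(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

Calling renderCounter with the name http_requests_total, the labels method=GET and status=200 (passed in an insertion-ordered map), and the value 42 produces a HELP line, a TYPE line, and the sample line http_requests_total{method="GET",status="200"} 42. Prometheus parses that text on every scrape and appends the sample to the matching time series.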

PromQL (Prometheus Query Language) enables powerful querying, filtering, and aggregation of metrics for debugging, trend analysis, and alert rule creation.
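
PromQL queries can also be run programmatically through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The following sketch only builds the request URL. The metric name http_requests_total in the usage note is an assumed example and will differ in your setup.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PromQueryUrl {
    /** Builds the URL for an instant query against the Prometheus HTTP API. */
    static String instantQueryUrl(String baseUrl, String promql) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        // PromQL contains brackets, parentheses, and spaces, so it must be URL-encoded.
        return base + "/api/v1/query?query=" + URLEncoder.encode(promql, StandardCharsets.UTF_8);
    }
}
```

For example, instantQueryUrl("http://localhost:9090", "sum(rate(http_requests_total[5m])) by (job)") yields a URL that returns the per-job request rate over the last five minutes as JSON. The same expression can be pasted into the Prometheus expression browser or used in an alerting rule.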

Grafana Integration

Grafana visualizes data from Prometheus, Elasticsearch, and other sources via customizable dashboards, and it can raise alerts when a query result crosses a threshold. After installing Grafana, add Prometheus as a data source (URL http://localhost:9090) and build dashboards from panels such as Time series, which replaces the older Graph panel in current Grafana versions.

Node Exporter

Node Exporter exposes host‑level metrics (CPU, memory, disk) for Unix‑like systems. Download it from the Prometheus website, run it on each host, and add a scrape job in prometheus.yml targeting localhost:9100, its default port.

OpenTelemetry Comparison

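OpenTelemetry is a vendor‑neutral observability framework that supports metrics, traces, and logs, with a plug‑in architecture for exporting to various back‑ends. Unlike Prometheus, which focuses on metrics and includes its own storage and query language, OpenTelemetry collects all three signals but does not store or visualize them itself; it hands them to back‑ends through exporters. The two are therefore complementary rather than competing.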

Sample Spring Boot Application with OpenTelemetry

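The following example demonstrates a simple CRUD service using Spring Boot, an in‑memory H2 database, and OpenTelemetry instrumentation. The application.yml configures the data source, points the OpenTelemetry exporter at a local Jaeger collector, and disables metrics and logs export so that only traces are sent. Recent OpenTelemetry releases deprecate the dedicated Jaeger exporter in favor of OTLP, which Jaeger accepts natively, so the exporter properties below may need adjusting for your OpenTelemetry version.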

spring:
  application:
    name: openTelemetry
  datasource:
    url: jdbc:h2:mem:testdb
    driver-class-name: org.h2.Driver
    username: luispiquinrey
    password:
  jpa:
    hibernate:
      ddl-auto: create-drop
    show-sql: true
otel:
  exporter:
    jaeger:
      endpoint: http://localhost:14268/api/traces
  metrics:
    enabled: false
  logs:
    enabled: false
  resource:
    attributes:
      service:
        name: openTelemetry
server:
  port: 8085

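The controller (Java) obtains a Tracer and a Meter from the injected OpenTelemetry instance, wraps the create operation in a span, and increments a counter for each person created. Because the configuration above disables metrics, the counter will not be exported until metrics are enabled.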

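import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/persons")
public class PersonController {
    private final PersonRepository repository;
    private final Tracer tracer;
    private final LongCounter createCounter;

    public PersonController(PersonRepository repository, @Qualifier("openTelemetry") OpenTelemetry openTelemetry) {
        this.repository = repository;
        this.tracer = openTelemetry.getTracer("person-controller");
        Meter meter = openTelemetry.getMeter("person-controller");
        // Counter incremented once per successfully created person
        this.createCounter = meter.counterBuilder("person.create.count")
            .setDescription("Number of persons created")
            .build();
    }

    @PostMapping
    public Person create(@RequestBody Person person) {
        Span span = tracer.spanBuilder("create-person").startSpan();
        // Make the span current so spans started during save() become its children
        try (Scope ignored = span.makeCurrent()) {
            Person saved = repository.save(person);
            createCounter.add(1);
            return saved;
        } catch (RuntimeException e) {
            // Mark the span as failed so the error is visible in the trace
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
    // Additional CRUD methods (GET, PUT, DELETE) similarly instrumented with spans
}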

Putting It All Together

Deploy Prometheus, Node Exporter, and Grafana; configure Prometheus to scrape Node Exporter and any application exporters. Use Grafana dashboards to visualize metrics, and Jaeger (or Elastic Observability) to view traces. This stack covers metrics and traces; to complete the three pillars, add a log aggregation back‑end such as Elasticsearch, which Grafana can also query.
