Prometheus Overview: Architecture, Metrics, Data Collection, and Storage
This article provides a comprehensive overview of Prometheus, an open‑source monitoring and alerting system, covering its origins, key features, architecture, core components, metric types, data collection methods, service discovery, storage options, and query capabilities.
Prometheus
Introduction
Prometheus is an open‑source monitoring and alerting system originally developed at SoundCloud. It is written in Go and was inspired by Google’s internal Borgmon monitoring system. In 2016 it joined the Cloud Native Computing Foundation (CNCF) under the Linux Foundation as the foundation’s second hosted project, after Kubernetes.
Features
Multi‑dimensional data model
Flexible query language
Support for both local and remote storage
Open metric data standard
HTTP Pull‑based data collection
Static file and dynamic discovery mechanisms
Easy maintenance
Support for data sharding, sampling, and federation deployments
Architecture Design
Core Components
Server: periodically scrapes metrics from targets.
Target: exposes an HTTP endpoint for the Server to scrape.
AlertManager: receives alerts from the Server and handles notification routing.
Grafana: visualizes monitoring data.
Exporters: expose metrics of third‑party services to Prometheus.
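To make the Target/Exporter role concrete, here is a minimal sketch of a /metrics endpoint using only the Python standard library; the metric name app_requests_total and port 9100 are illustrative placeholders, not part of any real exporter.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(samples):
    """Render (name, labels, value) triples in the text exposition format."""
    lines = []
    for name, labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Illustrative hard-coded sample; a real exporter reads live state.
            body = render_metrics(
                [("app_requests_total", {"method": "GET"}, 42)]
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

# To expose the endpoint:
# HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

The Server would then scrape http://host:9100/metrics on its configured interval.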
Monitoring Metrics
Metric Definition
<metric name>{<label name>=<label value>, ...}
Metric name: must consist of letters, digits, underscores, or colons and match the regex [a-zA-Z_:][a-zA-Z0-9_:]*; colons are reserved for user‑defined recording rules and should not be used by exporters.
Label: key‑value pair that adds dimensionality for filtering and aggregation.
Example
http_request_total{status="200",method="POST"}
{__name__="http_request_total",status="200",method="POST"}
Both forms denote the same series; label names beginning with a double underscore (__) are reserved for internal use.
Metric name http_request_total counts total HTTP requests.
Label status="200" indicates HTTP status code 200.
Label method="POST" indicates the request method.
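The naming rules above can be checked with the regular expressions Prometheus documents; this is a small validation sketch, not Prometheus’s own code.

```python
import re

# Metric names: letters, digits, underscores, colons; must not start with a digit.
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
# Label names: same, minus colons; "__"-prefixed names are reserved internally.
LABEL_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def is_valid_metric_name(name):
    return bool(METRIC_NAME_RE.match(name))

def is_valid_label_name(name):
    return bool(LABEL_NAME_RE.match(name)) and not name.startswith("__")
```

For example, http_request_total and status pass, while a name starting with a digit or a "__"-prefixed label does not.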
Metric Types
Counter
Monotonically increasing values (e.g., request counts, uptime).
Reset to zero when the process restarts; functions such as rate() detect resets and compute the per‑second rate of increase.
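As a sketch of the math behind rate()-style functions, the following assumes samples as (timestamp_seconds, value) pairs; the reset handling mirrors the idea that a drop in a counter’s value signals a restart.

```python
def counter_increase(prev, curr):
    # A counter reset (e.g. a process restart) shows up as curr < prev;
    # in that case the new value itself is the increase since the reset.
    return curr if curr < prev else curr - prev

def per_second_rate(samples):
    """Average per-second increase over chronological (t, value) samples."""
    total = sum(counter_increase(v0, v1)
                for (_, v0), (_, v1) in zip(samples, samples[1:]))
    elapsed = samples[-1][0] - samples[0][0]
    return total / elapsed
```

With samples [(0, 100), (15, 130), (30, 10)], the drop from 130 to 10 is treated as a reset, so the total increase is 30 + 10 = 40 over 30 seconds.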
Gauge
Values that can go up or down (e.g., CPU or memory usage).
Most real‑time monitoring data are Gauges.
Summary
Provides quantile information for a distribution (e.g., request latency).
Quantiles are pre‑computed on the client side, unlike Histograms, whose quantiles are derived at query time.
More CPU‑intensive on the client than a Gauge; pre‑computed quantiles cannot be aggregated across instances.
Histogram
Counts samples into configurable buckets; the le label gives each bucket’s inclusive upper bound, and bucket counts are cumulative.
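A toy illustration of cumulative le buckets, assuming arbitrary bucket bounds; real histograms are maintained incrementally by client libraries rather than recomputed like this.

```python
def bucket_counts(observations, upper_bounds):
    """Cumulative counts per le bucket, as in the exposition format.

    Each bucket counts every observation <= its upper bound, so counts
    grow monotonically; the implicit +Inf bucket catches everything.
    """
    bounds = sorted(upper_bounds) + [float("inf")]
    counts = {le: 0 for le in bounds}
    for obs in observations:
        for le in bounds:
            if obs <= le:
                counts[le] += 1
    return counts
```

For latencies [0.05, 0.3, 0.7, 2.0] with bounds [0.1, 0.5, 1.0], the buckets hold 1, 2, 3, and 4 samples respectively.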
Data Samples
Prometheus stores collected samples as time‑series in an in‑memory database and periodically persists them to disk.
Each time‑series is identified by a metric name and a set of label pairs.
Sample Composition
Metric: name and associated label set describing the sample.
Timestamp: millisecond‑precision time of collection.
Value: a 64‑bit floating‑point number representing the metric value.
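The three parts of a sample can be sketched as a plain data structure; the field names here are illustrative, not Prometheus’s internal types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    """One Prometheus sample: series identity plus a single data point."""
    name: str           # metric name
    labels: tuple       # sorted (label, value) pairs identifying the series
    timestamp_ms: int   # millisecond-precision collection time
    value: float        # 64-bit floating-point value

    def series_id(self):
        # Name plus label set uniquely identifies the time series.
        return (self.name, self.labels)
```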
Data Collection
Prometheus primarily uses a Pull model, unlike Push‑based systems.
Pull Model
Real‑time
Periodic scraping; latency depends on scrape interval, generally less real‑time than Push.
State Persistence
Targets must be able to serve data; the Server remains stateless.
Control
The Server decides what to scrape and how often.
Configuration Complexity
Targets can be discovered via static files or service‑discovery mechanisms, keeping configuration simple and decoupled.
Push Model
Real‑time
Data is sent immediately to the monitoring system, offering lower latency.
State Persistence
Targets are stateless; the Server must maintain target state.
Control
Targets dictate what and when to push.
Configuration Complexity
Each target must be configured with the Server’s address.
Service Discovery
Static Configuration
Traditional method using a static file that lists target addresses (e.g., targets: ["10.10.10.10:8080"]).
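A static configuration for the address above might look like this in prometheus.yml; the job name and scrape interval are illustrative.

```yaml
scrape_configs:
  - job_name: "demo"
    scrape_interval: 15s
    static_configs:
      - targets: ["10.10.10.10:8080"]
```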
Dynamic Discovery
Suited for cloud environments with auto‑scaling.
Integrates with container orchestration platforms (e.g., Kubernetes) by listening to API changes and updating the target list automatically.
Data Storage
Local Storage
Built‑in time‑series database writes data to local disk.
Remote Storage
Used for large‑scale data retention.
Supports back‑ends such as OpenTSDB, InfluxDB, and Elasticsearch via remote read/write adapters.
Data Query
Prometheus provides PromQL and HTTP APIs for querying collected data.
Visualization options include Grafana, the built‑in expression browser, and custom dashboards (the earlier PromDash project is deprecated).
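As a sketch of how the HTTP API is typically called, the following builds an instant-query URL for the /api/v1/query endpoint; the server address is a placeholder.

```python
from urllib.parse import urlencode

def instant_query_url(base, promql, time=None):
    """Build an instant-query URL for Prometheus's HTTP API.

    base is the server address (placeholder here); time, if given,
    is the evaluation timestamp accepted by the API.
    """
    params = {"query": promql}
    if time is not None:
        params["time"] = time
    return f"{base}/api/v1/query?{urlencode(params)}"
```

For example, instant_query_url("http://localhost:9090", "rate(http_request_total[5m])") produces a URL whose JSON response carries the query result.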
DevOps Cloud Academy
Exploring industry DevOps practices and technical expertise.