Unlocking Prometheus: How TSDB Powers Scalable Monitoring and Real-Time Analytics
This article explains how Prometheus uses a time‑series database (TSDB) to handle massive monitoring data, detailing its concepts, query examples, storage engine design, indexing mechanisms, and the benefits of pre‑computing expressions for efficient real‑time analysis.
Background
Many beginners find Prometheus overwhelming because it introduces a large number of concepts at once, which makes the learning curve steep. The core concepts are Instance, Job, Metric, Metric Name, Metric Label, Metric Value, Metric Type (Counter, Gauge, Histogram, Summary), Data Types (Instant Vector, Range Vector, Scalar, String), Operators, and Functions. Underneath all of these, Prometheus is fundamentally a data-centric monitoring system, much like Alibaba's data-driven approach.
Daily Monitoring
To monitor every API of a web server (for example, WebServerA), we track dimensions such as service name (job), instance IP (instance), API name (handler), HTTP method (method), and response code (code), with the request count as the measured value.
Example SQL‑like queries:
<code>SELECT * FROM http_requests_total WHERE code="200" AND method="put" AND created_at BETWEEN 1495435700 AND 1495435710;</code>
<code>SELECT * FROM http_requests_total WHERE handler="prometheus" AND method="post" AND created_at BETWEEN 1495435700 AND 1495435710;</code>
<code>SELECT * FROM http_requests_total WHERE handler="query" AND instance="10.59.8.110" AND created_at BETWEEN 1495435700 AND 1495435710;</code>
When monitoring hundreds of services with many instances, APIs, and methods, the data volume quickly reaches billions of rows, making traditional relational databases impractical. Therefore, Prometheus adopts a Time-Series Database (TSDB) as its storage engine.
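The SQL-like queries above all have the same shape: equality filters on a few dimensions plus a time window. A minimal Python sketch of that selection over in-memory series follows. The `Series` class, the `select` helper, the sample values, and the second instance IP are illustrative assumptions, not Prometheus internals; the only Prometheus-specific detail used is that the metric name is itself stored as the `__name__` label.

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    labels: dict                                  # dimension tags, e.g. {"code": "200"}
    samples: list = field(default_factory=list)   # (timestamp, value) pairs, ascending by time


def select(series_list, matchers, start, end):
    """Return (labels, samples) for series whose labels equal every matcher,
    keeping only samples with start <= timestamp <= end."""
    result = []
    for s in series_list:
        if all(s.labels.get(name) == value for name, value in matchers.items()):
            window = [(t, v) for t, v in s.samples if start <= t <= end]
            if window:
                result.append((s.labels, window))
    return result


# Made-up data for illustration only.
data = [
    Series({"__name__": "http_requests_total", "code": "200", "method": "put",
            "handler": "query", "instance": "10.59.8.110"},
           [(1495435695, 10.0), (1495435705, 12.0)]),
    Series({"__name__": "http_requests_total", "code": "200", "method": "post",
            "handler": "prometheus", "instance": "10.59.8.111"},
           [(1495435702, 3.0), (1495435715, 4.0)]),
]

# Counterpart of the first SQL-like query: code="200" AND method="put" in the window.
hits = select(data,
              {"__name__": "http_requests_total", "code": "200", "method": "put"},
              1495435700, 1495435710)
```

Note that the filter never touches a row per request: each series is identified once by its labels, and only its samples are scanned by time.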
Storage Engine
A TSDB fits the monitoring workload well because that workload has the following characteristics:
Massive data volume.
Predominantly write‑heavy operations.
Writes are mostly sequential, ordered by time.
Rare updates; data is written shortly after collection.
Deletion occurs in block ranges, not individual points.
Data size exceeds memory, limiting cache effectiveness.
Reads are typically ordered scans (ascending or descending).
High‑concurrency reads are common.
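Several of these characteristics (time-ordered appends, deletion by block range, ordered scans) can be illustrated with a toy store that partitions samples into fixed-width time blocks. This is a sketch under stated assumptions: `BlockStore`, `block_span`, and `drop_before` are hypothetical names, and the structure is far simpler than a real storage engine.

```python
class BlockStore:
    """Toy append-only store that groups samples into fixed-width time blocks."""

    def __init__(self, block_span):
        self.block_span = block_span
        self.blocks = {}  # block start time -> list of (timestamp, value)

    def append(self, timestamp, value):
        # Writes are assumed to arrive in time order, so each block stays sorted.
        start = timestamp - timestamp % self.block_span
        self.blocks.setdefault(start, []).append((timestamp, value))

    def drop_before(self, cutoff):
        """Retention: delete whole blocks that end at or before cutoff."""
        expired = [s for s in self.blocks if s + self.block_span <= cutoff]
        for s in expired:
            del self.blocks[s]
        return len(expired)

    def scan(self, start, end):
        """Ordered scan that skips blocks lying entirely outside [start, end]."""
        out = []
        for block_start in sorted(self.blocks):
            if block_start + self.block_span <= start or block_start > end:
                continue
            out.extend((t, v) for t, v in self.blocks[block_start] if start <= t <= end)
        return out


store = BlockStore(block_span=100)
for t in range(0, 300, 50):
    store.append(t, float(t))

dropped = store.drop_before(200)   # removes the blocks covering [0, 100) and [100, 200)
remaining = store.scan(0, 300)
```

Dropping a block is a single dictionary deletion, which is why range-based deletion is cheap compared with removing individual points.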
TSDB stores data as two parts: labels (dimension tags) and samples (timestamp-value pairs). Labels uniquely identify a time series, while samples hold the actual metric values.
<code>{"labels": [{"latency": "500"}], "samples": [{"timestamp": 1473305798, "value": 0.9}]}</code>
The internal structure can be visualized as:
<code>series
│ ... server{latency="500"}
│ ... server{latency="300"}
│ ... server{}
│
<-------- time --------></code>
TSDB uses <code>timeseries:doc::</code> as the key for values and builds three indexes to accelerate queries:
Series Index: stores ordered label-key pairs.
Label Index: maps each label to a list of its values and references to the corresponding series.
Time Index: maps time ranges to data blocks, allowing fast skipping of irrelevant segments.
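One common way to realize a label index is an inverted index from each label pair to the set of series that carry it, so that a multi-label query becomes a set intersection. The sketch below assumes that design; the source does not describe the actual on-disk layout, and `LabelIndex`, `postings`, and `lookup` are hypothetical names.

```python
from collections import defaultdict


class LabelIndex:
    """Toy in-memory series index and label index."""

    def __init__(self):
        self.ids = {}                       # sorted label pairs -> series id
        self.postings = defaultdict(set)    # (name, value) -> ids of series with that label
        self.values = defaultdict(set)      # label name -> all values seen for it

    def add(self, labels):
        # Sorting the pairs gives every series one canonical identity.
        key = tuple(sorted(labels.items()))
        if key in self.ids:
            return self.ids[key]
        series_id = len(self.ids)
        self.ids[key] = series_id
        for name, value in key:
            self.postings[(name, value)].add(series_id)
            self.values[name].add(value)
        return series_id

    def lookup(self, matchers):
        """Return ids of series matching every (name, value) pair."""
        sets = [self.postings.get(pair, set()) for pair in matchers.items()]
        if not sets:
            return sorted(self.ids.values())
        return sorted(set.intersection(*sets))


index = LabelIndex()
a = index.add({"handler": "query", "method": "put"})
b = index.add({"handler": "query", "method": "post"})
c = index.add({"handler": "prometheus", "method": "post"})

matches = index.lookup({"handler": "query", "method": "post"})
```

Once the matching series ids are known, a time index such as a block-range map narrows the read to the blocks that overlap the query window.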
Data Computation
The powerful storage engine enables complex calculations. Prometheus can select multiple metric series, apply arithmetic operators, and use built‑in functions to perform matrix operations, effectively providing both a data warehouse and a computation platform for monitoring.
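As an illustration of combining a function with an arithmetic operator, the sketch below computes a per-second rate for two counters and then divides one by the other to get an error ratio. `simple_rate` is a deliberately simplified, hypothetical helper: unlike the real `rate()` function it ignores counter resets and does no extrapolation, and the sample values are made up.

```python
def simple_rate(samples, start, end):
    """Per-second increase between the first and last sample inside [start, end].

    Returns None when fewer than two samples fall in the window.
    """
    window = [(t, v) for t, v in samples if start <= t <= end]
    if len(window) < 2:
        return None
    (t0, v0), (t1, v1) = window[0], window[-1]
    return (v1 - v0) / (t1 - t0)


# Two counters sampled every 30 seconds (illustrative values).
errors = [(0, 0.0), (30, 3.0), (60, 6.0)]
total = [(0, 0.0), (30, 60.0), (60, 120.0)]

error_rate = simple_rate(errors, 0, 60)    # errors per second
total_rate = simple_rate(total, 0, 60)     # requests per second
error_ratio = error_rate / total_rate      # fraction of requests that failed
```

In a real deployment the same computation would run across every matching series at once, which is where the cost of such expressions comes from.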
One Calculation, Multiple Queries
Because such calculations are resource‑intensive, pre‑computing results is advantageous. Prometheus offers Recording Rules to evaluate expensive expressions in advance and store the results as new time series, enabling a single computation to serve many queries and alerts.
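The idea of computing once and reading many times can be sketched as follows. `RecordingRule`, `expensive_expression`, and the series name `job:http_requests_total:rate5m` are hypothetical and only mimic the behaviour described above; actual Recording Rules are declared in rule files and evaluated by Prometheus itself.

```python
class RecordingRule:
    """Toy rule that evaluates an expression on a schedule and stores each result."""

    def __init__(self, record, expr):
        self.record = record    # name of the new, pre-computed series
        self.expr = expr        # callable taking the evaluation time
        self.samples = []       # stored (timestamp, value) results

    def evaluate(self, eval_time):
        self.samples.append((eval_time, self.expr(eval_time)))

    def latest(self):
        return self.samples[-1] if self.samples else None


calls = 0


def expensive_expression(eval_time):
    # Stand-in for a costly query; counts how often it really runs.
    global calls
    calls += 1
    return 42.0


rule = RecordingRule("job:http_requests_total:rate5m", expensive_expression)

reads = []
for eval_time in (60, 120):
    rule.evaluate(eval_time)            # one computation per interval
    for _ in range(3):                  # three dashboards or alerts read the result
        reads.append(rule.latest())
```

Six reads are served here by only two evaluations, since readers query the stored series instead of re-running the expression.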
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.