Rethinking Prometheus TSDB: From V2 Bottlenecks to the Scalable V3 Design
This article examines the limitations of Prometheus's original V2 time‑series storage, proposes a block‑oriented V3 architecture that tackles series churn, write amplification, and indexing inefficiencies, and validates the new design with extensive benchmarks showing dramatic reductions in memory, CPU, and disk usage.
Background and Motivation
Prometheus, the CNCF‑backed monitoring system, stores metrics in a custom time‑series database (TSDB) that works well for many Kubernetes workloads. However, as workloads become highly dynamic—continuous deployments, auto‑scaling, and frequent rolling updates—the existing V2 storage faces pressure from massive series churn, write amplification on SSDs, and inefficient indexing.
Time‑Series Model
Each series is identified by a metric name and a set of label dimensions. A data point is a (timestamp, value) tuple: the timestamp is an integer and the value is typically stored as a 64‑bit float. A series maps its identifier to a sequence of such points:
identifier -> (t0, v0), (t1, v1), (t2, v2), ...
Example series identifiers:
requests_total{path="/status", method="GET", instance="10.0.0.1:80"}
requests_total{path="/status", method="POST", instance="10.0.0.3:80"}
requests_total{path="/", method="GET", instance="10.0.0.2:80"}
For queries, the metric name can be treated as a special label __name__:
{__name__="requests_total", path="/status", method="GET"}
Limitations of the V2 Storage
Each series is stored in its own file, leading to millions of files, inode exhaustion, and high open‑file overhead.
Writes are performed per‑sample, causing severe write amplification on SSDs because a 16‑byte sample forces a 4 KiB page write.
Retention and deletion require scanning and rewriting billions of files, consuming hours of CPU and I/O.
Indexing uses a LevelDB‑based label‑pair index that does not scale well for multi‑label queries, resulting in O(n²) lookup costs.
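The write-amplification figure in the list above follows from simple arithmetic. The sketch below assumes the 16-byte sample and 4 KiB page size quoted there and ignores any caching done by the operating system or the drive; the function name is purely illustrative.

```python
def write_amplification(page_bytes: int, payload_bytes: int) -> float:
    """Bytes physically written per byte of useful payload."""
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return page_bytes / payload_bytes


SAMPLE_BYTES = 16       # one uncompressed (timestamp, value) pair: 8 + 8 bytes
PAGE_BYTES = 4 * 1024   # 4 KiB page, as quoted in the text

# Worst case: every sample triggers its own page write.
per_sample = write_amplification(PAGE_BYTES, SAMPLE_BYTES)

# Batching a full page of samples before writing removes the overhead.
samples_per_page = PAGE_BYTES // SAMPLE_BYTES
batched = write_amplification(PAGE_BYTES, samples_per_page * SAMPLE_BYTES)

print(per_sample, batched)
```

Under these assumptions a per-sample write costs 256 times the payload, which is why batching writes matters so much on SSDs.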
Design Goals for V3
The new design keeps the useful ideas of block storage and compression while eliminating per‑series files. Core goals:
Group time‑series data into immutable blocks that cover a fixed time window.
Store all series of a block together, reducing the number of open files.
Use a write‑ahead log (WAL) for crash‑recovery and keep recent data in memory for fast queries.
Introduce compaction to merge smaller blocks into larger ones, reclaiming space and improving query performance.
Maintain a simple inverted index that maps each label name/value pair to a sorted list of series IDs, so that multi‑label queries reduce to linear‑time intersections of those lists.
Block Layout (V3)
On disk the top‑level directory contains a series of b‑<block‑id> folders. Each block holds:
chunks/ – raw data chunks for many series.
index – an immutable index for the block.
meta.json – metadata describing the block’s time range and compaction state.
Example directory tree:
$ tree ./data
./data
+-- b-000001
| +-- chunks
| | +-- 000001
| | +-- 000002
| | +-- 000003
| +-- index
| +-- meta.json
+-- b-000004
| +-- chunks
| | +-- 000001
| +-- index
| +-- meta.json
+-- wal
| +-- 000001
| +-- 000002
| +-- 000003
Blocks are immutable; new samples are written to an in‑memory head block and periodically flushed to a new block on disk. The WAL guarantees durability between flushes.
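The interplay between the head block and the WAL can be sketched roughly as follows. This is a minimal illustration, not the actual storage engine: the `Head` and `Block` classes, their fields, and the use of plain Python lists in place of WAL segment files are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Sample = Tuple[str, int, float]  # (series identifier, timestamp, value)


@dataclass(frozen=True)
class Block:
    """Immutable block covering [min_time, max_time] (hypothetical type)."""
    min_time: int
    max_time: int
    samples: Tuple[Sample, ...]


@dataclass
class Head:
    """Hypothetical in-memory head block backed by a write-ahead log."""
    wal: List[Sample] = field(default_factory=list)      # stand-in for wal/ segments
    samples: List[Sample] = field(default_factory=list)  # queryable in-memory data

    def append(self, series: str, ts: int, value: float) -> None:
        record = (series, ts, value)
        self.wal.append(record)      # log first, so a crash cannot lose the sample
        self.samples.append(record)  # then make it visible to queries

    def flush(self) -> Block:
        """Cut the head into an immutable block and truncate the WAL."""
        if not self.samples:
            raise ValueError("nothing to flush")
        times = [ts for _, ts, _ in self.samples]
        block = Block(min(times), max(times), tuple(self.samples))
        self.samples.clear()
        self.wal.clear()  # safe only once the block has been persisted
        return block

    @classmethod
    def recover(cls, wal: List[Sample]) -> "Head":
        """Rebuild in-memory state after a crash by replaying the WAL."""
        head = cls()
        for series, ts, value in wal:
            head.append(series, ts, value)
        return head


head = Head()
head.append("requests_total", 1000, 1.0)
head.append("requests_total", 2000, 2.0)
recovered = Head.recover(list(head.wal))  # simulate a restart before the flush
block = head.flush()
```

In this sketch a restart before the flush rebuilds the same in-memory samples from the log, while a flush produces one immutable block and leaves the log empty.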
Compaction and Retention
When several sequential blocks are ready, the compactor merges them into a larger block, optionally discarding deleted series. Retention is simply deleting whole block directories that fall outside the configured time window, turning a previously expensive operation into an O(1) filesystem delete.
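Block-level retention and compaction can be illustrated with a small sketch that works only on block metadata. The `BlockMeta` type and its field names are hypothetical stand-ins for whatever meta.json records, and a real implementation would also merge chunk and index data and remove directories on disk.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical summary of a block: its name and the time range it covers."""
    name: str
    min_time: int
    max_time: int


def apply_retention(blocks: List[BlockMeta], now: int, retention: int) -> List[BlockMeta]:
    """Keep blocks that still overlap the retention window; expired blocks go whole."""
    cutoff = now - retention
    return [b for b in blocks if b.max_time >= cutoff]


def compact(blocks: List[BlockMeta], name: str) -> BlockMeta:
    """Merge a run of blocks into one block spanning their combined range."""
    if not blocks:
        raise ValueError("need at least one block")
    ordered = sorted(blocks, key=lambda b: b.min_time)
    return BlockMeta(name, ordered[0].min_time, max(b.max_time for b in ordered))


blocks = [
    BlockMeta("b-000001", 0, 99),
    BlockMeta("b-000002", 100, 199),
    BlockMeta("b-000003", 200, 299),
]
kept = apply_retention(blocks, now=300, retention=150)  # cutoff is 150
merged = compact(kept, "b-000004")
```

Because the retention decision looks only at each block's time range, no per-series data has to be scanned or rewritten.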
Inverted Index
Each series receives a unique numeric ID. For every label name/value pair, an inverted list of series IDs is stored in sorted order. Query processing intersects these lists with a cursor‑based merge that walks each list only once. For k lists of at most n entries, this reduces the cost from the O(n²) of a naive pairwise comparison to O(k·n).
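The cursor-based merge described above can be sketched as follows. The function names and the sample postings are made up for illustration; the only assumption carried over from the text is that every list is sorted in ascending order.

```python
from typing import List, Sequence


def intersect(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Intersect two sorted ID lists by advancing a cursor through each."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, move its cursor forward
        else:
            j += 1  # b is behind, move its cursor forward
    return out


def intersect_all(lists: Sequence[Sequence[int]]) -> List[int]:
    """Fold the pairwise merge over k lists, starting with the shortest."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = list(ordered[0])
    for other in ordered[1:]:
        result = intersect(result, other)
    return result


# Hypothetical postings: (label name, label value) -> sorted series IDs.
postings = {
    ("__name__", "requests_total"): [5, 8, 1000, 1001, 4000],
    ("app", "foo"): [1, 3, 10, 1000, 1001, 2000],
    ("method", "GET"): [8, 1000, 3000, 4000],
}
matches = intersect_all(list(postings.values()))
```

Each cursor only ever moves forward, so no list is scanned more than once per pairwise merge.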
Example intersection:
__name__="requests_total" -> [999,1000,1001,2000000,...]
app="foo" -> [1,3,10,11,12,100,311,320,1000,1001,...]
intersection -> [1000,1001]
Benchmarking
Benchmarks were run on a synthetic dataset of ~4.4 M series with generated samples, comparing Prometheus 1.5.2 (which uses the V2 storage) with Prometheus 2.0 (which uses the V3 storage). Key findings:
V3 achieved ~20× higher ingest throughput than V2 on a laptop, reaching about 2 × 10⁷ samples/s.
Memory usage dropped 3–4×; after 6 h the Prometheus 1.5.2 instance showed a sharp memory spike due to retention, while Prometheus 2.0 remained stable.
CPU consumption for queries fell 3–10×.
Disk write amplification fell from near‑100 % on Prometheus 1.5.2 to under 2 % on Prometheus 2.0, dramatically extending SSD lifespan.
Query latency (99th percentile) stayed low for Prometheus 2.0 even under heavy series churn, whereas Prometheus 1.5.2 latency grew with the number of active series.
Figures (kept as images) illustrate memory, CPU, disk I/O, and latency trends for both versions.
Conclusions
The V3 block‑oriented storage solves the core problems of series churn, write amplification, and inefficient indexing while preserving the high‑performance compression of the original design. Benchmarks confirm substantial reductions in resource consumption and improved scalability, making the new TSDB suitable for large‑scale, highly dynamic Kubernetes environments.
Source code for the storage engine is available at
and the benchmarking suite at
.
