
Deep Dive into Prometheus V2 Storage Engine and Query Process

This article explains the internal storage layout, on‑disk and in‑memory data structures, and the query execution flow of Prometheus V2, illustrating how blocks, chunks, WAL, indexes and postings are organized and accessed to serve time‑series queries efficiently.

Prometheus is a popular cloud‑native time‑series database used for monitoring. Although its overall architecture has remained stable, the underlying storage engine has evolved through several versions. This article focuses on the storage format of Prometheus V2 (the current version) and how queries locate the required data.

Background: Prometheus stores data in 2-hour blocks, each identified by a ULID. A block contains chunks (segment files of compressed samples), an index (an inverted index), and a meta.json file with time-range metadata. In addition there are chunks_head (the chunk currently being written) and a write-ahead log (wal) for durability.

├── 01BKGV7JC0RY8A6MACW02A2PJD  // block ULID
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── chunks_head
│   └── 000001
└── wal
    ├── 000000002
    └── checkpoint.00000001
        └── 00000000

The data directory consists of three main parts:

block: a read-only 2-hour slice containing chunks, index, and meta.json.

chunks_head: the active chunk being written, kept in memory and flushed to disk when full.

wal: a write-ahead log that batches writes to guarantee durability.
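Each block's meta.json records the block's identity and time range. A representative example (timestamps are Unix milliseconds spanning 2 hours; the stats values here are invented for illustration):

```json
{
  "ulid": "01BKGV7JC0RY8A6MACW02A2PJD",
  "minTime": 1618128000000,
  "maxTime": 1618135200000,
  "stats": {
    "numSamples": 8540395,
    "numSeries": 52288,
    "numChunks": 104576
  },
  "compaction": {
    "level": 1,
    "sources": ["01BKGV7JC0RY8A6MACW02A2PJD"]
  },
  "version": 1
}
```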

Data Model: Prometheus stores a single float value per sample (e.g., cpu_usage{core="1", ip="130.25.175.171"} 14.04 1618137750). The layout of a chunk file is:

┌───────────────────────────────┐
│  magic(0x0130BC91) <4 byte>   │
├───────────────────────────────┤
│      version(1) <1 byte>      │
├───────────────────────────────┤
│      padding(0) <3 byte>      │
├───────────────────────────────┤
│ ┌───────────────────────────┐ │
│ │          Chunk 1          │ │
│ ├───────────────────────────┤ │
│ │            ...            │ │
│ ├───────────────────────────┤ │
│ │          Chunk N          │ │
│ └───────────────────────────┘ │
└───────────────────────────────┘

# Inside a single chunk
┌─────────────────────┬───────────────┬───────────────┬───────────────────┬───────────────┬──────┬────────────────┐
│ series ref <8 byte> │ mint <8 byte> │ maxt <8 byte> │ encoding <1 byte> │ len <uvarint> │ data │ CRC32 <4 byte> │
└─────────────────────┴───────────────┴───────────────┴───────────────────┴───────────────┴──────┴────────────────┘
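To make the layout concrete, here is a small Go sketch (not Prometheus's actual code) that decodes the fixed-width header fields of such a record with encoding/binary. The record bytes are synthetic, and CRC verification is omitted:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// headChunkRecord mirrors the per-chunk layout sketched above.
// Illustrative only; not Prometheus's implementation.
type headChunkRecord struct {
	seriesRef  uint64
	mint, maxt int64
	encoding   byte
	data       []byte
}

func parseHeadChunk(b []byte) (headChunkRecord, error) {
	var r headChunkRecord
	if len(b) < 25 {
		return r, fmt.Errorf("record too short: %d bytes", len(b))
	}
	r.seriesRef = binary.BigEndian.Uint64(b[0:8])
	r.mint = int64(binary.BigEndian.Uint64(b[8:16]))
	r.maxt = int64(binary.BigEndian.Uint64(b[16:24]))
	r.encoding = b[24]
	dataLen, n := binary.Uvarint(b[25:]) // len is a varint
	if n <= 0 {
		return r, fmt.Errorf("bad length varint")
	}
	off := 25 + n
	if len(b) < off+int(dataLen)+4 { // +4 for the trailing CRC32
		return r, fmt.Errorf("truncated record")
	}
	r.data = b[off : off+int(dataLen)]
	return r, nil
}

// buildExample assembles a synthetic record: ref=7, mint=100, maxt=200,
// encoding=1, three data bytes, and a placeholder CRC.
func buildExample() []byte {
	buf := make([]byte, 24)
	binary.BigEndian.PutUint64(buf[0:8], 7)
	binary.BigEndian.PutUint64(buf[8:16], 100)
	binary.BigEndian.PutUint64(buf[16:24], 200)
	buf = append(buf, 1)                      // encoding
	buf = binary.AppendUvarint(buf, 3)        // len
	buf = append(buf, 0xAA, 0xBB, 0xCC)       // data
	buf = append(buf, 0x00, 0x00, 0x00, 0x00) // placeholder CRC32
	return buf
}

func main() {
	rec, err := parseHeadChunk(buildExample())
	fmt.Println(rec.seriesRef, rec.mint, rec.maxt, rec.encoding, len(rec.data), err)
}
```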

The index file is an inverted index. It stores a symbol table, series metadata, multiple label indexes, postings lists, and a table of contents (TOC). The TOC holds offsets to the other sections. The postings offset table stores, for each label name/value pair, the file offset of the corresponding postings list.

┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4 byte> │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                 Symbol Table                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                    Series                    │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index 1                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index N                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings 1                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings N                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │               Label Offset Table             │ │
│ ├──────────────────────────────────────────────┤ │
│ │             Postings Offset Table            │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      TOC                     │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

During a query, Prometheus first looks up each label matcher's entry in the Postings Offset Table, then reads the corresponding postings list to obtain series refs; each series entry in turn points to the chunks that hold its samples. Chunk files are memory-mapped (mmap) for fast reads.

// open all blocks (simplified from Prometheus's tsdb package)
bDirs, err := blockDirs(dir)
if err != nil {
    return nil, err
}
for _, bDir := range bDirs {
    meta, _, err := readMetaFile(bDir)
    if err != nil {
        continue // skip directories without a readable meta.json
    }
    // reuse an already-open block if we have one for this ULID
    block, open := getBlock(loaded, meta.ULID)
    if !open {
        block, err = OpenBlock(l, bDir, chunkPool)
        if err != nil {
            corrupted[meta.ULID] = err
            continue
        }
    }
    blocks = append(blocks, block)
}
// mmap every chunk segment file so reads go straight to the page cache
for _, fn := range files {
    f, err := fileutil.OpenMmapFile(fn)
    if err != nil {
        return nil, tsdb_errors.NewMulti(
            errors.Wrap(err, "mmap files"),
            tsdb_errors.CloseAll(cs),
        ).Err()
    }
    cs = append(cs, f)
    bs = append(bs, realByteSlice(f.Bytes()))
}

The in‑memory structures include DB (holding a slice of Block and a Head ), Block (with an IndexReader that contains postings), and Head (with MemPostings and stripeSeries ). The Head stores the most recent data in chunks_head and a write‑ahead log, while historic data lives in read‑only blocks.

type DB struct {
    blocks []*Block
    head   *Head
    // ... other fields omitted
}

type Block struct {
    // simplified: in Prometheus the postings live behind the block's IndexReader
    postings map[string][]postingOffset
}

type postingOffset struct {
    value string // label value
    off   int    // offset in postings file
}

type Head struct {
    postings *index.MemPostings // in‑memory postings
    series   *stripeSeries
}

type MemPostings struct {
    mtx     sync.RWMutex
    m       map[string]map[string][]uint64 // label name -> value -> posting list
    ordered bool
}
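A minimal sketch of how a MemPostings-style two-level map answers a single label matcher (the data here is invented; the real MemPostings also takes its RWMutex before reading):

```go
package main

import "fmt"

// lookup returns the posting list for one label name/value pair from a
// MemPostings-style two-level map. Illustrative only.
func lookup(m map[string]map[string][]uint64, name, value string) []uint64 {
	if values, ok := m[name]; ok {
		return values[value]
	}
	return nil
}

func main() {
	m := map[string]map[string][]uint64{
		"job":  {"node": {1, 2, 5}},
		"core": {"1": {2, 5, 9}},
	}
	fmt.Println(lookup(m, "job", "node")) // [1 2 5]
}
```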

Query execution consists of two main steps. First, the label matchers are resolved to series IDs via postings lists, applying set operations (intersection, union, negation) with optimisations such as converting negations to positive matches and lazy merging using a mergesort-like algorithm. Second, for each matching series, the required samples are read from the mmap-ed chunk files using the series ref and time-range filters.
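The mergesort-like intersection in the first step can be sketched as follows. This is an illustrative sketch over materialised slices; Prometheus's real implementation streams lazily through Postings iterators:

```go
package main

import "fmt"

// intersect merges sorted postings lists the way a mergesort-style
// intersection does: advance the list whose current ID is smallest,
// and emit an ID only when every list agrees on it.
func intersect(lists ...[]uint64) []uint64 {
	if len(lists) == 0 {
		return nil
	}
	var out []uint64
	idx := make([]int, len(lists))
	for {
		// Candidate is the current ID of the first list.
		if idx[0] >= len(lists[0]) {
			return out
		}
		candidate := lists[0][idx[0]]
		match := true
		for i := 1; i < len(lists); i++ {
			// Advance list i until it reaches or passes the candidate.
			for idx[i] < len(lists[i]) && lists[i][idx[i]] < candidate {
				idx[i]++
			}
			if idx[i] >= len(lists[i]) {
				return out // any exhausted list ends the intersection
			}
			if lists[i][idx[i]] != candidate {
				match = false
			}
		}
		if match {
			out = append(out, candidate)
		}
		idx[0]++
	}
}

func main() {
	// e.g. postings for {job="node"} and {core="1"}
	a := []uint64{1, 3, 5, 8, 13}
	b := []uint64{2, 3, 8, 9, 13, 20}
	fmt.Println(intersect(a, b)) // [3 8 13]
}
```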

Summary : The article walks through Prometheus’s on‑disk block layout, chunk and index formats, in‑memory data structures, and the query path that combines label‑based postings with time‑range filtering, providing a clear understanding of how Prometheus efficiently stores and retrieves time‑series data.

Tags: monitoring, storage engine, Go, Prometheus, TSDB, time series
Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
