Understanding Prometheus V2 Storage Engine and Query Process
This article explains the architecture of Prometheus V2, detailing its on‑disk block layout, chunk and index formats, the inverted index mechanism, and how queries locate and retrieve time‑series data, while also covering in‑memory structures and practical usage patterns.
Prometheus is a popular cloud‑native time‑series database used for monitoring. Although its overall architecture remains stable, the underlying storage engine has evolved through several versions. This article focuses on the storage format of Prometheus V2 (v2.25.2) and how queries locate matching data.
Background: Prometheus stores data in blocks (2-hour intervals by default) identified by ULIDs, each containing chunks, index, tombstones, and meta.json. Additional directories include chunks_head, which holds the chunks of the in-memory head block, and wal for the write-ahead log.
├── 01BKGV7JC0RY8A6MACW02A2PJD   // block ULID
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── chunks_head
│   └── 000001
└── wal
    ├── 000000002
    └── checkpoint.00000001
        └── 00000000

The design addresses two key time-series characteristics: vertical writes (appending the latest samples per series) and horizontal reads (range queries across many series). Data is partitioned by time to handle the short series lifecycles typical of cloud-native environments.
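Each block's time range and contents can be inspected through its meta.json. Below is a minimal sketch of parsing it; the BlockMeta struct and parseMeta helper are simplified stand-ins here (the real struct in the tsdb package carries more fields, such as compaction metadata):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlockMeta mirrors a subset of the fields found in a block's meta.json.
type BlockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"` // milliseconds since epoch, inclusive
	MaxTime int64  `json:"maxTime"` // milliseconds since epoch, exclusive
	Stats   struct {
		NumSamples uint64 `json:"numSamples"`
		NumSeries  uint64 `json:"numSeries"`
	} `json:"stats"`
	Version int `json:"version"`
}

// parseMeta decodes the raw contents of a meta.json file.
func parseMeta(raw []byte) (BlockMeta, error) {
	var m BlockMeta
	err := json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	raw := []byte(`{
		"ulid": "01BKGV7JC0RY8A6MACW02A2PJD",
		"minTime": 1602237600000,
		"maxTime": 1602244800000,
		"stats": {"numSamples": 553673, "numSeries": 1346},
		"version": 1
	}`)
	m, err := parseMeta(raw)
	if err != nil {
		panic(err)
	}
	// A default-sized block spans 2 hours: maxTime - minTime == 7200000 ms.
	fmt.Println(m.ULID, (m.MaxTime-m.MinTime)/3600000, "hours")
}
```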
Disk Storage Format: Each chunk file starts with a magic number, a version byte, and padding, followed by a sequence of chunk records (the magic shown below, 0x0130BC91, is the one used for head chunk files under chunks_head). Each chunk record contains a series ref, mint, maxt, encoding, len, the chunk data, and a CRC32 checksum.
┌──────────────────────────────┐
│ magic(0x0130BC91) <4 byte> │
├──────────────────────────────┤
│ version(1) <1 byte> │
├──────────────────────────────┤
│ padding(0) <3 byte> │
├──────────────────────────────┤
│ ┌──────────────────────────┐ │
│ │ Chunk 1 │ │
│ ├──────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────┤ │
│ │ Chunk N │ │
│ └──────────────────────────┘ │
└──────────────────────────────┘
# Layout of a single chunk record
┌─────────────────────┬───────────────┬───────────────┬───────────────────┬───────────────┬──────┬────────────────┐
│ series ref <8 byte> │ mint <8 byte> │ maxt <8 byte> │ encoding <1 byte> │ len <uvarint> │ data │ CRC32 <4 byte> │
└─────────────────────┴───────────────┴───────────────┴───────────────────┴───────────────┴──────┴────────────────┘

Index Format: The index file begins with a magic number and version, followed by a symbol table, series entries, multiple label index sections, postings lists, and offset tables. The TOC (table of contents) at the end stores each section's offset for fast navigation.
┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b> │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │ Symbol Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Series │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Index 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Index N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings 1 │ │
│ ├──────────────────────────────────────────────┤ │
│ │ ... │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings N │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Label Offset Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Postings Offset Table │ │
│ ├──────────────────────────────────────────────┤ │
│ │ TOC │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

During a query, Prometheus first consults the postings offset table to find where a label's postings list lives, then reads the series refs from that list to locate the matching chunks on disk. To save memory, the in-memory copy of the postings offset table is sparse: only every 32nd label value (plus the last one) is kept.
// For the postings offset table we keep every label name but only every nth
// label value (plus the first and last one), to save memory.
ReadOffsetTable(r.b, r.toc.PostingsTable, func(key []string, _ uint64, off int) error {
	if _, ok := r.postings[key[0]]; !ok {
		r.postings[key[0]] = []postingOffset{}
		if lastKey != nil {
			r.postings[lastKey[0]] = append(r.postings[lastKey[0]], postingOffset{value: lastKey[1], off: lastOff})
		}
		lastKey = nil
		valueCount = 0
	}
	if valueCount%32 == 0 {
		r.postings[key[0]] = append(r.postings[key[0]], postingOffset{value: key[1], off: off})
		lastKey = nil
	} else {
		lastKey = key
		lastOff = off
	}
	valueCount++
})
if lastKey != nil {
	r.postings[lastKey[0]] = append(r.postings[lastKey[0]], postingOffset{value: lastKey[1], off: lastOff})
}

The in-memory structures include a DB holding a slice of Block objects and a Head for the block currently being written. Each Block exposes an IndexReader that holds the sampled postings offsets, while the Head maintains a MemPostings map for fast label lookups.
type DB struct {
	blocks []*Block
	head   *Head
	// ... other fields omitted
}

// Block's main field is IndexReader, which holds postings (inverted index)
postings map[string][]postingOffset

type postingOffset struct {
	value string // label value
	off   int    // offset in the postings section of the index file
}

type Head struct {
	postings *index.MemPostings // postings lists for terms
	series   *stripeSeries
	// ... other fields omitted
}

type MemPostings struct {
	mtx     sync.RWMutex
	m       map[string]map[string][]uint64 // label name -> label value -> series IDs
	ordered bool
}

When executing a query, Prometheus first resolves the label matchers to posting lists (using binary search plus a merge-sort-like lazy intersection), then iterates over the resulting series, reading the required chunks via mmap for efficient I/O.
type intersectPostings struct {
	arr []Postings // postings iterators to intersect
	cur uint64     // current candidate series ID
}

// doNext advances each iterator to at least it.cur and returns true
// once all of them agree on the same series ID.
func (it *intersectPostings) doNext() bool {
Loop:
	for {
		for _, p := range it.arr {
			if !p.Seek(it.cur) {
				return false
			}
			if p.At() > it.cur {
				it.cur = p.At()
				continue Loop
			}
		}
		return true
	}
}

func (it *intersectPostings) Next() bool {
	for _, p := range it.arr {
		if !p.Next() {
			return false
		}
		if p.At() > it.cur {
			it.cur = p.At()
		}
	}
	return it.doNext()
}

In summary, Prometheus organizes data into time-partitioned blocks, uses a compact on-disk chunk format, and relies on a sparse inverted index to support fast label-based queries while keeping memory consumption low. Compaction merges older blocks to keep query performance reasonable over larger time ranges.