Databases 8 min read

Understanding Sparse Indexes in Databases: Kafka and ClickHouse Examples

This article explains the concept of sparse indexes in database systems, compares them with dense indexes, and demonstrates their use in Kafka log files and ClickHouse MergeTree tables, highlighting implementation details, lookup procedures, and configuration parameters.

Big Data Technology & Architecture

Nov 7, 2020

Understanding Sparse Indexes in Databases: Kafka and ClickHouse Examples

In storage systems such as databases, an index is an auxiliary data structure that reduces disk I/O and speeds up queries, similar to a book's table of contents. An index consists of index keys (analogous to chapters) and pointers to the original data (analogous to page numbers).

A sparse index is an ordered index that stores entries for only a subset of the data—typically a small fraction—when the primary key is sorted. This allows queries to narrow down a range before performing a more detailed search within that range. By contrast, a dense index contains an entry for every record.

Sparse Index in Kafka

Kafka stores messages in a topic partition as log segments (.log files). Each segment is accompanied by two index files: an offset index (.index) and a timestamp index (.timeindex), both of which are sparse indexes.

.
├── 00000000000190089251.index
├── 00000000000190089251.log
├── 00000000000190089251.timeindex
├── 00000000000191671269.index
├── 00000000000191671269.log
├── 00000000000191671269.timeindex
├── 00000000000193246592.index
├── 00000000000193246592.log
├── 00000000000193246592.timeindex
├── 00000000000194821538.index
├── 00000000000194821538.log
├── 00000000000194821538.timeindex
├── 00000000000196397456.index
├── 00000000000196397456.log
├── 00000000000196397456.timeindex
├── 00000000000197971543.index
├── 00000000000197971543.log
├── 00000000000197971543.timeindex
......

Each index file maps an offset (or timestamp) to a byte position inside the corresponding .log file. The tool kafka-run-class kafka.tools.DumpLogSegments can display the contents of these index files:

~ kafka-run-class kafka.tools.DumpLogSegments --files /data4/kafka/data/ods_analytics_access_log-3/00000000000197971543.index
Dumping /data4/kafka/data/ods_analytics_access_log-3/00000000000197971543.index
offset: 197971551 position: 5207
offset: 197971558 position: 9927
offset: 197971565 position: 14624
... (more entries) ...

~ kafka-run-class kafka.tools.DumpLogSegments --files /data4/kafka/data/ods_analytics_access_log-3/00000000000197971543.timeindex
Dumping /data4/kafka/data/ods_analytics_access_log-3/00000000000197971543.timeindex
timestamp: 1593230317565 offset: 197971551
timestamp: 1593230317642 offset: 197971558
... (more entries) ...

To locate a message with offset 197971577, the process is:

Binary‑search the list of index files to find the file that contains the offset (e.g., 00000000000197971543.index).

Binary‑search within that file to locate the nearest lower indexed offset (e.g., 197971592).

Sequentially scan from that position in the .log file until the target offset is found.

The granularity of the sparse index is controlled by the log.index.interval.bytes parameter (default 4 KB). Increasing this value makes the index sparser; decreasing it makes it denser.

Sparse Index in ClickHouse

In ClickHouse, the MergeTree engine uses an ORDER BY clause to define index columns. The system creates “marks” (index marks) whose spacing is determined by the index_granularity setting (default 8192 rows). The following diagram illustrates a table indexed by CounterID and Date:

Each part of a MergeTree table is stored in a directory containing column data files ( .bin) and corresponding mark files ( .mrk2). An example part layout:

.
├── business_area_id.bin
├── business_area_id.mrk2
├── coupon_money.bin
├── coupon_money.mrk2
├── groupon_id.bin
├── groupon_id.mrk2
├── is_new_order.bin
├── is_new_order.mrk2
... (more columns) ...
├── primary.idx
...

The .mrk2 files map mark numbers to byte offsets inside the .bin files, enabling the engine to jump directly to the relevant data range after locating the appropriate part.

Unlike Kafka’s fixed‑interval sparse index, ClickHouse allows users to combine multiple columns as the index key, but adding too many columns or columns with low cardinality can make the index overly sparse and increase storage and lookup costs.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data ClickHouse Databases Sparse Index

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.