Understanding Kafka's Segment Storage and Index Design

This article explains how Kafka partitions data into segments, stores each segment as paired index and log files, and uses sparse indexing to enable efficient queries, illustrating the process with examples and diagrams of segment layout and offset lookup.

Big Data Technology & Architecture

This article introduces Kafka's underlying data storage format, its efficient index design, and the actual query process.

1. Segment

Kafka divides each partition into multiple segments, the smallest units of storage on disk. The broker appends incoming data to the partition's current segment; when that segment reaches its size limit (1 GB by default) or its age limit (one week by default), it is closed and a new one is opened. The segment currently being written is called the active segment, and it is never deleted while it is active. This design splits what would otherwise be one large partition file into many smaller files, which speeds up searches and lets expired data be removed by deleting whole files.
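The rolling behaviour can be illustrated with a toy model. This is a minimal sketch, not Kafka's implementation: the `Partition` and `Segment` classes, the tiny `SEGMENT_BYTES` limit, and the `expire_oldest` helper are all hypothetical, and only the size-based trigger is modelled (the time-based trigger is left out).

```python
from dataclasses import dataclass, field

# Toy size limit for illustration; Kafka's default segment size is 1 GB.
SEGMENT_BYTES = 1024


@dataclass
class Segment:
    base_offset: int      # offset of the first record stored in this segment
    size: int = 0         # bytes written so far
    closed: bool = False  # True once the segment has been rolled


@dataclass
class Partition:
    segments: list = field(default_factory=lambda: [Segment(base_offset=0)])
    next_offset: int = 0

    def append(self, record_size: int) -> int:
        """Append one record and return the offset assigned to it."""
        active = self.segments[-1]
        # Roll to a new segment when the active one would exceed the limit.
        if active.size > 0 and active.size + record_size > SEGMENT_BYTES:
            active.closed = True
            active = Segment(base_offset=self.next_offset)
            self.segments.append(active)
        active.size += record_size
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def expire_oldest(self):
        """Drop the oldest segment as a whole, but never the active one."""
        if self.segments[0].closed:
            return self.segments.pop(0)
        return None


partition = Partition()
for _ in range(10):
    partition.append(300)
print([s.base_offset for s in partition.segments])  # [0, 3, 6, 9]
```

Each new segment starts at the offset of the next record to be written, which is why expiring old data reduces to removing the files at the front of the list.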

2. Storage and Query

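Each segment consists of two files that appear as a pair: an .index file and a .log file. The log file stores the actual messages, while the index file maps message offsets to their physical byte positions in the log file. Both files are named after the starting (base) offset of the segment, and index entries store offsets relative to that base.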
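As a small illustration of the naming scheme, the sketch below builds the pair of file names for a segment. It assumes the 20-digit zero-padded form that Kafka uses on disk; the helper names `segment_file_names` and `base_offset_of` are hypothetical.

```python
def segment_file_names(base_offset: int) -> tuple[str, str]:
    """Return the (.index, .log) file names for a segment's base offset."""
    stem = f"{base_offset:020d}"  # zero-padded to 20 digits
    return f"{stem}.index", f"{stem}.log"


def base_offset_of(file_name: str) -> int:
    """Recover the base offset from a segment file name."""
    return int(file_name.split(".", 1)[0])


index_name, log_name = segment_file_names(368769)
print(index_name)
print(base_offset_of(log_name))  # 368769
```

Because the names are zero-padded, sorting the file names lexicographically also sorts the segments by base offset.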

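For example, to look up offset 368775, Kafka first selects the segment whose index file is 00000000000000368769.index, since 368769 is the largest base offset that does not exceed the target. Within that index, the third entry holds relative offset 6 (368775 = 368769 + 6) and maps it to physical position 1407 in the log file, so the message is read from byte 1407. If the index has no entry for the requested offset, Kafka takes the nearest preceding entry and scans the log file sequentially from that entry's position until it reaches the target.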
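The lookup can be sketched as a binary search over the index followed by a short forward scan of the log. This is an illustration under stated assumptions: only the entry (6, 1407) comes from the example above, while the other index entries, the record positions in `LOG`, and the `lookup` function itself are hypothetical.

```python
from bisect import bisect_right

BASE_OFFSET = 368769

# Sparse index: (relative offset, byte position in the log file).
# Only (6, 1407) is taken from the example; the rest are made-up values.
INDEX = [(1, 0), (3, 497), (6, 1407), (8, 1686)]

# Stand-in for the log file: (absolute offset, byte position), in file order.
LOG = [
    (368770, 0), (368771, 230), (368772, 497), (368773, 810),
    (368774, 1100), (368775, 1407), (368776, 1550), (368777, 1686),
]


def lookup(target: int):
    """Return (byte position, records scanned) for target, or None if absent."""
    relative = target - BASE_OFFSET
    keys = [rel for rel, _ in INDEX]
    # Find the last index entry whose relative offset is <= the target's.
    i = bisect_right(keys, relative) - 1
    start_position = INDEX[i][1] if i >= 0 else 0
    scanned = 0
    for offset, position in LOG:
        if position < start_position:
            continue  # skipped: the index lets us jump past these records
        scanned += 1
        if offset == target:
            return position, scanned
    return None


print(lookup(368775))  # (1407, 1): indexed directly, no extra scanning
print(lookup(368774))  # (1100, 3): scan starts at the entry for offset 368772
```

An indexed offset is found on the first record read, while an unindexed one costs a scan that starts from the nearest preceding entry.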

3. Index Design

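Kafka uses a sparse index: rather than indexing every message, it records an entry only for a subset of offsets (by default, roughly one entry per 4 KB of log data, controlled by index.interval.bytes). This keeps index files small enough to be held in memory cheaply. The trade-off is that a lookup that misses the index must scan forward from the nearest preceding entry, but the scan is bounded by the gap between entries, so the extra cost stays small.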
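How a sparse index grows can be sketched as follows. This is a simplified model that assumes an entry is written once more than `INDEX_INTERVAL_BYTES` bytes have accumulated since the previous entry, and it works per record, whereas Kafka works per batch; the function name and the record sizes are hypothetical.

```python
# Default of Kafka's index.interval.bytes setting.
INDEX_INTERVAL_BYTES = 4096


def build_sparse_index(record_sizes, interval=INDEX_INTERVAL_BYTES):
    """Return (relative offset, byte position) entries for a run of records."""
    index = []
    position = 0           # byte position where the next record starts
    bytes_since_entry = 0  # bytes appended since the last index entry
    for relative_offset, size in enumerate(record_sizes):
        # Add an entry only after enough data has accumulated.
        if bytes_since_entry > interval:
            index.append((relative_offset, position))
            bytes_since_entry = 0
        position += size
        bytes_since_entry += size
    return index


entries = build_sparse_index([1000] * 20)
print(entries)  # [(5, 5000), (10, 10000), (15, 15000)]
```

Here 20 records produce only 3 index entries, and any lookup scans at most the handful of records between two neighbouring entries.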

Copyright statement: This article is compiled by the Big Data Technology and Architecture team with exclusive authorization from the original author. Unauthorized reproduction will be pursued for infringement.

Editor: 冷眼丶

WeChat public account: import_bigdata

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.