Kafka Workflow and File Storage Mechanism: Topics, Partitions, Segments, Index and Log Files
This article explains Kafka’s workflow, detailing how topics, partitions, and segments are organized, the structure of index and log files, message composition, offset-based retrieval, and the overall data directory layout, providing a comprehensive overview of Kafka’s storage architecture.
Kafka is a distributed streaming platform that categorizes messages by topic. Producers write messages to a topic, and consumers read from it. Each topic is divided into one or more partitions, and each partition corresponds to a log on disk to which messages are appended sequentially. Every message is assigned an offset that is unique within its partition.
Within a partition, Kafka splits the log into multiple segments to avoid oversized files. Every segment has at least two files, an .index file and a .log file, which share a base name: the offset of the first message in the segment, zero-padded to 20 digits. The index file stores sparse entries that map logical offsets to physical byte positions in the log file, so a lookup can binary-search the index instead of scanning the whole log.
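To make the segment idea concrete, here is a minimal sketch of size-based rolling. It is a simplified model, not Kafka's implementation: the Segment class, the append_all function, and the max_segment_bytes parameter are hypothetical names, and real brokers also consider time and index size when deciding to roll.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    base_offset: int                       # offset of the first message in this segment
    size_bytes: int = 0
    message_sizes: list = field(default_factory=list)


def append_all(message_sizes, max_segment_bytes):
    """Append messages in order, rolling to a new segment when the active one would overflow."""
    segments = [Segment(base_offset=0)]
    next_offset = 0
    for size in message_sizes:
        active = segments[-1]
        # Roll only if the active segment already holds data and this message would not fit.
        if active.message_sizes and active.size_bytes + size > max_segment_bytes:
            active = Segment(base_offset=next_offset)
            segments.append(active)
        active.message_sizes.append(size)
        active.size_bytes += size
        next_offset += 1
    return segments


# Five 400-byte messages with a hypothetical 1000-byte segment limit.
segments = append_all([400, 400, 400, 400, 400], max_segment_bytes=1000)
base_offsets = [s.base_offset for s in segments]
print(base_offsets)
```

Each new segment takes the next offset to be written as its base offset, which is why the base offsets alone are enough to tell which segment holds a given message.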
The directory structure follows the pattern {topic‑name}-{partition‑id}. Inside each partition directory you will find files such as:
00000000000000000000.index
00000000000000000000.log
00000000000000170410.index
00000000000000170410.log
...
Additional checkpoint and metadata files (e.g., cleaner-offset-checkpoint, meta.properties, recovery-point-offset-checkpoint, replication-offset-checkpoint) are created in the broker’s root data directory, alongside the partition directories.
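The naming conventions in the listing can be sketched in a few lines. The helper names segment_file_name and parse_partition_dir, and the example directory name my-topic-3, are made up for illustration; only the 20-digit zero padding and the {topic-name}-{partition-id} pattern come from the text.

```python
def segment_file_name(base_offset, suffix):
    """Zero-pad the segment's base offset to 20 digits and append the file suffix."""
    return f"{base_offset:020d}{suffix}"


def parse_partition_dir(dir_name):
    """Split '{topic-name}-{partition-id}' on the last hyphen, since topic names may contain hyphens."""
    topic, _, partition = dir_name.rpartition("-")
    return topic, int(partition)


print(segment_file_name(170410, ".log"))
print(segment_file_name(170410, ".index"))
print(parse_partition_dir("my-topic-3"))
```

Because the names are fixed-width, sorting the file names lexicographically also sorts the segments by base offset.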
Message structure inside a .log segment includes a header, key, value, and metadata. The offset‑based retrieval process works as follows:
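Binary-search the segment base offsets (the file names) to locate the segment whose range contains the desired offset.
Binary-search that segment's .index file for the last entry at or before the desired offset, and read the physical position stored in that entry.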
Seek to that position in the corresponding .log file and read sequentially until the target message is found.
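The retrieval steps above can be sketched against a small in-memory model. Everything here is an assumption made for illustration: the SEGMENTS layout, the find_message function, and the 100-byte record spacing are hypothetical, and a real reader would seek within a file rather than skip list entries.

```python
from bisect import bisect_right

# Hypothetical model: base offset -> segment.
# index: sorted (relative_offset, byte_position) pairs (sparse)
# log:   sorted (byte_position, absolute_offset, payload) records
SEGMENTS = {
    0: {
        "index": [(0, 0), (3, 300)],
        "log": [(0, 0, "a"), (100, 1, "b"), (200, 2, "c"), (300, 3, "d"), (400, 4, "e")],
    },
    5: {
        "index": [(0, 0), (2, 200)],
        "log": [(0, 5, "f"), (100, 6, "g"), (200, 7, "h"), (300, 8, "i")],
    },
}


def find_message(segments, target_offset):
    # Step 1: pick the segment with the largest base offset <= target.
    base_offsets = sorted(segments)
    i = bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        return None
    base = base_offsets[i]
    segment = segments[base]

    # Step 2: in the sparse index, find the last entry at or before the target.
    relative = target_offset - base
    index = segment["index"]
    j = bisect_right(index, (relative, float("inf"))) - 1
    start_position = index[j][1] if j >= 0 else 0

    # Step 3: scan the log forward from that byte position.
    for position, offset, payload in segment["log"]:
        if position < start_position:
            continue  # stands in for a file seek to start_position
        if offset == target_offset:
            return payload
        if offset > target_offset:
            break
    return None


print(find_message(SEGMENTS, 7))
```

For offset 7, the sketch selects the segment with base offset 5, finds the index entry for relative offset 2, and starts the scan at byte position 200 rather than at the beginning of the log.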
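Kafka uses sparse indexing to keep index files small: an entry is written only after a certain amount of log data has been appended (the index.interval.bytes setting, 4096 bytes by default), not for every message. Index files are accessed as memory-mapped files, which on Linux relies on the mmap interface. The trade-off is that a sparse index usually points to a position slightly before the target, so a lookup needs a short sequential scan of the log that a dense index would avoid.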
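As a rough illustration of a memory-mapped sparse index, the sketch below writes a small index file and binary-searches it through Python's mmap module. The 8-byte entry layout (a 4-byte relative offset followed by a 4-byte file position, big-endian) is stated here as an assumption about the on-disk format, and the function names, file name demo.index, and sample entries are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Assumed entry layout: 4-byte relative offset, 4-byte file position, big-endian.
ENTRY = struct.Struct(">II")


def write_index(path, entries):
    """Write (relative_offset, position) pairs as fixed-width binary entries."""
    with open(path, "wb") as f:
        for relative_offset, position in entries:
            f.write(ENTRY.pack(relative_offset, position))


def lookup(path, relative_offset):
    """Return the file position of the last index entry at or before relative_offset."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // ENTRY.size - 1
        best = 0  # fall back to the start of the log if no entry qualifies
        while lo <= hi:
            mid = (lo + hi) // 2
            entry_offset, position = ENTRY.unpack_from(mm, mid * ENTRY.size)
            if entry_offset <= relative_offset:
                best = position
                lo = mid + 1
            else:
                hi = mid - 1
        return best


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "demo.index")
    # One entry roughly every 4096 bytes of log data, not one per message.
    write_index(index_path, [(0, 0), (40, 4096), (85, 8192), (130, 12288)])
    position_for_100 = lookup(index_path, 100)
    position_for_40 = lookup(index_path, 40)

print(position_for_100, position_for_40)
```

Fixed-width entries are what make the binary search possible: entry number n always starts at byte n times the entry size, so the mapped file can be probed directly without parsing.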
The overall data directory layout at any moment includes the partition folders, their segment files, and various auxiliary files such as .timeindex, .deleted, .cleaned, .swap, .snapshot, and .txnindex. When a consumer group commits offsets, the information is stored in the internal topic __consumer_offsets, which is created on first use.
Images in the original article illustrate the segment layout, index‑log relationship, and directory tree, providing visual guidance for understanding Kafka’s storage architecture.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
