
Understanding Kafka Message Formats Across Versions 0.7.x, 0.8.x, and 0.10.x

This article explains the evolution of Kafka message formats from version 0.7.x through 0.8.x (including 0.9.x) to 0.10.x, detailing each field, compression handling, and the design motivations behind the changes.

Qunar Tech Salon

The author, a data‑warehouse engineer at Qunar.com, had used Kafka for years without studying its message format. He then investigated the formats of Kafka 0.7.x, 0.8.x (and 0.9.x, which shares the same format), and 0.10.x, and presents the findings here with diagrams.

A well‑designed message format should allow seamless version upgrades and backward compatibility; many new features in newer Kafka releases rely on changes to the message format, so understanding this design helps developers use Kafka more effectively.

Kafka 0.7.x Message Format

Version 0.7.x uses a simple layout consisting of the following fields:

Magic (1 byte) – identifies the message format version; values 0 or 1, default is 1.

Attributes (1 byte) – stores the compression codec (gzip, snappy, or none).

Crc (4 bytes) – checksum of the message payload.

Value (N‑6 bytes) – the actual payload, where N is the total message size.
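The field list above can be sketched in a few lines of Python. This is a minimal illustration rather than Kafka's own code: the function names are hypothetical, and the codec ids (0 = none, 1 = gzip, 2 = snappy) and the use of the low two Attributes bits are assumptions not stated in the article.

```python
import struct
import zlib

MAGIC_V1 = 1
# Assumed codec ids; the article only names the codecs.
CODEC_NONE, CODEC_GZIP, CODEC_SNAPPY = 0, 1, 2


def encode_message_v07(value: bytes, codec: int = CODEC_NONE) -> bytes:
    """Pack Magic (1) + Attributes (1) + Crc (4) + Value (N-6), big-endian."""
    crc = zlib.crc32(value) & 0xFFFFFFFF  # checksum of the payload only
    return struct.pack(">BBI", MAGIC_V1, codec, crc) + value


def decode_message_v07(buf: bytes) -> dict:
    """Split a 0.7.x-style message into its fields and verify the checksum."""
    magic, attributes, crc = struct.unpack_from(">BBI", buf, 0)
    value = buf[6:]  # everything after the 6 header bytes
    if zlib.crc32(value) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    return {"magic": magic, "codec": attributes & 0x03, "value": value}
```

Encoding a 5-byte payload gives an 11-byte message, which matches the "N‑6 bytes" arithmetic for the Value field.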

Messages are grouped into a MessageSet , which adds:

Offset (8 bytes) – physical offset on disk.

Size (4 bytes) – size of the contained message.

Message (N bytes) – the message described above.

Kafka sends data to the broker as whole MessageSet units, and compression is applied to the entire set, not individual messages.
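To make the whole-set compression concrete, here is a rough sketch that frames messages as Offset + Size + Message entries, then gzips the entire set and carries it as the value of a single wrapper message. The helper names are hypothetical, the gzip codec id of 1 is an assumption, and the offsets are simply running byte positions standing in for the physical offset.

```python
import gzip
import struct
import zlib

CODEC_NONE, CODEC_GZIP = 0, 1  # assumed codec ids


def plain_message(value: bytes, codec: int = CODEC_NONE) -> bytes:
    """0.7.x-style message: Magic, Attributes, Crc of the payload, Value."""
    crc = zlib.crc32(value) & 0xFFFFFFFF
    return struct.pack(">BBI", 1, codec, crc) + value


def build_message_set(values, base_offset: int = 0) -> bytes:
    """Frame each message as Offset (8) + Size (4) + Message (N)."""
    out = bytearray()
    for value in values:
        message = plain_message(value)
        # The byte position of the entry stands in for the physical offset.
        out += struct.pack(">qi", base_offset + len(out), len(message)) + message
    return bytes(out)


def compress_message_set(values) -> bytes:
    """Compress the whole set and carry it as one wrapper message."""
    inner = build_message_set(values)
    return plain_message(gzip.compress(inner), codec=CODEC_GZIP)


def iter_entries(buf: bytes):
    """Yield (offset, message) pairs from an uncompressed message set."""
    pos = 0
    while pos < len(buf):
        offset, size = struct.unpack_from(">qi", buf, pos)
        pos += 12
        yield offset, buf[pos:pos + size]
        pos += size
```

A reader has to decompress the wrapper's value before it can see any individual entry, so the compressed set behaves as one opaque unit.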

Kafka 0.8.x (0.9.x) Message Format

Version 0.7.x had several drawbacks: compressed messages could not be addressed by offset, checkpoints could only be made on whole compressed sets (limiting semantics to at‑least‑once), and the format was unsuitable for log compaction.

Kafka 0.8.0 introduced an improved format that adds key support and explicit length fields:

Crc (4 bytes) – checksum.

Magic (1 byte) – version identifier.

Attributes (1 byte) – stores the compression codec (gzip, snappy, lz4, or none).

Key length (4 bytes) – length K of the key.

Key (K bytes) – key data.

Value length (4 bytes) – length V of the value.

Value (V bytes) – payload.
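A minimal sketch of this layout follows. It rests on several assumptions the article does not spell out: the Magic value is 0 for this format, the Crc covers every byte after the Crc field itself, a length of -1 marks a null key or value, and the codec sits in the low three Attributes bits. The function names are hypothetical.

```python
import struct
import zlib


def _bytes_field(data) -> bytes:
    """4-byte signed length followed by the bytes; -1 marks null (assumed)."""
    if data is None:
        return struct.pack(">i", -1)
    return struct.pack(">i", len(data)) + data


def _read_bytes(buf: bytes, pos: int):
    """Read one length-prefixed field and return (data, next position)."""
    (length,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    if length < 0:
        return None, pos
    return buf[pos:pos + length], pos + length


def encode_message_v08(key, value, codec: int = 0) -> bytes:
    """Crc (4) + Magic (1) + Attributes (1) + Key length/Key + Value length/Value."""
    body = struct.pack(">BB", 0, codec & 0x07)
    body += _bytes_field(key) + _bytes_field(value)
    crc = zlib.crc32(body) & 0xFFFFFFFF  # assumed to cover Magic through Value
    return struct.pack(">I", crc) + body


def decode_message_v08(buf: bytes) -> dict:
    """Verify the checksum, then split the message into its fields."""
    crc, magic, attributes = struct.unpack_from(">IBB", buf, 0)
    if zlib.crc32(buf[4:]) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    key, pos = _read_bytes(buf, 6)
    value, pos = _read_bytes(buf, pos)
    return {"magic": magic, "codec": attributes & 0x07, "key": key, "value": value}
```

The explicit length fields are what let a reader separate the key from the value, which the single Value field of 0.7.x could not express.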

The MessageSet structure remains, but the offset field now stores a logical offset (0, 1, 2, …) instead of a physical disk offset, enabling offset‑based addressing inside compressed sets and allowing checkpoints of individual messages.
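The effect of logical offsets can be shown with a small sketch: every entry inside a compressed set carries its own consecutive offset, so a reader can decompress the set and pick out one message by offset. The helper names are hypothetical, the messages are treated as opaque bytes, and gzip stands in for whichever codec is configured.

```python
import gzip
import struct


def build_inner_set(messages, first_offset: int) -> bytes:
    """Frame messages as Offset (8) + Size (4) + Message with consecutive logical offsets."""
    return b"".join(
        struct.pack(">qi", first_offset + i, len(m)) + m
        for i, m in enumerate(messages)
    )


def find_by_offset(compressed_set: bytes, target: int):
    """Decompress the set and return the message stored at the logical offset, if any."""
    buf = gzip.decompress(compressed_set)
    pos = 0
    while pos < len(buf):
        offset, size = struct.unpack_from(">qi", buf, pos)
        pos += 12
        if offset == target:
            return buf[pos:pos + size]
        pos += size
    return None
```

For example, a set built from three messages starting at offset 100 can be asked directly for offset 101, and a consumer can record that single offset as its checkpoint.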

Kafka 0.10.x Message Format

Kafka 0.10.x adds a timestamp field to support Kafka Streams. The layout is similar to 0.8.x with the following fields:

Crc (4 bytes) – checksum.

Magic (1 byte) – version identifier (default 1).

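Attributes (1 byte) – compression codec (gzip, snappy, lz4) and timestamp type (create time or append time).

Timestamp (8 bytes) – the message timestamp, interpreted as create time or append time according to the timestamp type in Attributes.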

Key length (4 bytes) – length K of the key.

Key (K bytes) – key data.

Value length (4 bytes) – length V of the value.

Value (V bytes) – payload.
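The following sketch shows how the timestamp and the timestamp type might be packed. It assumes an 8-byte millisecond timestamp placed right after Attributes, the codec in the low three Attributes bits, bit 3 as the timestamp type (0 = create time, 1 = append time), and a Crc over everything after the Crc field. The function names and codec ids are illustrative.

```python
import struct
import zlib

CODEC_MASK = 0x07           # assumed: low three bits hold the codec
TIMESTAMP_TYPE_MASK = 0x08  # assumed: bit 3 holds the timestamp type
CODEC_NAMES = {0: "none", 1: "gzip", 2: "snappy", 3: "lz4"}


def _bytes_field(data) -> bytes:
    """4-byte signed length followed by the bytes; -1 marks null (assumed)."""
    if data is None:
        return struct.pack(">i", -1)
    return struct.pack(">i", len(data)) + data


def encode_message_v010(key, value, timestamp_ms: int,
                        codec: int = 0, log_append_time: bool = False) -> bytes:
    """Crc + Magic (1) + Attributes + Timestamp (8) + key and value fields."""
    attributes = codec & CODEC_MASK
    if log_append_time:
        attributes |= TIMESTAMP_TYPE_MASK
    body = struct.pack(">BBq", 1, attributes, timestamp_ms)
    body += _bytes_field(key) + _bytes_field(value)
    return struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF) + body


def describe_attributes(attributes: int):
    """Return (codec name, timestamp type) decoded from the Attributes byte."""
    codec = CODEC_NAMES.get(attributes & CODEC_MASK, "unknown")
    ts_type = "LogAppendTime" if attributes & TIMESTAMP_TYPE_MASK else "CreateTime"
    return codec, ts_type
```

Packing both the codec and the timestamp type into the one existing Attributes byte means the only new space on the wire is the 8-byte timestamp itself.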

Compressed messages in 0.10.x follow the same pattern as in 0.8.x.

The latest Kafka version (0.11.x) introduces a completely new message format; the article does not cover it and suggests consulting the official documentation for details.

