Manual Kafka Offset Management in Spark Streaming using createDirectStream (Java & Scala)

This article explains how to use Spark Streaming's Direct Approach with Kafka, manually manage offsets, and provides complete Java and Scala implementations—including a JavaKafkaManager class, a demo application, and a Scala KafkaManager—illustrating the creation of DirectKafkaInputDStream, offset handling, and integration with Spark.

Big Data Technology & Architecture

Aug 4, 2020

In Spark Streaming, the recommended way to consume Kafka data is the Direct Approach via KafkaUtils.createDirectStream. Because this approach does not write consumed offsets back to ZooKeeper, the developer has to manage them manually. The article first describes the concept and then presents a Java implementation derived from existing Scala examples.

KafkaUtils.createDirectStream uses a KafkaCluster instance to retrieve partition information and the initial offsets (earliest or latest, depending on the auto.offset.reset setting), and then constructs the DirectKafkaInputDStream. For each streaming batch, the stream's compute method determines the untilOffset for every partition, builds the corresponding KafkaRDD, reports the offsets to InputInfoTracker, and returns the RDD.
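The per-batch offset bookkeeping described above can be modelled in plain Java. This is a simplified sketch, not Spark's code: the class BatchOffsetPlanner, the "topic-partition" string keys, and the maxPerPartition cap are all hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of how each batch picks its offset range per partition. */
class BatchOffsetPlanner {
    // Offset at which the next batch starts, keyed by a "topic-partition" label.
    private final Map<String, Long> currentOffsets;

    BatchOffsetPlanner(Map<String, Long> initialOffsets) {
        this.currentOffsets = new HashMap<>(initialOffsets);
    }

    /**
     * Plans one batch. Each partition reads [fromOffset, untilOffset), where
     * untilOffset is the latest leader offset, capped at maxPerPartition records
     * when maxPerPartition is positive. Returns partition -> {fromOffset, untilOffset}
     * and advances the start position for the following batch.
     */
    Map<String, long[]> planBatch(Map<String, Long> latestLeaderOffsets, long maxPerPartition) {
        Map<String, long[]> ranges = new HashMap<>();
        for (Map.Entry<String, Long> entry : currentOffsets.entrySet()) {
            long from = entry.getValue();
            // Never plan a range that moves backwards.
            long latest = Math.max(from, latestLeaderOffsets.getOrDefault(entry.getKey(), from));
            long available = latest - from;
            long until = (maxPerPartition > 0 && available > maxPerPartition)
                    ? from + maxPerPartition
                    : latest;
            ranges.put(entry.getKey(), new long[] {from, until});
            entry.setValue(until); // the next batch starts where this one ends
        }
        return ranges;
    }
}
```

Starting at offset 100 with the leader at 250 and a cap of 100 records, the first batch covers [100, 200) and the second covers [200, 250).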

A KafkaRDD consists of one KafkaRDDPartition per Kafka partition. Building the RDD fetches nothing; the actual data is pulled lazily through a KafkaRDDIterator only when an action is triggered.
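The lazy-fetch behaviour can be illustrated with an ordinary Java iterator. This is only an analogy under stated assumptions: LazyOffsetIterator and its fetchRecord callback are hypothetical stand-ins for illustration, not the KafkaRDDIterator itself.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.LongFunction;

/** Walks offsets [fromOffset, untilOffset) and fetches a record only when next() is called. */
class LazyOffsetIterator<T> implements Iterator<T> {
    private final long untilOffset;            // exclusive end of the range
    private final LongFunction<T> fetchRecord; // hypothetical stand-in for a Kafka fetch
    private long nextOffset;

    LazyOffsetIterator(long fromOffset, long untilOffset, LongFunction<T> fetchRecord) {
        this.nextOffset = fromOffset;
        this.untilOffset = untilOffset;
        this.fetchRecord = fetchRecord;
    }

    @Override
    public boolean hasNext() {
        return nextOffset < untilOffset;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException("offset range exhausted");
        }
        // The fetch happens here, not in the constructor.
        return fetchRecord.apply(nextOffset++);
    }
}
```

Constructing the iterator only records the offset range; the callback runs once per consumed element, which mirrors how defining the RDD is cheap and the read happens when the partition is actually computed.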

The Java version provides a JavaKafkaManager class that implements Serializable. It includes methods to create a direct stream with managed offsets, to update or set offsets in ZooKeeper before the stream is created, and to write the processed offsets back to ZooKeeper after each RDD via updateZKOffsets. A helper method, toScalaImmutableMap, converts Java maps to Scala immutable maps.
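The ordering matters here: offsets are written back only after the batch has been processed. The following minimal sketch shows that ordering with an in-memory map standing in for ZooKeeper. OffsetCommitSketch, InMemoryOffsetStore, and processThenCommit are hypothetical names, not part of JavaKafkaManager.

```java
import java.util.HashMap;
import java.util.Map;

class OffsetCommitSketch {
    /** Hypothetical in-memory stand-in for the consumer offsets kept in ZooKeeper. */
    static final class InMemoryOffsetStore {
        private final Map<String, Long> committed = new HashMap<>();

        void commit(String topicPartition, long untilOffset) {
            committed.put(topicPartition, untilOffset);
        }

        /** Returns the committed offset, or null if nothing was committed yet. */
        Long committedOffset(String topicPartition) {
            return committed.get(topicPartition);
        }
    }

    /** Runs the batch's processing first and commits its untilOffsets only if that succeeded. */
    static void processThenCommit(Map<String, Long> untilOffsets,
                                  Runnable processBatch,
                                  InMemoryOffsetStore store) {
        // An exception here skips the commit, so the same range is read again later.
        processBatch.run();
        for (Map.Entry<String, Long> entry : untilOffsets.entrySet()) {
            store.commit(entry.getKey(), entry.getValue());
        }
    }
}
```

Committing after processing gives at-least-once behaviour: a failed batch leaves the stored offset untouched, at the cost of possibly reprocessing records, so downstream writes should tolerate duplicates.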

A runnable demo, KafkaManagerDemo, shows how to configure SparkConf, create a JavaStreamingContext, instantiate JavaKafkaManager, transform and process incoming messages, and finally update the offsets in ZooKeeper.

The article also supplies an equivalent Scala implementation named KafkaManager. It mirrors the Java logic: creating the direct stream, handling offset updates before consumption, and providing an updateZKOffsets method that writes the latest offsets back to ZooKeeper.

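Both implementations also show how to handle offset-out-of-range exceptions: they compare the stored consumer offsets with the earliest leader offsets and reset the stored offsets when they are no longer valid (for example, when a stored offset is smaller than the earliest offset still available), so that streaming can proceed reliably.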
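The comparison itself is small enough to sketch in plain Java. This is an illustration under assumptions, not the article's code: OffsetRangeGuard, resetStaleOffsets, and the "topic-partition" string keys are hypothetical, and partitions without a stored offset are simply left alone.

```java
import java.util.HashMap;
import java.util.Map;

class OffsetRangeGuard {
    /**
     * Returns a copy of consumerOffsets in which every stored offset that is smaller
     * than the partition's earliest leader offset is moved up to that earliest offset.
     * Offsets that are still within range are returned unchanged.
     */
    static Map<String, Long> resetStaleOffsets(Map<String, Long> consumerOffsets,
                                               Map<String, Long> earliestLeaderOffsets) {
        Map<String, Long> corrected = new HashMap<>(consumerOffsets);
        for (Map.Entry<String, Long> entry : earliestLeaderOffsets.entrySet()) {
            Long stored = consumerOffsets.get(entry.getKey());
            long earliest = entry.getValue();
            if (stored != null && stored < earliest) {
                // The stored position no longer exists on the broker; restart from
                // the oldest record still available.
                corrected.put(entry.getKey(), earliest);
            }
        }
        return corrected;
    }
}
```

Resetting to the earliest available offset skips only the records that the broker has already discarded; starting from the latest offset instead would be a different policy and would drop more data.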

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
