Manual Kafka Offset Management in Spark Streaming using createDirectStream (Java & Scala)
This article explains how to use Spark Streaming's Direct Approach with Kafka, manually manage offsets, and provides complete Java and Scala implementations—including a JavaKafkaManager class, a demo application, and a Scala KafkaManager—illustrating the creation of DirectKafkaInputDStream, offset handling, and integration with Spark.
In Spark Streaming the recommended way to consume Kafka data is the Direct Approach via KafkaUtils.createDirectStream, which requires the developer to manage offsets manually; the article first describes the concept and then presents a Java implementation derived from existing Scala examples.
The DirectKafkaInputDStream is created by a KafkaCluster instance that retrieves partition information and initial offsets (earliest or latest depending on the auto.offset.reset setting). Each streaming batch invokes the .compute method, which obtains the untilOffset, builds a corresponding KafkaRDD, reports the offset to InputInfoTracker, and returns the RDD.
KafkaRDD consists of one KafkaRDDPartition per Kafka partition; actual data is fetched lazily through a KafkaRDDIterator only when an action is triggered.
The Java version provides a JavaKafkaManager class that implements Serializable. It includes methods to create a direct stream with managed offsets, update or set offsets in Zookeeper before stream creation, and write the processed offsets back after each RDD using updateZKOffsets. A helper method toScalaImmutableMap converts Java maps to Scala immutable maps.
A runnable demo KafkaManagerDemo shows how to configure SparkConf, create a JavaStreamingContext, instantiate JavaKafkaManager, transform and process incoming messages, and finally update Zookeeper offsets.
The article also supplies an equivalent Scala implementation named KafkaManager. It mirrors the Java logic: creating the direct stream, handling offset updates before consumption, and providing an updateZKOffsets method that writes back the latest offsets to Zookeeper.
Both implementations demonstrate how to handle offset out‑of‑range exceptions by comparing stored consumer offsets with the earliest leader offsets and resetting them when necessary, ensuring reliable streaming processing.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
