Big Data 8 min read

Achieving Exactly-Once Semantics in Kafka and Spark Streaming

This article explains the three message delivery semantics in distributed stream processing, compares Kafka‑Spark Streaming integration methods (receiver vs direct stream), and details how to achieve exactly‑once guarantees through idempotent or transactional writes, including code examples.

Big Data Technology & Architecture

Jun 13, 2020

Achieving Exactly-Once Semantics in Kafka and Spark Streaming

In distributed stream processing systems such as Kafka, Storm, Flink, and Spark Streaming, three message delivery semantics exist: at least once (messages may be delivered multiple times), at most once (messages are delivered zero or one time), and exactly once (each message is delivered precisely once).

In most daily workloads, 90% of streaming jobs are built with Kafka + Spark Streaming + HDFS, where Kafka acts as the message queue. This article focuses on how to guarantee the exactly‑once semantics.

The traditional receiver‑based approach uses Kafka’s high‑level consumer API; each executor continuously pulls messages and writes them to both executor memory and a write‑ahead log (WAL) on HDFS, updating the offset in ZooKeeper after the WAL write. This method can only guarantee at‑least‑once semantics because offset updates may fail, causing duplicate processing and reduced throughput.

The newer direct‑stream approach uses Kafka’s simple consumer API. Offsets are obtained by the driver for each batch, and executors read messages directly from the assigned offset ranges. Since offsets are managed solely by the streaming application, inconsistencies are eliminated, enabling exactly‑once delivery, provided the application correctly manages offsets.

Spark RDDs are immutable, partitioned, and fault‑tolerant; they can be recomputed from the original data, ensuring deterministic results and supporting exactly‑once processing logic.

Output operations in Spark Streaming typically use foreachRDD, which by default provides at‑least‑once guarantees. To achieve exactly‑once output, two strategies are possible: idempotent writes or transactional writes.

Idempotent write means that writing the same record multiple times yields the same result as writing it once. This works well for data with natural primary keys (e.g., logs, MySQL binlog). It requires a map‑only processing pipeline without shuffles or aggregations.

stream.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        // make sure connection pool is set up on the executor before writing
        SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)

        iter.foreach { case (key, msg) =>
          DB.autoCommit { implicit session =>
            // the unique key for idempotency is just the text of the message itself, for example purposes
            sql"insert into idem_data(msg) values (${msg})".update.apply
          }
        }
      }
    }

Transactional write follows the classic DBMS transaction model. A unique identifier (e.g., topic, partition, offset) is stored together with the computation results. If either the data write or the offset update fails, the whole transaction is rolled back.

// localTx is transactional, if metric update or offset update fails, neither will be committed
    DB.localTx { implicit session =>
      // store metric data
      val metricRows = sql"""
    update txn_data set metric = metric + ${metric}
      where topic = ${osr.topic}
    """.update.apply()
      if (metricRows != 1) {
        throw new Exception("...")
      }

      // store offsets
      val offsetRows = sql"""
    update txn_offsets set off = ${osr.untilOffset}
      where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
    """.update.apply()
      if (offsetRows != 1) {
        throw new Exception("...")
      }
    }

By combining the direct‑stream integration for Kafka, Spark’s immutable RDD processing, and either idempotent or transactional output, a complete exactly‑once pipeline can be built for reliable big‑data streaming applications.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data kafka @Transactional Spark Streaming Exactly-once idempotent

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.