Real-time Data Processing with Kafka, Spark Streaming, and HBase: Implementation Guide

This article presents a step‑by‑step guide for building a real‑time data pipeline using Kafka as a message buffer, Spark‑Streaming's Direct Approach for processing, and HBase for storage, including code examples, Maven configuration, local cluster setup, and troubleshooting tips.

Big Data Technology & Architecture

Jan 7, 2020

Background: Kafka collects data from Flume or business systems and serves as a reliable message buffer for downstream real‑time computation. Spark 1.3+ supports both the Receiver‑based and the Direct approach for reading from Kafka; this guide uses the Direct approach and stores the processed data in HBase.

The implementation has three parts: a simulated Kafka message producer, a Spark‑Streaming job that consumes the Kafka data in real time with the Direct approach, and a step that stores the computed results in HBase.
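
To make the computation step concrete, here is a minimal plain-Scala sketch of what the streaming job does to one micro-batch: it sums click counts per user id. The `ClickEvent` case class and the `BatchAggregation` object are hypothetical names used only for illustration; the actual pipeline carries JSON strings and performs the same keyed sum with Spark.

```scala
// Hypothetical event shape for illustration; the real pipeline carries JSON strings.
final case class ClickEvent(uid: String, osType: String, clickCount: Int)

object BatchAggregation {
  // Sum click counts per uid within one batch, mirroring a keyed reduce.
  def clicksPerUser(batch: Seq[ClickEvent]): Map[String, Int] =
    batch
      .groupBy(_.uid)
      .map { case (uid, events) => uid -> events.map(_.clickCount).sum }
}
```

Each batch is aggregated independently, so the totals written to HBase describe a single batch interval rather than a running count.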

Local virtual‑machine cluster configuration: due to limited hardware, Hadoop, Zookeeper, and Kafka are all deployed on three hosts named hadoop1, hadoop2, hadoop3; HBase runs as a single‑node instance on hadoop1.

Drawbacks and shortcomings: the current code design has some flaws. In particular, the HBase write logic that runs after the Spark‑Streaming computation performs poorly, because it creates a new HTable and flushes it for every record.

Code implementation – Kafka message simulator:

package clickstream
import java.util.{Properties, Random, UUID}
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.codehaus.jettison.json.JSONObject

object KafkaMessageGenerator {
  private val random = new Random()
  private var pointer = -1
  private val os_type = Array(
    "Android", "IPhone OS",
    "None", "Windows Phone")

  // Random click count in the range [0, 10)
  def click() : Int = {
    random.nextInt(10)
  }

  // Cycle through os_type in round-robin order
  def getOsType() : String = {
    pointer = (pointer + 1) % os_type.length
    os_type(pointer)
  }

  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    // Kafka broker addresses on the local VM cluster
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while(true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", UUID.randomUUID()) // randomly generated user id
        .put("event_time", System.currentTimeMillis.toString) // time the event occurred
        .put("os_type", getOsType) // device type
        .put("click_count", click) // number of clicks

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}

Code implementation – Spark‑Streaming main class (PageViewStream):

package clickstream
import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PageViewStream {
  def main(args: Array[String]): Unit = {
    var masterUrl = "local[2]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("PageViewStream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    // Must match the topic the message generator publishes to
    val topics = Set("user_events")
    // Kafka broker addresses on the local VM cluster
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers)

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })
    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          // HBase configuration
          val tableName = "PageViewStream"
          val hbaseConf = HBaseConfiguration.create()
          // ZooKeeper host only; the client port is set separately below
          hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
          hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
          hbaseConf.set("hbase.defaults.for.version.skip", "true")
          // user ID
          val uid = pair._1
          // click count
          val click = pair._2
          // assemble the row
          val put = new Put(Bytes.toBytes(uid))
          put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
          val StatTable = new HTable(hbaseConf, TableName.valueOf(tableName))
          StatTable.setAutoFlush(false, false)
          // client-side write buffer size (3 MB)
          StatTable.setWriteBufferSize(3*1024*1024)
          StatTable.put(put)
          // flush the buffered write to HBase
          StatTable.flushCommits()
          // release the table's resources
          StatTable.close()
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
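
The job above opens a table and flushes once for every record. One possible way to reduce that cost, sketched here in plain Scala rather than taken from the original article, is to drain each partition in fixed-size batches and flush once per batch. `BatchedWrites`, `writeInBatches`, and the `flush` callback are hypothetical names; in the real job the callback would presumably collect `Put` objects into a `java.util.List` and pass it to `HTable.put` on a table opened once per partition and closed afterwards.

```scala
object BatchedWrites {
  // Drain one partition's records in fixed-size batches, calling `flush`
  // once per batch instead of once per record. Returns the number of flushes.
  def writeInBatches[T](records: Iterator[T], batchSize: Int)(flush: Seq[T] => Unit): Int = {
    require(batchSize > 0, "batchSize must be positive")
    var flushes = 0
    records.grouped(batchSize).foreach { batch =>
      flush(batch.toSeq)
      flushes += 1
    }
    flushes
  }
}
```

Inside `rdd.foreachPartition`, the partition iterator would be passed as `records`. The batch size is a tuning knob that trades memory held on the executor against the number of round trips to HBase.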

The Maven pom.xml file lists all required dependencies for Spark, Kafka, HBase, JSON handling, logging, and Hadoop, and configures the Scala Maven plugin and the Maven Shade plugin for building an executable JAR.

FAQ section addresses common problems such as Maven JSON‑lib dependency errors, Spark‑Streaming serialization exceptions, and Maven packaging failures caused by non‑ASCII characters in the local Maven repository path.
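
The FAQ's details are not reproduced here. As a general illustration only, one common cause of serialization exceptions in Spark‑Streaming jobs is a closure that captures an object which cannot be serialized, such as a live connection. The plain-Scala sketch below shows the difference between holding such an object as a field and creating it lazily where it is used. All class names (`FakeConnection`, `EagerWriter`, `LazyWriter`, `SerializationCheck`) are hypothetical stand-ins, not part of Spark or HBase.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a client object that cannot be serialized.
class FakeConnection(val quorum: String)

// Holds a live connection as a field, so serializing it fails.
class EagerWriter(val conn: FakeConnection) extends Serializable

// Holds only the settings; the connection is built lazily where it is used
// and is excluded from serialization by @transient.
class LazyWriter(val quorum: String) extends Serializable {
  @transient lazy val conn: FakeConnection = new FakeConnection(quorum)
}

object SerializationCheck {
  // True if Java serialization of `value` succeeds.
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

The job in this article follows the same idea by building its HBase configuration and table inside `foreachPartition`, so those objects are created on the executors and never need to be shipped from the driver.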

Reference documents include the official Spark‑Streaming programming guide, the Kafka integration guide, the Flume integration guide, the custom receiver guide, an example repository, and a blog post for additional context.

Author: MichaelFly – original article link: https://www.jianshu.com/p/ccba410462ba

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
