End‑to‑End Streaming Data Pipeline with Kafka, Flink, and Apache Griffin

This tutorial demonstrates how to build a complete streaming data pipeline by configuring JDK, MySQL, Hadoop, Hive, Spark, Kafka, and Griffin, generating test data with shell scripts, processing it with Flink, and validating data quality using Apache Griffin in a Spark‑based deployment.

Big Data Technology & Architecture

Mar 16, 2022

1. Components and Versions

The tutorial uses JDK 1.8, MySQL 5.6, Hadoop 2.7.2, Hive 2.4, Spark 2.4.1, Kafka 0.11, Griffin 0.6.0, and Zookeeper 3.4.1, and assumes all of them are already installed.

2. Kafka Data Generation Scripts

Two Bash scripts are provided. gen-data.sh creates JSON records with dynamic timestamps and random colors, and writes them to /opt/module/data/source.data. streaming-data.sh creates the required Kafka topics, then runs gen-data.sh every 90 seconds and feeds the generated data into the source topic.
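
The record shape can be sketched in Java, the language of the pipeline's Flink job. The article's generator is a Bash script, so this is a stand-in rather than the original code. The class name, the color list, the name format, and the epoch-millisecond timestamp are all assumptions; only the four field names come from the Student class described in section 5.

```java
import java.util.Locale;
import java.util.Random;

// Hypothetical stand-in for gen-data.sh; not the article's script.
class SourceRecordSketch {
    // Assumed palette: the article only says the colors are random.
    static final String[] COLORS = {"red", "yellow", "green", "blue"};

    // Builds one JSON line carrying the four Student fields.
    static String buildRecord(int id, String name, String color, long timeMillis) {
        return String.format(Locale.ROOT,
            "{\"id\":%d,\"name\":\"%s\",\"color\":\"%s\",\"time\":%d}",
            id, name, color, timeMillis);
    }

    // Picks one color uniformly at random from the assumed palette.
    static String randomColor(Random rng) {
        return COLORS[rng.nextInt(COLORS.length)];
    }

    public static void main(String[] args) {
        Random rng = new Random();
        long now = System.currentTimeMillis(); // "dynamic timestamp", assumed to be epoch millis
        for (int id = 1; id <= 3; id++) {
            System.out.println(buildRecord(id, "student" + id, randomColor(rng), now));
        }
    }
}
```

Each printed line is one record of the kind that would be appended to the data file and then published to the source topic.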

3. Flink Streaming Processing

The Flink job reads from the source Kafka topic, modifies the name field of each record with 20% probability, and writes the processed records to the target topic using FlinkKafkaProducer010. The job is defined in the flinkProcess class and uses Gson for JSON parsing.
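
The 20% modification step can be isolated from Flink and Kafka as a plain function. This is a minimal sketch, not the article's code: the class and method names are hypothetical, and the "_changed" suffix is an assumption, since the summary does not say how the name is altered.

```java
import java.util.Random;

// Hypothetical helper isolating the per-record transformation.
class NameMutationSketch {
    // Probability taken from the article: 20% of records get a modified name.
    static final double MUTATION_PROBABILITY = 0.2;

    // Returns the name unchanged about 80% of the time; otherwise returns a modified copy.
    static String maybeMutate(String name, Random rng) {
        if (rng.nextDouble() < MUTATION_PROBABILITY) {
            return name + "_changed"; // assumed form of the modification
        }
        return name;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int total = 10000;
        int mutated = 0;
        for (int i = 0; i < total; i++) {
            if (!maybeMutate("student", rng).equals("student")) {
                mutated++;
            }
        }
        System.out.println("mutated fraction: " + (double) mutated / total);
    }
}
```

In the actual job, logic of this kind would presumably run inside a map step between parsing the incoming JSON and serializing the outgoing record. The deliberate mismatch between source and target is what gives the later data quality check something to detect.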

4. Apache Griffin Configuration and Startup

Two JSON configuration files are shown: dq.json defines the streaming data sources, connectors, pre‑processing steps, and evaluation rules for data quality; env.json configures Spark execution parameters, sinks (Console, HDFS, Elasticsearch), and checkpointing via Zookeeper. The pipeline is submitted with spark-submit to run the Griffin measurement application.
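
The summary does not say which kind of rule dq.json defines. Assuming an accuracy-style rule, which compares source records against target records, the metric itself can be sketched in plain Java. Griffin evaluates its rules on Spark, so the code below only illustrates the arithmetic; the class name, the string encoding of records, and the handling of an empty source are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of an accuracy-style metric; not Griffin code.
class AccuracySketch {
    // Fraction of source records that also appear, unchanged, in the target.
    static double accuracy(List<String> source, List<String> target) {
        if (source.isEmpty()) {
            return 1.0; // assumed convention: nothing to mismatch
        }
        Set<String> targetSet = new HashSet<>(target);
        long matched = 0;
        for (String record : source) {
            if (targetSet.contains(record)) {
                matched++;
            }
        }
        return (double) matched / source.size();
    }

    public static void main(String[] args) {
        // Records encoded as "id|name|color" purely for the example.
        List<String> source = Arrays.asList("1|amy|red", "2|bob|blue", "3|cat|green", "4|dan|red", "5|eve|blue");
        // One name was altered on the way to the target topic.
        List<String> target = Arrays.asList("1|amy|red", "2|bob|blue", "3|cat_changed|green", "4|dan|red", "5|eve|blue");
        System.out.println("accuracy: " + accuracy(source, target));
    }
}
```

With the Flink job altering roughly one name in five, a check of this kind would be expected to report a match rate of around 80% over a large enough window.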

5. Global Code

Two Java classes are included: Student (a POJO with id, name, color, time) and flinkProcess (the main Flink job). The code demonstrates environment setup, Kafka consumer/producer creation, data transformation, and execution.
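
A POJO with the four listed fields might look like the sketch below. The field names come from the article; the field types (an int id and an epoch-millisecond long for time) and the exact set of constructors and accessors are assumptions. The class is package-private here only so that the sketch compiles from any file name.

```java
// Sketch of the Student POJO; field types are assumed, not taken from the article.
class Student {
    private int id;
    private String name;
    private String color;
    private long time; // assumed to be an epoch-millisecond timestamp

    // No-argument constructor, convenient for JSON libraries that build objects reflectively.
    Student() {
    }

    Student(int id, String name, String color, long time) {
        this.id = id;
        this.name = name;
        this.color = color;
        this.time = time;
    }

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getColor() { return color; }
    public void setColor(String color) { this.color = color; }

    public long getTime() { return time; }
    public void setTime(long time) { this.time = time; }

    @Override
    public String toString() {
        return "Student{id=" + id + ", name='" + name + "', color='" + color + "', time=" + time + "}";
    }

    public static void main(String[] args) {
        Student s = new Student(1, "amy", "red", System.currentTimeMillis());
        s.setName("amy_changed"); // the kind of edit the Flink job applies to some records
        System.out.println(s);
    }
}
```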

Note: If malformed data appears in Kafka, the job may fail. The article suggests cleaning the Kafka data directories and Zookeeper znodes before regenerating the data.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.