Apache Griffin: An Overview of the Big Data Data‑Quality Monitoring Tool
This article introduces Apache Griffin, a model‑driven big‑data data‑quality monitoring platform. It explains the key features, architecture, and installation requirements, and provides step‑by‑step usage examples with Hive, Kafka, and Spark integration.
Apache Griffin is an open‑source data‑quality monitoring solution for big‑data environments. It supports batch sources such as Hive tables, text files, and Avro files, as well as streaming sources such as Kafka, and it can be extended to relational databases such as MySQL.
Griffin follows a model‑driven approach where users define quality dimensions (accuracy, completeness, timeliness, uniqueness, validity, consistency) and apply them to target or source datasets. It supports two data‑source types: batch data collected via Hadoop connectors and streaming data from message systems.
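To make two of these dimensions concrete, here is a minimal Python sketch of accuracy (the share of source records with an exact match in the target) and completeness (the share of records with a non-empty value in a column). It only illustrates the definitions. It is not Griffin's implementation, which runs as Spark jobs, and the function names and sample records are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def accuracy(source: List[Record], target: List[Record], keys: Tuple[str, ...]) -> float:
    """Fraction of source records that have an exact match in target on `keys`."""
    if not source:
        return 1.0
    # Index the target once so each source lookup is a set membership test.
    target_index = {tuple(row.get(k) for k in keys) for row in target}
    matched = sum(1 for row in source if tuple(row.get(k) for k in keys) in target_index)
    return matched / len(source)


def completeness(records: List[Record], column: str) -> float:
    """Fraction of records whose `column` is present and non-empty."""
    if not records:
        return 1.0
    filled = sum(1 for row in records if row.get(column) not in (None, ""))
    return filled / len(records)


# Hypothetical sample data, not taken from the article.
source = [
    {"id": 1, "age": 30, "desc": "a"},
    {"id": 2, "age": 41, "desc": "b"},
    {"id": 3, "age": 25, "desc": ""},
    {"id": 4, "age": 52, "desc": "d"},
]
target = [
    {"id": 1, "age": 30, "desc": "a"},
    {"id": 2, "age": 40, "desc": "b"},  # age differs from the source row
    {"id": 3, "age": 25, "desc": ""},
]

acc = accuracy(source, target, ("id", "age", "desc"))
comp = completeness(source, "desc")
print(f"accuracy={acc:.2f} completeness={comp:.2f}")
```

Two of the four source rows have an exact counterpart in the target, and three of the four have a non-empty `desc`, so the sketch reports an accuracy of 0.50 and a completeness of 0.75.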
Key features include metric collection, anomaly detection with rule‑based alerts, email or portal notifications, visual dashboards, real‑time detection, scalability to petabyte‑scale workloads, and a self‑service UI for managing assets and quality rules.
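As a rough illustration of what a rule‑based alert could look like, the Python sketch below checks a series of metric readings against a minimum threshold. The `ThresholdRule` class, the 0.95 threshold, and the readings are all assumptions made for this example. Griffin's own rule format and alerting logic are not described in the article and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: alert when a metric falls below `minimum`."""
    metric: str
    minimum: float


def evaluate(rule: ThresholdRule, readings: List[float]) -> List[str]:
    """Return one alert message per reading that violates the rule."""
    return [
        f"{rule.metric} = {value:.2f} is below {rule.minimum:.2f} (run {i})"
        for i, value in enumerate(readings)
        if value < rule.minimum
    ]


# Hypothetical accuracy values from four consecutive job runs.
rule = ThresholdRule(metric="accuracy", minimum=0.95)
alerts = evaluate(rule, [0.99, 0.97, 0.91, 0.98])
for message in alerts:
    print(message)
```

Only the third run (index 2) falls below the threshold, so a single alert is produced. In a real deployment the message would be routed to email or the portal rather than printed.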
The system architecture is divided into three logical layers—Define, Measure, and Analyze—implemented across a data‑collection/processing layer, a backend service layer (Spring Boot), and a user‑interface layer. Diagrams illustrate the component interactions and processing flow.
Installation requires JDK 1.8+, MySQL 5.6+, Hadoop 2.6+, Hive 2.x, Spark 2.2.1, Livy, and Elasticsearch 5+. The quick‑start guide is available on the official website, and the source code can be checked out from GitHub at tag griffin-0.4.0 and built locally.
To get started, the article walks through creating Hive tables for source and target data, generating test data, and configuring a quality job via the Griffin UI. The Hive DDL example is shown below:
--create hive tables here. hql script
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_src`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_src';
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_tgt`(
`id` bigint,
`age` int,
`desc` string)
PARTITIONED BY (
`dt` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION
'hdfs:///griffin/data/batch/demo_tgt';
After loading data, users create a measurement task in the UI, configure source/target mappings, set partitions and conditions, and schedule Spark jobs (via Quartz/Livy) to execute the quality checks. The article also shows screenshots of the job creation UI and the monitoring dashboard.
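The test‑data step that precedes these UI actions can be sketched in Python as follows. The sketch builds pipe‑delimited rows in the column order of the DDL above (id, age, desc) and derives a target copy with a few deliberate mismatches, so that an accuracy job has something to detect. The helper names, row counts, and value ranges are assumptions for illustration. The article's own data‑generation script is not reproduced here.

```python
import random
from typing import List


def make_rows(count: int, seed: int = 0) -> List[str]:
    """Build `count` pipe-delimited lines in the DDL's column order: id|age|desc."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same rows
    return [f"{i}|{rng.randint(18, 80)}|row{i}" for i in range(count)]


def corrupt(rows: List[str], every: int) -> List[str]:
    """Copy `rows`, bumping the age of every `every`-th row to create mismatches."""
    out = []
    for i, line in enumerate(rows):
        row_id, age, desc = line.split("|")
        if i % every == 0:
            age = str(int(age) + 1)
        out.append("|".join((row_id, age, desc)))
    return out


src_rows = make_rows(10)
tgt_rows = corrupt(src_rows, every=5)  # rows 0 and 5 now differ from the source

# Count how many target rows no longer match their source counterpart.
mismatches = sum(1 for s, t in zip(src_rows, tgt_rows) if s != t)
print(f"{len(src_rows)} source rows, {mismatches} mismatched target rows")
```

The rows would then be written to files and placed under the partition directories of `demo_src` and `demo_tgt`. The partition columns `dt` and `hour` come from the directory layout, so they do not appear in the file contents.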
In the summary, the author notes that mastering Griffin requires familiarity with the Apache ecosystem (Spark, Hadoop, Hive, Livy, Quartz) and highlights known limitations: limited built‑in support for MySQL/other RDBMS, reliance on Spark for metric execution, and the need to extend Scala code in the Measure module.
Additional reference links to related blog posts and tutorials are provided for further reading.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
