How Volcengine Solves Big Data Quality Challenges with a Unified Stream‑Batch Platform

Volcengine’s Data Quality Platform reconciles thorough data validation with the heavy compute cost it incurs in large-scale environments. It offers unified stream-batch monitoring, data exploration, data comparison, and alerting across Hive, ClickHouse, Kafka, and other systems, while addressing scalability, latency, and resource-optimization challenges.

Volcano Engine Developer Services

Aug 11, 2021

What Is Data Quality

Data quality refers to the degree to which data meets a set of inherent characteristics (quality dimensions). The industry typically defines six dimensions:

Completeness : Whether records and fields are complete without missing values.

Accuracy : Whether the information in records is correct and free of errors.

Consistency : Whether the same metric yields identical results wherever it is computed or stored.

Timeliness : Whether data is produced quickly enough to be valuable.

Conformity : Whether data follows required rules such as email or IP format validation.

Uniqueness : Whether there are duplicate values in fields.
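Several of these dimensions reduce to simple ratios over a set of records. The following is a minimal sketch, assuming records are plain dictionaries; the field names (`user_id`, `email`) and the deliberately simple email pattern are illustrative assumptions, not part of the platform.

```python
import re

# Deliberately simple pattern for illustration; real email validation is stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Fraction of rows whose `field` is present and not None/empty."""
    if not rows:
        return 1.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, field):
    """Fraction of non-null values of `field` that are distinct."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 1.0
    return len(set(values)) / len(values)


def conformity(rows, field, pattern):
    """Fraction of non-null values of `field` that match `pattern`."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 1.0
    return sum(1 for v in values if pattern.match(v)) / len(values)


# Made-up sample records.
rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": "not-an-email"},
    {"user_id": 2, "email": None},
    {"user_id": 4, "email": "d@example.com"},
]

print(completeness(rows, "email"))          # share of rows with an email
print(uniqueness(rows, "user_id"))          # share of distinct user_id values
print(conformity(rows, "email", EMAIL_RE))  # share of emails with a valid shape
```

Expressing each dimension as a ratio between 0 and 1 makes it straightforward to attach a threshold to it in a monitoring rule.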

Platform Overview

Volcengine’s Data Quality Platform, refined through years of service for ByteDance products like Toutiao and Douyin, addresses complex data‑quality scenarios across multiple product lines. It reconciles the conflict between extensive data‑quality validation and high resource consumption.

Key Functions

Data Exploration : View data details and distribution across dimensions.

Data Comparison : Compare online and test tables to detect inconsistencies.

Task Monitoring : Monitor online data, with alerting and circuit-breaker capabilities.

A representative use case is detecting duplicate primary keys in a Hive table and raising an alert when any are found.
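A primary-key duplicate check usually comes down to a `GROUP BY ... HAVING COUNT(*) > 1` query. The sketch below runs that query against an in-memory SQLite database as a stand-in for Hive; the table name `orders`, the column `order_id`, and the helper `find_duplicate_keys` are hypothetical.

```python
import sqlite3


def find_duplicate_keys(conn, table, key_column):
    """Return (key, count) pairs for keys that appear more than once."""
    # Identifiers are interpolated directly, so they must come from trusted rule config.
    sql = (
        f"SELECT {key_column}, COUNT(*) AS cnt FROM {table} "
        f"GROUP BY {key_column} HAVING COUNT(*) > 1 ORDER BY {key_column}"
    )
    return conn.execute(sql).fetchall()


# SQLite stands in for Hive here; the data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 9.5), (2, 3.0), (2, 3.0), (3, 7.25), (3, 1.0), (3, 2.0)],
)

duplicates = find_duplicate_keys(conn, "orders", "order_id")
should_alert = len(duplicates) > 0
print(duplicates, should_alert)
```

On a real Hive table the same query shape would typically be restricted to the partition being checked, so that the job scans only new data.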

Data Quality Challenges

Complex scenario requirements include:

Offline monitoring for various storage systems (Hive, ClickHouse, etc.).

High‑timeliness demands for advertising systems where a 10‑minute delay can cause massive losses.

Streaming latency monitoring in complex topologies.

Micro‑batch checks that compare consecutive periods.

Additional challenges stem from massive log volumes and limited resources.
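The micro-batch check listed above, which compares consecutive periods, can be sketched as a period-over-period comparison of a metric such as record count. The function names, the 50% drop threshold, and the sample counts below are assumptions for illustration.

```python
def period_over_period_change(previous, current):
    """Relative change of `current` versus `previous`; None if previous is 0."""
    if previous == 0:
        return None
    return (current - previous) / previous


def check_consecutive_periods(counts, max_drop=0.5):
    """Return (index, change) for periods that dropped more than `max_drop`
    relative to the period immediately before them."""
    violations = []
    for i in range(1, len(counts)):
        change = period_over_period_change(counts[i - 1], counts[i])
        if change is not None and change < -max_drop:
            violations.append((i, change))
    return violations


# Record counts for consecutive micro-batches (made-up numbers).
counts = [1000, 980, 1020, 400, 950]
violations = check_consecutive_periods(counts)
print(violations)
```

Comparing each batch only with its predecessor keeps the check cheap, at the cost of missing slow drifts that a longer baseline would catch.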

Solution Architecture

The platform consists of five main components:

Scheduler : External scheduler that triggers offline monitoring via API calls or timed jobs.

Backend : Stateless service handling API interactions, task submission, result aggregation, and alert generation.

Executor : Core task execution module built on Apache Griffin’s Measure, adapted to Spark and later to Flink for streaming.

Monitor : Independent module providing stateful alert retry and duplicate‑alert functionality.

Alert Center : External alert service that receives and forwards alarm events.

Offline Monitoring Process

Monitoring trigger: Scheduler calls Backend API.

Job submission: Backend submits Spark job to YARN in cluster mode.

Result callback: Driver syncs job outcome back to Backend.

Message trigger: Backend initiates actions such as alerts based on results.
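The four steps above can be sketched as a callback flow. This is only an illustration of the control flow: `Backend`, `Rule`, and `fake_spark_job` are hypothetical names, and the job here is a local function standing in for a Spark job submitted to YARN.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rule:
    name: str
    threshold: float  # alert when the metric exceeds this value


@dataclass
class Backend:
    """Hypothetical stand-in for the stateless Backend service."""
    submit_job: Callable  # stands in for submitting a Spark job to YARN
    alerts: list = field(default_factory=list)

    def trigger(self, rule):
        # Steps 1-2: the Scheduler calls this API and the Backend submits the job.
        self.submit_job(rule, self.on_result)

    def on_result(self, rule, metric):
        # Steps 3-4: the driver reports the result and the Backend decides whether to alert.
        if metric > rule.threshold:
            self.alerts.append(f"{rule.name}: {metric} > {rule.threshold}")


def fake_spark_job(rule, callback):
    # A real job would compute the metric on the cluster; here it is hard-coded.
    metric = 3.0
    callback(rule, metric)


backend = Backend(submit_job=fake_spark_job)
backend.trigger(Rule(name="null_ratio_pct", threshold=1.0))
backend.trigger(Rule(name="dup_count", threshold=5.0))
print(backend.alerts)
```

Because the Backend keeps no job state between the trigger and the callback beyond what the callback carries, any Backend instance can handle the result, which fits the stateless design described earlier.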

Streaming Monitoring Process

Create Flink job according to rule definitions.

Flink consumes Kafka data, computes metrics, and writes them out as time-series data.

Bosun periodically checks time‑series metrics and triggers alerts.

Backend receives alert callbacks and handles notification logic.
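To make the metric and alert steps concrete, the sketch below aggregates events into tumbling windows and then applies a threshold check, roughly the division of labor between the streaming job and the periodic checker. It is plain Python rather than Flink or Bosun code, and the event shape `(event_time, arrival_time)` in seconds is an assumption.

```python
from collections import defaultdict


def window_metrics(events, window_seconds=60):
    """Group (event_time, arrival_time) pairs into tumbling windows by event time.

    Returns {window_start: {"count": n, "max_latency": seconds}}.
    """
    metrics = defaultdict(lambda: {"count": 0, "max_latency": 0.0})
    for event_time, arrival_time in events:
        window_start = int(event_time // window_seconds) * window_seconds
        m = metrics[window_start]
        m["count"] += 1
        m["max_latency"] = max(m["max_latency"], arrival_time - event_time)
    return dict(metrics)


def windows_to_alert(metrics, max_latency_seconds):
    """Window starts whose max latency exceeds the threshold, in time order."""
    return sorted(
        w for w, m in metrics.items() if m["max_latency"] > max_latency_seconds
    )


# Made-up events: (event_time, arrival_time) in seconds.
events = [(0, 2), (30, 31), (61, 200), (119, 121), (125, 126)]
metrics = window_metrics(events)
late_windows = windows_to_alert(metrics, max_latency_seconds=60)
print(late_windows)
```

Separating metric computation from threshold evaluation means alert thresholds can be changed without restarting the streaming job.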

Executor Implementation

The Executor is a Spark application derived from Apache Griffin’s Measure module. It adapts data sources, converts data to DataFrames, translates rules into SQL, and computes results. Enhancements include HTTP‑based data source/sink, regex support, and migration of streaming monitoring from Spark to Flink for better performance.
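The rule-to-SQL step can be illustrated with a small translator. This is a sketch under assumptions: the rule schema (`type`, `table`, `column`, `max_value`) and the `METRIC_SQL` mapping are invented for illustration and are not Apache Griffin's DSL, and SQLite stands in for the Spark SQL engine.

```python
import sqlite3

# Hypothetical rule types mapped to SQL metric expressions.
METRIC_SQL = {
    "row_count": "COUNT(*)",
    "null_count": "SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END)",
    "distinct_count": "COUNT(DISTINCT {column})",
}


def rule_to_sql(rule):
    """Translate a rule dict into a single-metric SELECT statement."""
    expr = METRIC_SQL[rule["type"]].format(column=rule.get("column", ""))
    return f"SELECT {expr} AS value FROM {rule['table']}"


def evaluate(conn, rule):
    """Run the rule and report whether the metric stays within `max_value`."""
    # SUM over an empty table yields NULL, so treat a missing value as 0.
    value = conn.execute(rule_to_sql(rule)).fetchone()[0] or 0
    return {"value": value, "passed": value <= rule["max_value"]}


# Made-up table and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, None), (3, None)]
)

rule = {"type": "null_count", "table": "events", "column": "city", "max_value": 1}
result = evaluate(conn, rule)
print(rule_to_sql(rule))
print(result)
```

Keeping rules declarative and generating the SQL centrally is what allows the same rule to be executed on a different engine later.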

Monitor Implementation

Monitor provides stateful services with HA, receives events from Backend (failures, alerts), and uses an in‑memory timed queue for event‑driven processing. It was refactored to overcome bottlenecks caused by heavy MySQL polling.
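An in-memory timed queue of this kind is commonly built on a min-heap keyed by due time. The sketch below shows alert retry on top of such a queue; the class `TimedQueue`, the function `process`, the retry delay, and the attempt limit are all assumptions, and the clock is passed in explicitly so the example stays deterministic.

```python
import heapq
import itertools


class TimedQueue:
    """In-memory queue that releases events once their due time has passed."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker so events are never compared

    def schedule(self, due_at, event):
        heapq.heappush(self._heap, (due_at, next(self._seq), event))

    def pop_due(self, now):
        """Return all events with due_at <= now, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


def process(queue, now, send, retry_delay=30, max_attempts=3):
    """Send due alerts; reschedule the ones that fail until attempts run out."""
    for event in queue.pop_due(now):
        if send(event):
            continue
        event["attempts"] += 1
        if event["attempts"] < max_attempts:
            queue.schedule(now + retry_delay, event)


queue = TimedQueue()
queue.schedule(0, {"alert": "dup_keys", "attempts": 0})
sent = []


def flaky_send(event):
    # Simulated sender: fails on the first attempt, succeeds afterwards.
    if event["attempts"] == 0:
        return False
    sent.append(event["alert"])
    return True


process(queue, now=0, send=flaky_send)   # fails, rescheduled for t=30
process(queue, now=10, send=flaky_send)  # nothing is due yet
process(queue, now=30, send=flaky_send)  # retry succeeds
print(sent)
```

Only events that are due are touched on each pass, which avoids the repeated full-table polling that the refactoring set out to remove. A production service would also need to persist the queue to survive restarts, which this sketch omits.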

Best Practices

Table Row Count via HMS : Retrieve row counts directly from Hive Metastore partitions, reducing Spark job overhead and achieving sub‑second latency for ~90% of row‑count checks.

Offline Monitoring Optimizations : Trim unnecessary data collection, optimize joins, and tune Spark parameters (shuffle, vcores, memory) based on table size.

OLAP Engine Integration : Use Presto for fast data exploration and fall back to Spark for heavy queries, cutting median exploration time from 7 minutes to under 40 seconds.

Kafka Sampling & Multi‑Rule Optimization : Sample Kafka streams (e.g., 1% rate) and execute multiple rules within a single Flink slot to conserve resources.
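The last practice, sampling plus multi-rule execution, can be sketched as one pass over the stream in which each sampled message is checked against every rule. This is plain Python rather than Flink code; the hash-based sampler, the message shape, and the rule names are assumptions for illustration.

```python
import zlib


def in_sample(key, rate):
    """Deterministically keep roughly `rate` of keys, based on a CRC32 hash."""
    bucket = zlib.crc32(key.encode("utf-8")) % 10_000
    return bucket < int(rate * 10_000)


def run_rules(messages, rules, rate=0.01):
    """Evaluate every rule on each sampled message in a single pass."""
    sampled = 0
    failures = {name: 0 for name in rules}
    for msg in messages:
        if not in_sample(msg["id"], rate):
            continue
        sampled += 1
        for name, predicate in rules.items():
            if not predicate(msg):
                failures[name] += 1
    return sampled, failures


# Made-up messages and rules.
messages = [{"id": f"m{i}", "value": i} for i in range(1000)]
rules = {
    "non_negative": lambda m: m["value"] >= 0,
    "below_900": lambda m: m["value"] < 900,
}

sampled, failures = run_rules(messages, rules, rate=1.0)
print(sampled, failures)
```

Hashing the message key instead of drawing random numbers keeps the sample stable across restarts, and sharing one pass between rules means the stream is read and deserialized once regardless of how many rules are attached to the topic.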

Future Evolution

Unified Engine : Consolidate Spark, Flink, and Presto into a single runtime to lower operational cost.

Intelligence : Apply ML for automated threshold selection and smart alerting based on data patterns.

Convenience : Extend OLAP usage beyond exploration to quality detection and data comparison.

Optimization : Combine rule generation with data exploration for a WYSIWYG experience.

Q&A

Q: How do you handle root-cause analysis for data-quality issues? A: We collaborate with algorithm teams to drill down through data lineage, tracing problematic fields back through upstream tables, though a fully automated solution is still in progress.

Q: Who is responsible for fixing data‑quality problems and how is quality measured? A: Ownership lies with the data producers; we track metrics such as alarm volume and core alarm rates.

Q: How do you ensure end-to-end data consistency? A: It requires comprehensive monitoring across all pipeline stages. No single tool can guarantee it, but layered checks help reduce investigation cost.
