How Volcengine Solves Big Data Quality Challenges with a Unified Stream‑Batch Platform
Volcengine's Data Quality Platform reconciles thorough data validation with the heavy computation that validation requires at scale. It offers unified stream-batch monitoring, data exploration, data comparison, and alerting across Hive, ClickHouse, Kafka, and other systems, while addressing scalability, latency, and resource-optimization challenges.
What Is Data Quality
Data quality refers to the degree to which data meets a set of inherent characteristics (quality dimensions). The industry typically defines six dimensions:
Completeness: Whether records and fields are complete without missing values.
Accuracy: Whether the information in records is correct and free of errors.
Consistency: Whether the same metric yields identical results across different locations.
Timeliness: Whether data is produced quickly enough to be valuable.
Conformity: Whether data follows required rules such as email or IP format validation.
Uniqueness: Whether there are duplicate values in fields.
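Three of these dimensions (completeness, uniqueness, conformity) reduce to simple per-field measurements. The following is a minimal sketch in plain Python, not the platform's implementation; the function names, the sample rows, and the deliberately loose email pattern are all assumptions made for illustration.

```python
import re
from collections import Counter

# Deliberately loose email pattern, used only for illustration.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows where `field` is present and non-empty."""
    if not rows:
        return 1.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def duplicate_values(rows, field):
    """Values of `field` that occur more than once (uniqueness check)."""
    counts = Counter(r.get(field) for r in rows)
    return sorted(v for v, n in counts.items() if n > 1 and v is not None)


def conformity(rows, field, pattern):
    """Share of non-empty values of `field` that match `pattern`."""
    values = [r[field] for r in rows if r.get(field)]
    if not values:
        return 1.0
    return sum(1 for v in values if pattern.match(v)) / len(values)


# Hypothetical sample data.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "not-an-email"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]

email_completeness = completeness(rows, "email")        # 3 of 4 rows filled
duplicate_ids = duplicate_values(rows, "id")            # id 2 appears twice
email_conformity = conformity(rows, "email", EMAIL_RE)  # 2 of 3 values match
```

Accuracy, consistency, and timeliness usually need a reference outside the table itself (a source of truth, a second copy, or an expected arrival time), so they are left out of this sketch.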
Platform Overview
Volcengine’s Data Quality Platform, refined through years of service for ByteDance products like Toutiao and Douyin, addresses complex data‑quality scenarios across multiple product lines. It reconciles the conflict between extensive data‑quality validation and high resource consumption.
Key Functions
Data Exploration: View data details and distribution across dimensions.
Data Comparison: Compare online and test tables to detect inconsistencies.
Task Monitoring: Monitor online data and provide alerting and circuit-breaker capabilities.
The platform’s most representative feature is primary‑key duplicate detection on Hive tables with alerting.
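Primary-key duplicate detection usually comes down to a GROUP BY ... HAVING COUNT(*) > 1 query. The sketch below runs that query against an in-memory SQLite table so it is self-contained; on the platform the equivalent check would run against Hive. The table name, columns, and helper functions are hypothetical.

```python
import sqlite3


def duplicate_keys_sql(table, key_columns):
    """Build a query that returns every key combination occurring more than once.

    Identifiers are interpolated directly, so this is only safe for
    trusted, internally generated table and column names.
    """
    cols = ", ".join(key_columns)
    return (
        f"SELECT {cols}, COUNT(*) AS cnt FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1 ORDER BY {cols}"
    )


def find_duplicate_keys(conn, table, key_columns):
    """Run the duplicate-key query and return the offending rows."""
    return conn.execute(duplicate_keys_sql(table, key_columns)).fetchall()


# Hypothetical table standing in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, dt TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 9.5),
        (2, "2024-01-01", 3.0),
        (2, "2024-01-01", 3.0),  # duplicate of the previous key
        (3, "2024-01-02", 7.25),
    ],
)

duplicates = find_duplicate_keys(conn, "orders", ["order_id", "dt"])
should_alert = len(duplicates) > 0  # an alert rule could fire on any duplicate
```

Each returned row carries the key columns plus the number of occurrences, which is enough for an alert message to name the offending keys.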
Data Quality Challenges
Complex scenario requirements include:
Offline monitoring for various storage systems (Hive, ClickHouse, etc.).
High‑timeliness demands for advertising systems where a 10‑minute delay can cause massive losses.
Streaming latency monitoring in complex topologies.
Micro‑batch checks that compare consecutive periods.
Additional challenges stem from massive log volumes and limited resources.
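As an illustration of the micro-batch check listed above, the sketch below compares each period's row count with the previous period and flags large swings. The 30% threshold, the function names, and the sample counts are assumptions, not values from the platform.

```python
def period_over_period_change(previous, current):
    """Relative change between two consecutive periods; None without a baseline."""
    if previous == 0:
        return None
    return (current - previous) / previous


def check_consecutive_periods(counts, max_abs_change=0.3):
    """Return (period_index, change) for every period whose change exceeds the threshold."""
    violations = []
    for i in range(1, len(counts)):
        change = period_over_period_change(counts[i - 1], counts[i])
        if change is not None and abs(change) > max_abs_change:
            violations.append((i, round(change, 4)))
    return violations


# Hypothetical row counts for five consecutive micro-batches.
counts = [1000, 1040, 990, 450, 980]
violations = check_consecutive_periods(counts)
flagged_periods = [index for index, _ in violations]
```

With these sample counts the drop into period 3 and the rebound into period 4 both exceed the threshold. A production rule would likely suppress the rebound or compare against a longer baseline, which this sketch does not attempt.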
Solution Architecture
The platform consists of five main components:
Scheduler: External scheduler that triggers offline monitoring through API calls or timed jobs.
Backend: Stateless service handling API interactions, task submission, result aggregation, and alert generation.
Executor: Core task execution module built on Apache Griffin's Measure, adapted to Spark and later to Flink for streaming.
Monitor: Independent, stateful module that handles alert retries and repeated alerting.
Alert Center: External alert service that receives and forwards alarm events.
Offline Monitoring Process
Monitoring trigger: Scheduler calls Backend API.
Job submission: Backend submits Spark job to YARN in cluster mode.
Result callback: Driver syncs job outcome back to Backend.
Message trigger: Backend initiates actions such as alerts based on results.
Streaming Monitoring Process
Create Flink job according to rule definitions.
Register Bosun alert events based on alarm conditions.
Flink consumes Kafka data, computes metrics, and writes them out as time-series metrics.
Bosun periodically checks time‑series metrics and triggers alerts.
Backend receives alert callbacks and handles notification logic.
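The split between metric computation and alert evaluation in the steps above can be sketched in plain Python. This is not Flink or Bosun code: the tumbling-window grouping stands in for the Flink job, the threshold scan stands in for the periodic alert check, and all names, window sizes, and sample events are assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count (timestamp, payload) events per fixed, non-overlapping window.

    Returns a dict mapping window start time to event count, ordered by time.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = int(ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


def windows_below_threshold(series, min_count):
    """Return the start times of windows whose count is below `min_count`."""
    return [start for start, n in series.items() if n < min_count]


# Hypothetical events: (timestamp in seconds, payload).
events = [(0, "a"), (12, "b"), (59, "c"), (61, "d"), (125, "e"), (130, "f"), (170, "g")]

series = tumbling_window_counts(events, window_seconds=60)
alerts = windows_below_threshold(series, min_count=2)
```

Keeping the alert check separate from the stream job means thresholds can change without restarting the job, which is one plausible reason for the two-stage design described above.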
Executor Implementation
The Executor is a Spark application derived from Apache Griffin’s Measure module. It adapts data sources, converts data to DataFrames, translates rules into SQL, and computes results. Enhancements include HTTP‑based data source/sink, regex support, and migration of streaming monitoring from Spark to Flink for better performance.
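The "translate rules into SQL" step can be pictured as a small mapping from a rule definition to a query string. The rule format below is hypothetical and far simpler than Apache Griffin's actual rule DSL; it only shows the general shape of the translation.

```python
def rule_to_sql(rule):
    """Translate a simple rule definition (a dict) into a SQL query string.

    Supported metrics: row_count, null_count, distinct_count.
    Identifiers are interpolated directly, so rules must come from a trusted source.
    """
    metric = rule["metric"]
    column = rule.get("column", "")
    if metric == "row_count":
        expr = "COUNT(*)"
    elif metric == "null_count":
        expr = f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END)"
    elif metric == "distinct_count":
        expr = f"COUNT(DISTINCT {column})"
    else:
        raise ValueError(f"unsupported metric: {metric}")

    sql = f"SELECT {expr} AS value FROM {rule['table']}"
    if rule.get("partition"):
        sql += f" WHERE {rule['partition']}"
    return sql


# Hypothetical rule: count NULL user_id values in one partition.
rule = {
    "table": "dw.events",
    "metric": "null_count",
    "column": "user_id",
    "partition": "dt = '2024-01-01'",
}
sql = rule_to_sql(rule)
```

Once a rule is reduced to a query that returns a single value, the same threshold and alerting logic can be reused for every metric type.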
Monitor Implementation
Monitor provides stateful services with HA, receives events from Backend (failures, alerts), and uses an in‑memory timed queue for event‑driven processing. It was refactored to overcome bottlenecks caused by heavy MySQL polling.
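An in-memory timed queue of this kind can be built on a binary heap keyed by due time. The sketch below shows one way to schedule alert retries that way; the class, the retry policy, and the event fields are assumptions, and it does not cover the HA or persistence aspects mentioned above.

```python
import heapq
import itertools


class TimedEventQueue:
    """Min-heap of events ordered by the time they become due."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for events with equal due times

    def schedule(self, due_at, event):
        heapq.heappush(self._heap, (due_at, next(self._seq), event))

    def pop_due(self, now):
        """Remove and return every event whose due time is at or before `now`."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, event = heapq.heappop(self._heap)
            due.append(event)
        return due


def handle_alert(queue, event, now, send, retry_delay=60, max_attempts=3):
    """Try to send an alert; on failure, reschedule it until max_attempts is reached."""
    if send(event["name"]):
        return "sent"
    attempts = event["attempts"] + 1
    if attempts >= max_attempts:
        return "gave_up"
    queue.schedule(now + retry_delay, {**event, "attempts": attempts})
    return "retry_scheduled"


# Simulated run: the first delivery fails, the retry 60 seconds later succeeds.
queue = TimedEventQueue()
queue.schedule(0, {"name": "pk_duplicate", "attempts": 0})
delivery_results = iter([False, True])
outcomes = []
for now in (0, 30, 60):
    for event in queue.pop_due(now):
        outcome = handle_alert(queue, event, now, lambda name: next(delivery_results))
        outcomes.append((now, outcome))
```

Because due events are popped from the head of the heap, the service only touches events that are ready, rather than repeatedly scanning a database table for pending work.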
Best Practices
Table Row Count via HMS: Read row counts directly from Hive Metastore (HMS) partition metadata instead of launching a Spark job, achieving sub-second latency for roughly 90% of row-count checks.
Offline Monitoring Optimizations: Trim unnecessary data collection, optimize joins, and tune Spark parameters (shuffle, vcores, memory) based on table size.
OLAP Engine Integration: Use Presto for fast data exploration and fall back to Spark for heavy queries, cutting median exploration time from 7 minutes to under 40 seconds.
Kafka Sampling & Multi-Rule Optimization: Sample Kafka streams (e.g., at a 1% rate) and execute multiple rules within a single Flink slot to conserve resources.
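The sampling and multi-rule idea can be sketched as a single pass over the stream in which each sampled record is checked against every rule. The source does not say how sampling is implemented; the deterministic hash-based sampler below is an assumption, as are the rule names and record fields, and the code is plain Python rather than a Flink job.

```python
import zlib


def in_sample(key, rate):
    """Deterministic sampling: keep a key when its hash bucket falls below `rate`.

    The same key always gets the same decision, so repeated runs are comparable.
    """
    bucket = zlib.crc32(key.encode("utf-8")) % 10_000
    return bucket < int(rate * 10_000)


def evaluate_rules(records, rules, rate):
    """Make one pass over the records, applying every rule to each sampled record.

    Returns the number of sampled records and a violation count per rule.
    """
    sampled = 0
    violations = {name: 0 for name in rules}
    for record in records:
        if not in_sample(record["key"], rate):
            continue
        sampled += 1
        for name, is_valid in rules.items():
            if not is_valid(record):
                violations[name] += 1
    return sampled, violations


# Hypothetical rules and records.
rules = {
    "has_user_id": lambda r: bool(r.get("user_id")),
    "non_negative_amount": lambda r: r["amount"] >= 0,
}
records = [
    {"key": "k1", "user_id": "u1", "amount": 5},
    {"key": "k2", "user_id": "", "amount": -1},
    {"key": "k3", "user_id": "u3", "amount": 2},
]

# rate=1.0 keeps every record; a production job might use 0.01 as in the example above.
sampled, violations = evaluate_rules(records, rules, rate=1.0)
```

Sharing one pass across rules means the stream is deserialized once regardless of how many rules are attached, which is where most of the resource saving would come from. Violation counts from a sampled stream are estimates and need scaling by the sampling rate before being compared with absolute thresholds.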
Future Evolution
Unified Engine: Consolidate Spark, Flink, and Presto into a single runtime to lower operational cost.
Intelligence: Apply ML for automated threshold selection and smart alerting based on data patterns.
Convenience: Extend OLAP usage beyond exploration to quality detection and data comparison.
Optimization: Combine rule generation with data exploration for a WYSIWYG experience.
Q&A
Q: How do you handle root-cause analysis for data-quality issues? A: We collaborate with algorithm teams to drill down through data lineage, tracing problematic fields back through upstream tables, though a fully automated solution is still in progress.
Q: Who is responsible for fixing data‑quality problems and how is quality measured? A: Ownership lies with the data producers; we track metrics such as alarm volume and core alarm rates.
Q: How do you ensure end-to-end data consistency? A: It requires comprehensive monitoring across all pipeline stages. No single tool can guarantee it, but layered checks help reduce investigation cost.
Volcano Engine Developer Services
