How ByteDance’s DataLeap Solves Complex Data Quality Challenges at Scale
This article explains how ByteDance’s DataLeap platform tackles diverse data quality challenges across batch and streaming pipelines by defining quality dimensions, outlining a modular architecture, and sharing best‑practice optimizations for Spark, Flink and Presto‑based monitoring.
What Is Data Quality Management?
Data quality is the degree to which data meets a set of inherent characteristic requirements (quality dimensions). The industry commonly defines six dimensions:
Completeness (integrity): Whether records and fields are present without missing values.
Accuracy: Whether the recorded information is correct and free of anomalies.
Consistency: Whether the same metric yields identical results across different storage locations and systems.
Timeliness: Whether data is produced quickly enough to retain its value.
Conformity: Whether data follows prescribed formats, such as email, IP, or phone number validation.
Uniqueness: Whether data is free of duplicate records.
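The conformity dimension is the most mechanical of the six: it reduces to matching values against prescribed patterns. A minimal sketch, assuming hypothetical rule names and patterns (not DataLeap's actual rule set):

```python
import re

# Hypothetical conformity rules mapping a field type to its validation
# pattern; the email/IPv4 patterns here are simplified illustrations.
CONFORMITY_RULES = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "ipv4": re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"),
}

def check_conformity(field_type: str, value: str) -> bool:
    """Return True if the value matches the prescribed format."""
    pattern = CONFORMITY_RULES[field_type]
    return pattern.fullmatch(value) is not None
```

In a real pipeline this predicate would run per row (or be pushed down as a SQL `RLIKE` filter) and the violation rate compared against an alert threshold.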
Platform Functions
The DataLeap platform provides three core capabilities:
Data Exploration – inspect data details and value distributions across multiple dimensions.
Data Comparison – detect inconsistencies between online and test tables.
Task Monitoring – monitor online data with alerting and circuit-breaker features.
System Architecture
The architecture consists of five modules:
Scheduler: an external scheduler that triggers offline monitoring via API calls or timed jobs.
Backend: a stateless service that handles API requests, task submission, result aggregation, and alert dispatch.
Executor: a Spark-based engine (adapted from Apache Griffin Measure) that adapts data sources, translates rules to SQL, and computes results.
Monitor: a stateful service that receives Backend events, provides retry and duplicate-alert suppression, and uses an in-memory queue for event timing.
Alert Center: an external alerting service that receives and forwards alarm events.
Offline Data Detection Process
1. Monitoring trigger: the scheduler calls the Backend API.
2. Job submission: the Backend submits a Spark job to YARN in cluster mode.
3. Result callback: the Spark driver reports success or failure back to the Backend.
4. Message trigger: the Backend initiates follow-up actions such as alerts.
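The steps above can be sketched as a single control flow. This is an illustrative simulation only; the class and method names (`Backend`, `submit_job`, `on_result`) are invented, and the real system submits to YARN asynchronously rather than calling methods in-process:

```python
from dataclasses import dataclass, field

@dataclass
class Backend:
    """Toy stand-in for the stateless Backend service."""
    alerts: list = field(default_factory=list)

    def trigger(self, rule_id: str) -> None:
        # Job submission: in production this would spark-submit to YARN.
        passed = self.submit_job(rule_id)
        # Result callback: the driver reports the outcome back.
        self.on_result(rule_id, passed)

    def submit_job(self, rule_id: str) -> bool:
        # Simulated quality check: one rule is hard-coded to fail.
        return rule_id != "bad_rule"

    def on_result(self, rule_id: str, passed: bool) -> None:
        # Message trigger: dispatch an alert when the check fails.
        if not passed:
            self.alerts.append(rule_id)

backend = Backend()
backend.trigger("row_count_check")  # passes, no alert
backend.trigger("bad_rule")         # fails, alert dispatched
```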
Executor Implementation
The Executor is a customized Spark application based on Apache Griffin Measure. It adapts various data sources, converts rules to SQL, and supports extensions like regular‑expression validation. For streaming scenarios, the engine was switched from Spark to Flink to improve latency and resource usage.
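The "rules to SQL" translation can be pictured with a small sketch. The rule schema and generated queries below are assumptions loosely modeled on Griffin-style declarative rules, not DataLeap's actual implementation:

```python
# Hypothetical rule dicts -> SQL strings; a real executor would also
# bind data-source adapters and submit the query via Spark SQL.

def rule_to_sql(rule: dict) -> str:
    table = rule["table"]
    if rule["type"] == "null_rate":
        col = rule["column"]
        return (f"SELECT SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) "
                f"/ COUNT(*) AS null_rate FROM {table}")
    if rule["type"] == "regex_match":
        col, pattern = rule["column"], rule["pattern"]
        return (f"SELECT COUNT(*) AS violations FROM {table} "
                f"WHERE {col} NOT RLIKE '{pattern}'")
    raise ValueError(f"unsupported rule type: {rule['type']}")
```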
Monitor Implementation
Monitor handles failure retries and duplicate-alert suppression. Initially it polled MySQL for already-alerted instances, but scaling issues led to a redesign: a stateful service with master-slave high availability and an in-memory timed queue for event-driven processing.
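The in-memory timed queue with duplicate suppression can be sketched as follows. This is a minimal illustration under assumed semantics (a fixed dedup window keyed by event id), not the Monitor's actual data structures:

```python
import heapq

class AlertQueue:
    """Timed event queue that suppresses duplicate alerts within a window."""

    def __init__(self, dedup_window: float = 60.0):
        self._heap = []        # min-heap of (fire_at, event_id)
        self._last_fired = {}  # event_id -> timestamp of last fired alert
        self._window = dedup_window

    def schedule(self, event_id: str, delay: float, now: float) -> None:
        heapq.heappush(self._heap, (now + delay, event_id))

    def poll(self, now: float) -> list:
        """Pop all due events; drop duplicates fired within the window."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            _, event_id = heapq.heappop(self._heap)
            last = self._last_fired.get(event_id)
            if last is None or now - last >= self._window:
                self._last_fired[event_id] = now
                fired.append(event_id)
        return fired
```

Compared with polling MySQL, an event-driven queue like this fires alerts as soon as they are due and keeps dedup state in memory, at the cost of needing the HA master-slave setup the article mentions.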
Best Practices
Table Row Count via HMS
Instead of running COUNT(*) in Spark, the platform reads row counts directly from Hive Metastore partition statistics, cutting resource consumption and achieving sub-second latency for roughly 90% of row-count checks.
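Hive Metastore partitions carry a `numRows` entry in their statistics parameters, so a row-count check becomes a metadata sum. A sketch with the metastore response simulated as plain dicts (a real client would fetch partitions over Thrift):

```python
def table_row_count(partitions: list) -> int:
    """Sum numRows across partition stats; -1 signals missing stats."""
    total = 0
    for part in partitions:
        num_rows = part.get("parameters", {}).get("numRows")
        if num_rows is None:
            # Stats absent for this partition: the caller must fall
            # back to an actual COUNT(*) query.
            return -1
        total += int(num_rows)
    return total

# Simulated metastore partition listing for a two-day table.
partitions = [
    {"name": "date=20230101", "parameters": {"numRows": "1200"}},
    {"name": "date=20230102", "parameters": {"numRows": "900"}},
]
```

The fallback path matters: `numRows` is only populated when table statistics have been gathered, which is presumably why the article reports sub-second latency for ~90% of checks rather than all of them.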
Offline Monitoring Optimizations
Trim unnecessary anomaly-collection and join operations in Griffin Measure.
Tune Spark parameters (shuffle partitions, vcores, memory) based on data size and monitoring type.
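Sizing Spark jobs by input volume can be expressed as a simple lookup. The thresholds and values below are invented for illustration; ByteDance's actual tuning table is not published:

```python
def spark_conf_for(input_bytes: int) -> dict:
    """Pick illustrative Spark settings from the input size in bytes."""
    gb = input_bytes / (1024 ** 3)
    if gb < 10:       # small tables: minimize startup overhead
        return {"spark.executor.instances": 4,
                "spark.executor.memory": "4g",
                "spark.sql.shuffle.partitions": 64}
    if gb < 100:      # medium tables: scale out moderately
        return {"spark.executor.instances": 16,
                "spark.executor.memory": "8g",
                "spark.sql.shuffle.partitions": 256}
    return {"spark.executor.instances": 64,   # large tables
            "spark.executor.memory": "16g",
            "spark.sql.shuffle.partitions": 1024}
```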
Introducing OLAP Engine
Presto serves low-latency data exploration; heavy queries fall back to Spark. Median exploration time dropped from 7 minutes to under 40 seconds.
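The routing logic (try the OLAP engine, degrade to Spark) can be sketched with the engine clients as stand-in callables; the size threshold and function names are assumptions for the sketch:

```python
def run_exploration(sql: str, presto_run, spark_run,
                    size_limit_gb: float, scan_size_gb: float):
    """Route to Presto when the estimated scan is small; else use Spark."""
    if scan_size_gb <= size_limit_gb:
        try:
            return ("presto", presto_run(sql))
        except Exception:
            # Query too heavy for the OLAP engine or it failed:
            # degrade to the slower but more robust Spark path.
            pass
    return ("spark", spark_run(sql))
```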
Streaming Monitoring Sampling & Multi‑Rule Optimization
The Kafka connector supports sampling via offset manipulation (e.g., 1% sampling).
Multiple rules share a single Flink slot, reducing resource usage compared with one task per rule.
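Offset-based sampling means the consumer seeks ahead rather than reading every message. A sketch of the offset arithmetic, with Kafka/Flink specifics left out (a real connector would call the consumer's seek API per partition):

```python
def sample_offsets(start: int, end: int, rate: float) -> list:
    """Offsets to consume from [start, end) at the given sampling rate."""
    # 1% sampling -> consume every 100th offset and skip the rest.
    step = max(1, round(1 / rate))
    return list(range(start, end, step))
```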
Future Evolution
Unified Engine: consolidate Spark, Flink, and Presto into a single runtime for both batch and streaming.
Intelligence: apply machine-learning algorithms for adaptive thresholds and smart alerting.
Convenience: extend OLAP usage beyond exploration to quality detection and data comparison.
Optimization: combine rule generation with data exploration for a "what-you-see-is-what-you-get" experience.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.