How ByteDance’s DataLeap Solves Complex Data Quality Challenges at Scale
This article explains how ByteDance’s DataLeap platform tackles diverse data quality challenges across batch and streaming pipelines by defining quality dimensions, outlining a modular architecture, and sharing best‑practice optimizations for Spark, Flink and Presto‑based monitoring.
What Is Data Quality Management?
Data quality is the degree to which data meets a set of inherent characteristic requirements (quality dimensions). The industry commonly defines six dimensions:
Completeness (integrity): Whether records and fields are present without missing values.
Accuracy: Whether the recorded information is correct and free of anomalies.
Consistency: Whether the same metric yields identical results across different storage locations and systems.
Timeliness: Whether data is produced quickly enough to retain its value.
Conformity: Whether data follows prescribed formats, such as email, IP, or phone number validation.
Uniqueness: Whether data is free of duplicate records.
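The conformity dimension is the most mechanical of the six: it reduces to matching values against prescribed patterns. A minimal sketch, assuming hypothetical rule names and patterns (not DataLeap's actual rule set):

```python
import re

# Hypothetical conformity rules mapping a field type to its validation
# pattern; the email/IPv4 patterns here are simplified illustrations.
CONFORMITY_RULES = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "ipv4": re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"),
}

def check_conformity(field_type: str, value: str) -> bool:
    """Return True if the value matches the prescribed format."""
    pattern = CONFORMITY_RULES[field_type]
    return pattern.fullmatch(value) is not None
```

In a real pipeline this predicate would run per row (or be pushed down as a SQL `RLIKE` filter) and the violation rate compared against an alert threshold.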
Platform Functions
The DataLeap platform provides three core capabilities:
Data Exploration – inspect data details and value distributions across multiple dimensions.
Data Comparison – detect inconsistencies between online and test tables.
Task Monitoring – monitor online data with alerting and circuit-breaker features.
System Architecture
The architecture consists of five modules:
Scheduler: an external scheduler that triggers offline monitoring via API calls or timed jobs.
Backend: a stateless service that handles API requests, task submission, result aggregation, and alert dispatch.
Executor: a Spark-based engine (adapted from Apache Griffin Measure) that adapts data sources, translates rules to SQL, and computes results.
Monitor: a stateful service that receives Backend events, provides retry and duplicate-alert suppression, and uses an in-memory queue for event timing.
Alert Center: an external alerting service that receives and forwards alarm events.
Offline Data Detection Process
1. Monitoring trigger: the scheduler calls the Backend API.
2. Job submission: the Backend submits a Spark job to YARN in cluster mode.
3. Result callback: the Spark driver reports success or failure back to the Backend.
4. Message trigger: the Backend initiates follow-up actions such as alerts.
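The steps above can be sketched as a single control flow. This is an illustrative simulation only; the class and method names (`Backend`, `submit_job`, `on_result`) are invented, and the real system submits to YARN asynchronously rather than calling methods in-process:

```python
from dataclasses import dataclass, field

@dataclass
class Backend:
    """Toy stand-in for the stateless Backend service."""
    alerts: list = field(default_factory=list)

    def trigger(self, rule_id: str) -> None:
        # Job submission: in production this would spark-submit to YARN.
        passed = self.submit_job(rule_id)
        # Result callback: the driver reports the outcome back.
        self.on_result(rule_id, passed)

    def submit_job(self, rule_id: str) -> bool:
        # Simulated quality check: one rule is hard-coded to fail.
        return rule_id != "bad_rule"

    def on_result(self, rule_id: str, passed: bool) -> None:
        # Message trigger: dispatch an alert when the check fails.
        if not passed:
            self.alerts.append(rule_id)

backend = Backend()
backend.trigger("row_count_check")  # passes, no alert
backend.trigger("bad_rule")         # fails, alert dispatched
```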
Executor Implementation
The Executor is a customized Spark application based on Apache Griffin Measure. It adapts various data sources, converts rules to SQL, and supports extensions like regular‑expression validation. For streaming scenarios, the engine was switched from Spark to Flink to improve latency and resource usage.
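The "rules to SQL" translation can be pictured with a small sketch. The rule schema and generated queries below are assumptions loosely modeled on Griffin-style declarative rules, not DataLeap's actual implementation:

```python
# Hypothetical rule dicts -> SQL strings; a real executor would also
# bind data-source adapters and submit the query via Spark SQL.

def rule_to_sql(rule: dict) -> str:
    table = rule["table"]
    if rule["type"] == "null_rate":
        col = rule["column"]
        return (f"SELECT SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) "
                f"/ COUNT(*) AS null_rate FROM {table}")
    if rule["type"] == "regex_match":
        col, pattern = rule["column"], rule["pattern"]
        return (f"SELECT COUNT(*) AS violations FROM {table} "
                f"WHERE {col} NOT RLIKE '{pattern}'")
    raise ValueError(f"unsupported rule type: {rule['type']}")
```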
Monitor Implementation
Monitor handles failure retries and duplicate-alert suppression. Initially it polled MySQL for already-alerted instances, but scaling issues led to a redesign: a stateful service with master-slave high availability and an in-memory timed queue for event-driven processing.
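The in-memory timed queue with duplicate suppression can be sketched as follows. This is a minimal illustration under assumed semantics (a fixed dedup window keyed by event id), not the Monitor's actual data structures:

```python
import heapq

class AlertQueue:
    """Timed event queue that suppresses duplicate alerts within a window."""

    def __init__(self, dedup_window: float = 60.0):
        self._heap = []        # min-heap of (fire_at, event_id)
        self._last_fired = {}  # event_id -> timestamp of last fired alert
        self._window = dedup_window

    def schedule(self, event_id: str, delay: float, now: float) -> None:
        heapq.heappush(self._heap, (now + delay, event_id))

    def poll(self, now: float) -> list:
        """Pop all due events; drop duplicates fired within the window."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            _, event_id = heapq.heappop(self._heap)
            last = self._last_fired.get(event_id)
            if last is None or now - last >= self._window:
                self._last_fired[event_id] = now
                fired.append(event_id)
        return fired
```

Compared with polling MySQL, an event-driven queue like this fires alerts as soon as they are due and keeps dedup state in memory, at the cost of needing the HA master-slave setup the article mentions.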
Best Practices
Table Row Count via HMS
Instead of running COUNT(*) in Spark, the platform reads row counts directly from Hive Metastore partition statistics, cutting resource consumption and achieving sub-second latency for roughly 90% of row-count checks.
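Hive Metastore partitions carry a `numRows` entry in their statistics parameters, so a row-count check becomes a metadata sum. A sketch with the metastore response simulated as plain dicts (a real client would fetch partitions over Thrift):

```python
def table_row_count(partitions: list) -> int:
    """Sum numRows across partition stats; -1 signals missing stats."""
    total = 0
    for part in partitions:
        num_rows = part.get("parameters", {}).get("numRows")
        if num_rows is None:
            # Stats absent for this partition: the caller must fall
            # back to an actual COUNT(*) query.
            return -1
        total += int(num_rows)
    return total

# Simulated metastore partition listing for a two-day table.
partitions = [
    {"name": "date=20230101", "parameters": {"numRows": "1200"}},
    {"name": "date=20230102", "parameters": {"numRows": "900"}},
]
```

The fallback path matters: `numRows` is only populated when table statistics have been gathered, which is presumably why the article reports sub-second latency for ~90% of checks rather than all of them.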
Offline Monitoring Optimizations
Trim unnecessary anomaly-collection and join operations in Griffin Measure.
Tune Spark parameters (shuffle partitions, vcores, memory) based on data size and monitoring type.
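Sizing Spark jobs by input volume can be expressed as a simple lookup. The thresholds and values below are invented for illustration; ByteDance's actual tuning table is not published:

```python
def spark_conf_for(input_bytes: int) -> dict:
    """Pick illustrative Spark settings from the input size in bytes."""
    gb = input_bytes / (1024 ** 3)
    if gb < 10:       # small tables: minimize startup overhead
        return {"spark.executor.instances": 4,
                "spark.executor.memory": "4g",
                "spark.sql.shuffle.partitions": 64}
    if gb < 100:      # medium tables: scale out moderately
        return {"spark.executor.instances": 16,
                "spark.executor.memory": "8g",
                "spark.sql.shuffle.partitions": 256}
    return {"spark.executor.instances": 64,   # large tables
            "spark.executor.memory": "16g",
            "spark.sql.shuffle.partitions": 1024}
```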
Introducing OLAP Engine
Presto serves low-latency data exploration; heavy queries fall back to Spark. Median exploration time dropped from 7 minutes to under 40 seconds.
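The routing logic (try the OLAP engine, degrade to Spark) can be sketched with the engine clients as stand-in callables; the size threshold and function names are assumptions for the sketch:

```python
def run_exploration(sql: str, presto_run, spark_run,
                    size_limit_gb: float, scan_size_gb: float):
    """Route to Presto when the estimated scan is small; else use Spark."""
    if scan_size_gb <= size_limit_gb:
        try:
            return ("presto", presto_run(sql))
        except Exception:
            # Query too heavy for the OLAP engine or it failed:
            # degrade to the slower but more robust Spark path.
            pass
    return ("spark", spark_run(sql))
```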
Streaming Monitoring Sampling & Multi‑Rule Optimization
The Kafka connector supports sampling via offset manipulation (e.g., 1% sampling).
Multiple rules share a single Flink slot, reducing resource usage compared with one task per rule.
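Offset-based sampling means the consumer seeks ahead rather than reading every message. A sketch of the offset arithmetic, with Kafka/Flink specifics left out (a real connector would call the consumer's seek API per partition):

```python
def sample_offsets(start: int, end: int, rate: float) -> list:
    """Offsets to consume from [start, end) at the given sampling rate."""
    # 1% sampling -> consume every 100th offset and skip the rest.
    step = max(1, round(1 / rate))
    return list(range(start, end, step))
```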
Future Evolution
Unified Engine: consolidate Spark, Flink, and Presto into a single runtime for both batch and streaming.
Intelligence: apply machine-learning algorithms for adaptive thresholds and smart alerting.
Convenience: extend OLAP usage beyond exploration to quality detection and data comparison.
Optimization: combine rule generation with data exploration for a "what-you-see-is-what-you-get" experience.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.