
ByteDance's Practices for Tracking Data Governance and Pipeline Management

This article explains ByteDance's end‑to‑end tracking data lifecycle management, including pre‑report validation, the rationale for using BMQ over Kafka, quality governance examples, and how Flink‑based pipelines ensure data accuracy through SLA monitoring and checkpoint strategies.

DataFunTalk

Q1: Business teams cannot find their tracking data or see mismatched UV compared to business systems; how does ByteDance handle such issues and what is the troubleshooting approach?

A1: ByteDance manages the entire lifecycle of a tracking point, starting at design, and enforces a register-first, report-second control process. Before a tracking point goes live, the platform's validation tool checks that events are actually reported and that their fields match the registered metadata, which greatly reduces missing reports and quality problems.
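To make the field-matching step concrete, here is a minimal sketch of a pre-report check. It is an illustration only, not ByteDance's validation tool: the registry layout, the `video_play` event, its fields, and the `validate_event` function are all hypothetical names chosen for this example.

```python
# Hypothetical registry: event name -> {field name: expected Python type}.
# A real platform would load this from its tracking-point metadata store.
REGISTERED_EVENTS = {
    "video_play": {"user_id": str, "video_id": str, "duration_ms": int},
}


def validate_event(name, payload, registry=REGISTERED_EVENTS):
    """Return a list of problems; an empty list means the event passes."""
    schema = registry.get(name)
    if schema is None:
        # Register-first rule: unregistered events are rejected outright.
        return [f"event '{name}' is not registered"]

    problems = []
    # Every registered field must be present with the registered type.
    for field, expected in schema.items():
        if field not in payload:
            problems.append(f"missing field '{field}'")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(
                f"field '{field}' expected {expected.__name__}, got {actual}"
            )
    # Fields that were never registered are flagged as well.
    for field in payload:
        if field not in schema:
            problems.append(f"unregistered field '{field}'")
    return problems


if __name__ == "__main__":
    event = {"user_id": "u1", "duration_ms": "1200", "extra": 1}
    for problem in validate_event("video_play", event):
        print(problem)
```

Running a check like this in the developer's test loop, before any event reaches production, is what lets problems surface at registration time rather than in downstream reports.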

Q2: Why use BMQ instead of Kafka for the intermediate queue?

A2: The shift from Kafka to BMQ reflects a preference for MQs that separate storage from compute, as Apache Pulsar does. Such designs scale better, cost less, and avoid the performance degradation caused by data replication, so they fit cloud‑native architectures more closely.

Q3: Any good examples of tracking data quality governance?

A3: Most efforts focus on pre‑report quality control: after registering a tracking point, developers use validation tools for automated testing to ensure accuracy. Post‑report, offline tools monitor quality, but pre‑report checks resolve the majority of issues.

Q4: How is tracking data accuracy ensured, and what level of guarantee is provided during processing?

A4: The processing chain is built on Flink and typically runs without checkpoints, because downstream consumers cannot tolerate duplicate or delayed data. Instead, SLA metrics track loss and duplication rates across the pipeline, and these metrics are how data accuracy is measured and guaranteed.
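As a rough illustration of the two SLA metrics named above, the sketch below computes a loss rate and a duplication rate by comparing event IDs seen at the pipeline's entry with those seen at its exit. The talk does not give the exact formulas, so the definitions here are assumptions: loss rate is the share of distinct source IDs that never reach the sink, and duplication rate is the share of sink records that are extra copies. The function name `pipeline_sla` is hypothetical.

```python
from collections import Counter


def pipeline_sla(source_ids, sink_ids):
    """Compute assumed loss and duplication rates from two ID streams."""
    source = set(source_ids)
    sink_list = list(sink_ids)
    sink_counts = Counter(sink_list)

    # Lost: IDs emitted at the source that never appear at the sink.
    lost = source - set(sink_counts)
    # Duplicates: every copy of an ID at the sink beyond the first.
    duplicates = sum(count - 1 for count in sink_counts.values())

    loss_rate = len(lost) / len(source) if source else 0.0
    duplication_rate = duplicates / len(sink_list) if sink_list else 0.0
    return {"loss_rate": loss_rate, "duplication_rate": duplication_rate}


if __name__ == "__main__":
    # "c" is lost and "b" arrives twice.
    print(pipeline_sla(["a", "b", "c", "d"], ["a", "b", "b", "d"]))
```

In practice such counts would be gathered per time window and per pipeline stage, so that an alert can point to the stage where loss or duplication was introduced.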

刘石伟 ByteDance Tracking Data Flow Technical Lead | Guest Speaker

"ByteDance Tracking Data Flow Construction and Governance Practice" | Source

