Big Data 16 min read

How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

This article explains how ByteDance’s event‑tracking (埋点) data flow handles billions of events per second using Flink‑based real‑time ETL, dynamic rule engines, data sharding, and multi‑datacenter disaster‑recovery to ensure stability, low latency, and cost‑effective processing for diverse downstream services.

Volcano Engine Developer Services
Volcano Engine Developer Services
Volcano Engine Developer Services
How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

Event Tracking Data Flow

Event tracking, also known as Event Tracking , is the bridge between data and business, forming the foundation for analytics, recommendation, and operations.

User actions from apps, mini‑programs, and web applications are collected via event tracking and can be categorized as client‑side, web‑side, or server‑side tracking.

Processing Pipeline

Collected events are sent to a message queue (MQ), then processed by a series of Flink real‑time ETL jobs that perform data standardization, cleaning, field enrichment, and real‑time anti‑fraud checks before being distributed to downstream systems such as recommendation, advertising, A/B testing, behavior analysis, real‑time and offline data warehouses. Stability is the top priority for this upstream data flow.

Scale at ByteDance

Multiple business lines (Douyin, Toutiao, Xigua Video, etc.) feed into the pipeline.

Peak traffic exceeds 100 million events per second , processing over a trillion events daily with petabyte‑scale storage increments.

More than 1,000 Flink jobs and 1,000 MQ topics are deployed across data centers, consuming over 500,000 CPU cores, with single tasks using up to 120,000 cores and a single MQ topic having up to 10,000 partitions.

Key Business Scenarios

UserAction ETL

In recommendation, only a subset of events is needed, requiring timely processing and dynamic ETL rule updates. The pipeline filters, maps, and standardizes events, labeling them with action types to produce UserAction data used for training recommendation models.

Data Sharding

To avoid redundant consumption of the full event stream by each downstream service, a sharding service consumes the upstream topic, applies routing rules, and forwards relevant events to smaller downstream topics, reducing resource usage and MQ bandwidth.

Disaster Recovery and Degradation

Multi‑datacenter deployment ensures continuity when a single data center fails, with rapid traffic switching. Degradation mechanisms handle traffic spikes (e.g., Spring Festival gala, e‑commerce promotions) by throttling or pausing event reporting to maintain stability.

Challenges

High traffic and diverse downstream demands create stability, cost, and performance challenges, requiring solutions that balance reliability with efficiency.

Practical Practices

ETL Pipeline Evolution

Before 2018: Python‑based PyJStorm with a flexible rule engine; stability and performance issues emerged as event volume grew.

2018‑2020: Migration to PyFlink, then Java Flink with a Groovy rule engine and Protobuf, achieving several‑fold performance gains.

2021‑present: Optimized engine performance using Janino (10× faster), established comprehensive event governance, and introduced tiered pipelines with different SLAs.

Rule Engine Development

The rule engine evolved from Python scripts (easy dynamic loading but slow) to Groovy (better performance) and finally to Janino, which compiles Java code on the fly, delivering a ten‑fold speed increase and supporting dynamic schema and routing updates without restarting Flink jobs.

Flink Task Splitting

Large Flink jobs are split into multiple sub‑tasks, each consuming a proportion of upstream partitions and writing to downstream topics, enabling gray‑scale releases, reducing restart impact, and allowing flexible resource allocation.

Disaster Recovery & Degradation

Multi‑datacenter deployment with synchronized MQ and Flink jobs enables minute‑level failover. Degradation strategies include server‑side rate limiting, client‑side back‑off, and selective reporting based on event priority, ensuring critical events remain unaffected during traffic surges.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big Datadata pipelineFlinkScalabilityevent trackingreal-time ETL
Volcano Engine Developer Services
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.