How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline
This article explains how ByteDance’s event‑tracking (埋点) data flow handles billions of events per second using Flink‑based real‑time ETL, dynamic rule engines, data sharding, and multi‑datacenter disaster‑recovery to ensure stability, low latency, and cost‑effective processing for diverse downstream services.
Event Tracking Data Flow
Event tracking, also known as Event Tracking , is the bridge between data and business, forming the foundation for analytics, recommendation, and operations.
User actions from apps, mini‑programs, and web applications are collected via event tracking and can be categorized as client‑side, web‑side, or server‑side tracking.
Processing Pipeline
Collected events are sent to a message queue (MQ), then processed by a series of Flink real‑time ETL jobs that perform data standardization, cleaning, field enrichment, and real‑time anti‑fraud checks before being distributed to downstream systems such as recommendation, advertising, A/B testing, behavior analysis, real‑time and offline data warehouses. Stability is the top priority for this upstream data flow.
Scale at ByteDance
Multiple business lines (Douyin, Toutiao, Xigua Video, etc.) feed into the pipeline.
Peak traffic exceeds 100 million events per second , processing over a trillion events daily with petabyte‑scale storage increments.
More than 1,000 Flink jobs and 1,000 MQ topics are deployed across data centers, consuming over 500,000 CPU cores, with single tasks using up to 120,000 cores and a single MQ topic having up to 10,000 partitions.
Key Business Scenarios
UserAction ETL
In recommendation, only a subset of events is needed, requiring timely processing and dynamic ETL rule updates. The pipeline filters, maps, and standardizes events, labeling them with action types to produce UserAction data used for training recommendation models.
Data Sharding
To avoid redundant consumption of the full event stream by each downstream service, a sharding service consumes the upstream topic, applies routing rules, and forwards relevant events to smaller downstream topics, reducing resource usage and MQ bandwidth.
Disaster Recovery and Degradation
Multi‑datacenter deployment ensures continuity when a single data center fails, with rapid traffic switching. Degradation mechanisms handle traffic spikes (e.g., Spring Festival gala, e‑commerce promotions) by throttling or pausing event reporting to maintain stability.
Challenges
High traffic and diverse downstream demands create stability, cost, and performance challenges, requiring solutions that balance reliability with efficiency.
Practical Practices
ETL Pipeline Evolution
Before 2018: Python‑based PyJStorm with a flexible rule engine; stability and performance issues emerged as event volume grew.
2018‑2020: Migration to PyFlink, then Java Flink with a Groovy rule engine and Protobuf, achieving several‑fold performance gains.
2021‑present: Optimized engine performance using Janino (10× faster), established comprehensive event governance, and introduced tiered pipelines with different SLAs.
Rule Engine Development
The rule engine evolved from Python scripts (easy dynamic loading but slow) to Groovy (better performance) and finally to Janino, which compiles Java code on the fly, delivering a ten‑fold speed increase and supporting dynamic schema and routing updates without restarting Flink jobs.
Flink Task Splitting
Large Flink jobs are split into multiple sub‑tasks, each consuming a proportion of upstream partitions and writing to downstream topics, enabling gray‑scale releases, reducing restart impact, and allowing flexible resource allocation.
Disaster Recovery & Degradation
Multi‑datacenter deployment with synchronized MQ and Flink jobs enables minute‑level failover. Degradation strategies include server‑side rate limiting, client‑side back‑off, and selective reporting based on event priority, ensuring critical events remain unaffected during traffic surges.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
