
Real‑time Billion‑Scale Data Transmission and AI Pipeline Architecture at Bilibili

This article presents a technical deep‑dive into Bilibili’s evolution from offline to real‑time data processing, describing the challenges of timeliness, ETL, AI feature engineering, and the design of a Flink‑on‑YARN incremental pipeline that supports trillion‑scale message throughput and AI‑driven real‑time applications.


The presentation, delivered by Zheng Zhi‑sheng, head of Bilibili’s real‑time big‑data platform, outlines the historical context of Bilibili’s data processing, the shift from offline (day‑level) to real‑time (minute‑level) workloads, and the resulting requirements for high data freshness across recommendation, advertising, and monitoring scenarios.

Key pain points identified include:

Insufficient compute capability in the original Flume‑to‑HDFS pipeline, leading to data loss and high duplication rates.

Resource contention during nightly batch windows, causing scheduler overload.

Difficulty bridging the gap between pure real‑time and offline processing, especially for large‑scale streams such as Bilibili’s comment data.

Complexity of AI real‑time feature engineering, including duplicated logic between online and offline, long feature pipelines, and scaling challenges for algorithm teams.

To address these issues, the team built a Flink‑centric ecosystem:

Flink on YARN incremental pipeline: a pipeline DAG carrying trillion-scale message volumes with more than 30,000 compute cores, 1,000+ jobs, and 100+ users, built from KafkaSource, parser, custom ETL, and exporter modules.

Data model convergence via Protobuf‑to‑Parquet mapping, enabling strong schema enforcement and efficient storage.

Partition‑ready notification using Watermark propagation to trigger downstream Hive or Kafka actions once a partition is complete.

CDC pipeline integrating MySQL binlog with HUDI for upsert handling.

Stability improvements: asynchronous HDFS close, checkpoint parallelism, small‑file merging, and fault‑tolerant partition recovery.
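The Protobuf-to-Parquet convergence mentioned above rests on a deterministic mapping from Protobuf scalar types to Parquet physical and logical types. A minimal sketch of such a mapping (the table and function names here are illustrative, not Bilibili's actual schema tooling, which would derive schemas from compiled Protobuf descriptors):

```python
# Illustrative mapping from Protobuf scalar types to Parquet
# physical types (plus an optional logical-type annotation).
PROTO_TO_PARQUET = {
    "int32":  ("INT32", None),
    "int64":  ("INT64", None),
    "float":  ("FLOAT", None),
    "double": ("DOUBLE", None),
    "bool":   ("BOOLEAN", None),
    "string": ("BYTE_ARRAY", "UTF8"),
    "bytes":  ("BYTE_ARRAY", None),
}

def map_schema(proto_fields):
    """Translate [(name, proto_type)] into Parquet column specs,
    failing fast on uncovered types -- strong schema enforcement
    rather than a silent fallback to strings."""
    columns = []
    for name, ptype in proto_fields:
        if ptype not in PROTO_TO_PARQUET:
            raise TypeError(f"unsupported Protobuf type: {ptype}")
        physical, logical = PROTO_TO_PARQUET[ptype]
        columns.append({"name": name, "physical": physical, "logical": logical})
    return columns
```

Failing fast on unknown types is what makes the schema enforcement "strong": a new Protobuf field type cannot slip into storage without an explicit mapping decision.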
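The watermark-driven partition-ready notification can be approximated with plain watermark bookkeeping: a partition is declared complete once the minimum watermark across all subtasks has passed the partition's end timestamp, at which point the downstream Hive/Kafka action fires exactly once. A simplified single-process sketch under that assumption (class and parameter names are illustrative):

```python
class PartitionNotifier:
    """Fire a callback once every subtask's watermark has passed
    the end timestamp of a time partition."""

    def __init__(self, num_subtasks, partition_end_ts, on_ready):
        self.watermarks = [float("-inf")] * num_subtasks
        self.partition_end_ts = partition_end_ts
        self.on_ready = on_ready
        self.notified = False

    def advance(self, subtask, watermark):
        # Watermarks are monotonically non-decreasing per subtask.
        self.watermarks[subtask] = max(self.watermarks[subtask], watermark)
        # The pipeline-wide watermark is the MINIMUM over subtasks:
        # only then can no earlier event still be in flight.
        if not self.notified and min(self.watermarks) >= self.partition_end_ts:
            self.notified = True
            self.on_ready(self.partition_end_ts)
```

Taking the minimum over subtasks is the essential point: one straggling Kafka partition holds the notification back, which is what keeps downstream Hive partitions from being published while late data is still arriving.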
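The upsert semantics that HUDI provides for the MySQL binlog CDC pipeline boil down to folding an ordered change stream into a primary-key-indexed snapshot, keeping only the latest version per key. A minimal sketch of that semantics (the event shape is hypothetical, not HUDI's or Debezium's actual record format):

```python
def apply_binlog(snapshot, events):
    """Fold binlog change events into a primary-key-indexed snapshot,
    keeping only the latest row per key (the upsert behavior the CDC
    pipeline delegates to HUDI). Each event is a dict with keys
    'op' (insert/update/delete), 'pk', 'ts', and 'row'."""
    for ev in sorted(events, key=lambda e: e["ts"]):  # replay in commit order
        if ev["op"] in ("insert", "update"):
            snapshot[ev["pk"]] = ev["row"]       # upsert: last write wins
        elif ev["op"] == "delete":
            snapshot.pop(ev["pk"], None)         # tombstone removes the key
    return snapshot
```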

Feature engineering challenges are tackled by:

Transforming minute‑level sliding windows into Group‑By‑based aggregations with UDFs backed by RocksDB, reducing timer overhead.

Supporting array‑level dimension table lookups via optimized HBase/ClickHouse integration.

Ensuring consistency between real‑time and offline feature calculations through unified SQL semantics.
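The sliding-window-to-group-by transformation above can be sketched in plain Python: events are rolled up into per-key minute buckets (the partial aggregates a RocksDB-backed UDF would keep as state), and a sliding-window query is answered by summing the last N buckets instead of registering one timer per window instance. Function names and the event shape are illustrative:

```python
from collections import defaultdict

def bucket_aggregate(events, bucket_ms=60_000):
    """Group-by aggregation: roll (ts, key, value) events up into
    (key, minute-bucket) partial sums -- the compact state a
    RocksDB-backed UDF would maintain."""
    buckets = defaultdict(float)
    for ts, key, value in events:
        buckets[(key, ts // bucket_ms)] += value
    return buckets

def sliding_sum(buckets, key, end_bucket, window_buckets):
    """Answer a sliding-window query by summing the last N partial
    buckets, avoiding per-window timer registration entirely."""
    return sum(
        buckets.get((key, b), 0.0)
        for b in range(end_bucket - window_buckets + 1, end_bucket + 1)
    )
```

The design trade is explicit here: window results become a cheap read over already-materialized partial aggregates, at the cost of window boundaries being quantized to the bucket size.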

The end‑to‑end AI workflow is orchestrated with AIFlow, allowing Python‑defined DAGs that combine streaming Flink jobs and batch Spark jobs, driven by event‑based watermark notifications. This enables seamless experiment iteration, model training, and online serving.
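The event-driven orchestration described above can be illustrated with a minimal scheduler in which tasks subscribe to named events and an incoming event, such as a partition-ready watermark from a streaming Flink job, runs its subscribers, which may in turn emit events that trigger batch training and deployment. This is a sketch of the pattern only; the decorator API below is hypothetical and not AIFlow's actual interface:

```python
from collections import defaultdict

class EventDrivenDAG:
    """Minimal event-driven orchestration sketch: tasks subscribe to
    named events; emitting an event runs its subscribers in order."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.log = []  # records task execution order for inspection

    def task(self, name, triggered_by):
        def register(fn):
            self.subscribers[triggered_by].append((name, fn))
            return fn
        return register

    def emit(self, event, payload=None):
        for name, fn in self.subscribers[event]:
            self.log.append(name)
            fn(payload)

dag = EventDrivenDAG()

@dag.task("train_model", triggered_by="partition_ready")
def train(payload):
    # A batch Spark training job would run here; on success it
    # emits its own event so deployment is chained, not scheduled.
    dag.emit("model_trained", payload)

@dag.task("deploy_model", triggered_by="model_trained")
def deploy(payload):
    pass  # push the new model to online serving
```

The point of the pattern is that nothing polls or runs on a fixed cron: the watermark event itself advances the workflow, which is what lets streaming and batch stages interleave in one DAG.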

Future directions focus on lakehouse integration (Flink + Iceberg/HUDI) for ODS‑to‑DW incremental processing and further simplification of the real‑time AI experimentation platform.

Tags: big data, data pipeline, Flink, AI, real-time streaming, YARN, incremental processing
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
