Bilibili AI Collaboration Platform Based on AIFlow: Architecture, Evolution, and Stream‑Batch Fusion
Bilibili built an AI collaboration platform on AIFlow to simplify real-time machine-learning workflows. The platform evolved through three stages that added event-driven scheduling, UI-driven parameter management, version snapshots, and a stateless client-server design, and it supports stream-batch fusion for feature back-filling. Future work targets high availability, Airflow 2.0 compatibility, and richer streaming ML operators.
1. Background
Bilibili has a mature AI training and experimentation platform supporting recommendation, advertising, and search. As the business evolves, the AI team is shifting from offline to real‑time machine‑learning workflows. The existing real‑time platform built on Flink is widely used, but integrating resources such as Kafka, Flink, and KV stores makes experiment creation complex, time‑consuming, and hard to version‑control.
To simplify the end‑to‑end AI experiment flow, Bilibili built an AI collaboration platform on top of the open‑source AIFlow project from Alibaba.
2. Basic Architecture
AIFlow is an open‑source machine‑learning workflow framework that abstracts workflows as AIGraphs, supports event‑driven stream‑batch scheduling, and allows workflow definition via Python scripts, similar to Apache Airflow.
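The event-driven scheduling idea can be illustrated with a small, self-contained sketch. It does not use the AIFlow API: `Node`, `EventScheduler`, and the event keys below are hypothetical names chosen for illustration. The point is only that a job starts when an upstream event arrives, rather than at a fixed time.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One job in a toy workflow graph (hypothetical, not an AIFlow class)."""
    name: str
    action: Callable[[], None]
    start_on: List[str] = field(default_factory=list)  # event keys that trigger this job
    emits: List[str] = field(default_factory=list)     # event keys published when it finishes


class EventScheduler:
    """Runs jobs in response to events instead of on a clock."""

    def __init__(self, nodes: List[Node]) -> None:
        self.subscribers = defaultdict(list)
        for node in nodes:
            for key in node.start_on:
                self.subscribers[key].append(node)
        self.history: List[str] = []

    def publish(self, event_key: str) -> None:
        # Process events first-in, first-out; a finished job may publish more events.
        # This toy version assumes the graph has no event cycles.
        queue = deque([event_key])
        while queue:
            key = queue.popleft()
            for node in self.subscribers.get(key, []):
                node.action()
                self.history.append(node.name)
                queue.extend(node.emits)


nodes = [
    Node("flink_join", lambda: None, start_on=["samples_ready"], emits=["join_done"]),
    Node("train", lambda: None, start_on=["join_done"], emits=["model_ready"]),
    Node("validate", lambda: None, start_on=["model_ready"]),
]
scheduler = EventScheduler(nodes)
scheduler.publish("samples_ready")
print(scheduler.history)
```

A real scheduler would persist events and run jobs asynchronously; this sketch runs everything inline to keep the trigger chain visible.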
Key features include extensible plugins (OSS, S3, HDFS, Python, Bash, Flink, etc.) and the ability to run a Server and multiple Clients locally or remotely.
Bilibili’s platform evolved through three stages:
2.1 First Stage
The first stage converted offline ML pipelines (Spark, Hive) into real‑time Flink jobs and handled mixed stream‑batch dependencies. The team built a RemoteOperator that submits tasks to containers deployed on Kubernetes and includes recovery mechanisms. Feature‑crossing issues were addressed by inserting custom AIFlow operators and using watermark‑based notifications.
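One way to picture a watermark-based notification for mixed stream-batch dependencies: a batch step is released only once the stream's event-time watermark has passed the time that step needs. The sketch below is a plain-Python illustration under that assumption. `WatermarkNotifier`, `PendingTask`, and the task names are hypothetical and are not taken from AIFlow or from Bilibili's operators.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingTask:
    name: str
    required_event_time: int  # stream must have progressed to this event time (seconds)


class WatermarkNotifier:
    """Releases batch tasks when the stream watermark reaches their required time."""

    def __init__(self) -> None:
        self.watermark = -1
        self.pending: List[PendingTask] = []

    def register(self, task: PendingTask) -> None:
        self.pending.append(task)

    def advance(self, new_watermark: int) -> List[str]:
        # Watermarks only move forward; ignore stale or repeated values.
        if new_watermark <= self.watermark:
            return []
        self.watermark = new_watermark
        ready = [t for t in self.pending if t.required_event_time <= new_watermark]
        self.pending = [t for t in self.pending if t.required_event_time > new_watermark]
        return [t.name for t in ready]


notifier = WatermarkNotifier()
notifier.register(PendingTask("hourly_batch_join", required_event_time=3600))
notifier.register(PendingTask("daily_batch_train", required_event_time=86400))

first = notifier.advance(1800)   # stream has not reached the first hour yet
second = notifier.advance(3600)  # the hourly task becomes runnable
third = notifier.advance(3000)   # out-of-order watermark is ignored
print(first, second, third)
```

Gating on the watermark rather than on wall-clock time means a delayed stream delays the batch step too, instead of letting it run on incomplete data.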
2.2 Second Stage
After initial deployment, performance and stability were improved by adopting AIFlow’s EventBasedScheduler (based on Airflow 2.0). The internal data product “Ultron” was refined: workflow definitions were simplified, parameters moved to a front‑end UI, and version snapshots were introduced for quick rollback. Additional features such as metadata management, monitoring, and permission control were added.
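The version-snapshot idea can be sketched as an append-only store in which a rollback creates a new version that copies an older one, so history is never rewritten. This is an assumption about how such snapshots could work, not a description of Ultron's implementation. `SnapshotStore`, the workflow name, and the parameters are hypothetical.

```python
import copy
from typing import Any, Dict, List


class SnapshotStore:
    """Append-only store of workflow parameter snapshots (versions start at 1)."""

    def __init__(self) -> None:
        self._snapshots: List[Dict[str, Any]] = []

    def save(self, workflow: str, params: Dict[str, Any]) -> int:
        # Deep-copy so later edits to the caller's dict cannot change a stored version.
        self._snapshots.append({"workflow": workflow, "params": copy.deepcopy(params)})
        return len(self._snapshots)

    def get(self, version: int) -> Dict[str, Any]:
        if not 1 <= version <= len(self._snapshots):
            raise KeyError(f"unknown snapshot version: {version}")
        return copy.deepcopy(self._snapshots[version - 1])

    def rollback(self, version: int) -> int:
        # Rolling back re-saves the old snapshot as the newest version.
        old = self.get(version)
        return self.save(old["workflow"], old["params"])


store = SnapshotStore()
params = {"parallelism": 4, "topic": "samples_v1"}
v1 = store.save("ctr_training", params)
params["parallelism"] = 16
v2 = store.save("ctr_training", params)
v3 = store.rollback(v1)
print(v3, store.get(v3)["params"])
```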
2.3 Third Stage
The focus shifted to reducing custom modifications of AIFlow source code. The workflow engine was split into a stateless Client (handling RPC with a WebServer and translating business logic into AIFlow Project structures) and a Server that remains largely unchanged from AIFlow, with minimal container‑level adaptations. This design improves horizontal scalability and eases future upgrades.
3. Stream‑Batch Fusion
To support feature back‑filling for long‑window features, Bilibili chose to ingest historical HDFS data into the real‑time pipeline while preserving order. Challenges include ensuring data ordering, version consistency between features and samples, and performance for massive historical data.
Solutions:
Simulate Kafka‑like partitioned, ordered streams for HDFS data, using a custom HDFSSource (initially built with Spark, later to be migrated to Flink).
Store real‑time feature versions in HBase (supports versioned queries) and perform asynchronous pre‑loading for offline features.
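The two solutions above can be sketched in memory. The first function groups historical records by key hash and sorts each partition by event time, which mimics Kafka's per-partition ordering. The class keeps every timestamped version of a feature and answers "value as of time t" queries, which is the behavior a versioned KV store provides. Everything here is a simplified stand-in: `to_partitions`, `VersionedFeatures`, and the sample data are hypothetical, and no HDFS, Kafka, or HBase API is used.

```python
import bisect
import zlib
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional, Tuple

Record = Tuple[int, str, Any]  # (event_time, key, value)


def to_partitions(records: Iterable[Record], num_partitions: int) -> Dict[int, List[Record]]:
    """Group records by a stable key hash and order each partition by event time."""
    partitions: Dict[int, List[Record]] = defaultdict(list)
    for event_time, key, value in records:
        # crc32 is stable across runs, so a key always maps to the same partition.
        partitions[zlib.crc32(key.encode()) % num_partitions].append((event_time, key, value))
    for rows in partitions.values():
        rows.sort(key=lambda r: r[0])
    return partitions


class VersionedFeatures:
    """Keeps all (timestamp, value) versions per key for as-of lookups."""

    def __init__(self) -> None:
        self._times: Dict[str, List[int]] = defaultdict(list)
        self._values: Dict[str, List[Any]] = defaultdict(list)

    def put(self, key: str, ts: int, value: Any) -> None:
        i = bisect.bisect_right(self._times[key], ts)
        self._times[key].insert(i, ts)
        self._values[key].insert(i, value)

    def get_as_of(self, key: str, ts: int) -> Optional[Any]:
        # Latest version written at or before ts, or None if none exists yet.
        i = bisect.bisect_right(self._times[key], ts)
        return self._values[key][i - 1] if i else None


history = [(300, "u1", "click"), (100, "u1", "view"), (200, "u2", "view")]
parts = to_partitions(history, num_partitions=2)

features = VersionedFeatures()
features.put("u1", 200, {"clicks_7d": 5})
features.put("u1", 100, {"clicks_7d": 3})
print(features.get_as_of("u1", 150))
```

Joining each sample against the feature version that existed at the sample's event time is what keeps features and samples consistent during a back-fill.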
4. Outlook
Future work includes:
Improving the high availability of the EventSchedulerJob (currently limited by database locks).
Seamless compatibility with Airflow 2.0 tasks, enabling migration of existing Airflow clusters.
Extending AIFlow to pyFlink and Alink, adding richer streaming ML operators and enhancing Ultron’s feature‑management capabilities.
The team will also continue exploring Flink's batch capabilities for efficient feature back‑filling, with the goal of true stream‑batch integration across recommendation and advertising workloads.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
