Bilibili's Saber Real-Time Computing Platform: Architecture, Challenges, and AI Integration

Zheng Zhisheng from Bilibili presents the Saber real-time computing platform, detailing its pain points, evolution, Apache Flink‑based architecture, SQL‑centric BSQL programming, DAG drag‑and‑drop design, AI use cases, and future development plans to improve scalability, operability, and AI integration.

DataFunTalk

Introduction

This article, authored by Zheng Zhisheng of Bilibili, introduces the real‑time computing platform Saber, built to address the diverse real‑time processing needs across Bilibili's business units, including AI recommendation, user growth analysis, and BI tasks.

1. Pain Points of Real‑Time Computing

Business teams face high development barriers due to varied languages and frameworks, leading to difficult management and maintenance. High operational costs arise from unstable Spark/YARN clusters and lack of unified monitoring, while AI‑driven real‑time scenarios demand low‑latency, reliable pipelines.

2. Requirements for a New Platform

Provide SQL‑based programming (BSQL) extending Flink SQL.

Support DAG drag‑and‑drop as well as native Jar development.

Offer integrated job management and operations.

3. Apache Flink‑Based Streaming Platform

The platform consists of real‑time transmission and computation layers, unified metadata, lineage, permission, and job‑operation management. Transmission ingests data from APP logs, DB binlogs, and system logs into Kafka or HDFS. Computation uses BSQL on top of Flink, scheduled by YARN.

4. Platform Architecture Evolution

4.1 Platform Architecture

The platform integrates transmission, computation, metadata, and operations. Flink runs in a pool, supporting various dimension tables (MySQL, Redis, HBase) and state stores (RocksDB, MapDB, Redis). Data flows from BSQL to sinks such as Kafka, HBase, ES, MySQL, and TiDB.

4.2 Development Architecture Design

Top layer: Saber‑Streamer for job submission and API management.

BSQL layer: SQL extension, custom operators.

Runtime layer: manages engine jobs (Spark Streaming, later Flink).

State storage layer: metrics and monitoring.

4.3 Design Principles

Abstract streaming workflows.

Enforce schema completeness.

Provide a generic BSQL parsing layer.

Improve engineering efficiency.
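One possible reading of the first two principles, sketched in plain Python rather than in Saber's own code: a workflow is a source table, a chain of transforms, and a sink table, and every row is checked against a fully declared schema at both ends. The names `Table`, `Workflow`, and the example columns are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class Table:
    """Hypothetical table declaration: a name plus a complete column -> type map."""
    name: str
    schema: Dict[str, type]

    def validate(self, row: Row) -> Row:
        # Reject rows that omit a declared column or carry the wrong type.
        missing = set(self.schema) - set(row)
        if missing:
            raise ValueError(f"{self.name}: missing columns {sorted(missing)}")
        for col, typ in self.schema.items():
            if not isinstance(row[col], typ):
                raise TypeError(f"{self.name}.{col}: expected {typ.__name__}")
        return row


@dataclass
class Workflow:
    """Source -> transforms -> sink, with schema checks at both boundaries."""
    source: Table
    sink: Table
    transforms: List[Callable[[Row], Row]] = field(default_factory=list)

    def run(self, rows: Iterable[Row]) -> List[Row]:
        out = []
        for row in rows:
            row = self.source.validate(row)
            for transform in self.transforms:
                row = transform(row)
            out.append(self.sink.validate(row))
        return out


# Usage: a click stream enriched with a constant score before reaching the sink.
clicks = Table("clicks", {"uid": int, "item": str})
scored = Table("scored", {"uid": int, "item": str, "score": float})
wf = Workflow(clicks, scored, [lambda r: {**r, "score": 1.0}])
result = wf.run([{"uid": 1, "item": "a"}])
```

Declaring the sink schema up front means a transform that forgets a column fails at the workflow boundary instead of inside a downstream consumer.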

5. Streaming Workflows and BSQL

Streaming workflows consist of Source → Transform (DAG) → Sink. BSQL builds on this, handling DDL extensions, data skew mitigation (bucket + hash), and approximate distinct counts using Redis HyperLogLog.
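The bucket + hash idea can be illustrated as a two-stage aggregation. This is a minimal sketch of the general technique, not BSQL's implementation: the bucket count, the use of CRC32 as the hash, and the function names are all assumptions made for the example.

```python
import zlib
from collections import Counter
from typing import Iterable, Tuple

NUM_BUCKETS = 8  # assumed fan-out applied to every key


def bucket_of(record_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    # CRC32 is stable across processes, unlike Python's salted built-in hash() for str.
    return zlib.crc32(record_id.encode("utf-8")) % num_buckets


def partial_counts(events: Iterable[Tuple[str, str]]) -> Counter:
    """Stage 1: count per (key, bucket), so a hot key spreads over several workers."""
    partial: Counter = Counter()
    for key, record_id in events:
        partial[(key, bucket_of(record_id))] += 1
    return partial


def final_counts(partial: Counter) -> Counter:
    """Stage 2: merge the per-bucket partial counts back into one count per key."""
    total: Counter = Counter()
    for (key, _bucket), n in partial.items():
        total[key] += n
    return total


# Usage: one hot key with 1000 records and one cold key with a single record.
events = [("hot", f"r{i}") for i in range(1000)] + [("cold", "x")]
partial = partial_counts(events)
totals = final_counts(partial)
```

In a distributed job the first stage would be keyed by `(key, bucket)` and the second by `key`, so the second stage receives at most `NUM_BUCKETS` partial results per key instead of every raw record.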

6. AI Case Study

AI pipelines require both offline and online training. Real‑time feature generation and label joins are performed via Saber‑BSQL, addressing data timeliness, engineering quality, and efficiency challenges.

7. SJoin (Streaming Join) Engineering

High‑throughput joins (e.g., feed vs. click streams) generate massive state and timer pressure. Optimizations include a custom PersistentTimerManager (disk‑spilled timers), RocksDB‑based timers, and Redis for ValueState. SQL syntax was extended to express delayed window joins.
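To see where the state and timer pressure comes from, consider a minimal in-memory sketch of a delayed window join that labels each feed impression as clicked or not once its window closes. It is an illustration under assumptions, not Saber's code: `DelayedLabelJoin` and its methods are hypothetical, and the heap below is exactly the kind of in-memory timer structure that the disk-spilled and RocksDB-based timers described above would replace.

```python
import heapq
from typing import Dict, List, Tuple


class DelayedLabelJoin:
    """Joins feed (impression) events with click events inside a fixed delay window."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.clicked: Dict[str, bool] = {}       # per-key state (a ValueState stand-in)
        self.timers: List[Tuple[int, str]] = []  # min-heap of (fire_time, key)

    def on_feed(self, key: str, ts: int) -> None:
        # First impression for a key creates state and registers one timer.
        if key not in self.clicked:
            self.clicked[key] = False
            heapq.heappush(self.timers, (ts + self.window, key))

    def on_click(self, key: str, ts: int) -> None:
        # Clicks for unknown or already-emitted impressions are dropped in this sketch;
        # the click timestamp is accepted but not otherwise used.
        if key in self.clicked:
            self.clicked[key] = True

    def advance(self, now: int) -> List[Tuple[str, int]]:
        """Fire every timer due at or before `now`; emit (key, label) and clear state."""
        out = []
        while self.timers and self.timers[0][0] <= now:
            _, key = heapq.heappop(self.timers)
            out.append((key, int(self.clicked.pop(key))))
        return out


# Usage: two impressions, one of which is clicked within the 10-unit window.
join = DelayedLabelJoin(window=10)
join.on_feed("a", 0)
join.on_feed("b", 1)
join.on_click("a", 5)
early = join.advance(9)    # nothing is due yet
first = join.advance(10)   # impression "a" closes as a positive sample
second = join.advance(11)  # impression "b" closes as a negative sample
```

Every open impression holds one state entry and one timer until its window expires, so memory grows with throughput multiplied by window length.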

8. DJoin (Dimension Table Join) Engineering

Dimension tables vary in size and update frequency. Small tables use Redis; large tables use HBase with dual‑cluster high‑availability, bulk‑load, and Hystrix for circuit‑breaking. SQL extensions enable batch key extraction and async I/O for efficient joins.
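Batch key extraction combined with async I/O can be sketched with `asyncio`. This is a hedged illustration only: the in-memory `FAKE_DIM` dictionary stands in for a remote store such as Redis or HBase, and `batch_get`, `dim_join`, and the `level` column are hypothetical names, not part of BSQL. Circuit breaking is left out.

```python
import asyncio
from typing import Dict, List, Optional

# Stand-in for a remote dimension store (Redis or HBase in the article).
FAKE_DIM: Dict[str, str] = {"u1": "gold", "u2": "silver"}
CALLS: List[List[str]] = []  # records each batched round trip, for illustration


async def batch_get(keys: List[str]) -> Dict[str, Optional[str]]:
    """One simulated round trip that fetches many keys at once."""
    CALLS.append(list(keys))
    await asyncio.sleep(0)  # yield to the event loop, as real network I/O would
    return {k: FAKE_DIM.get(k) for k in keys}


async def dim_join(rows: List[dict], key_col: str, batch_size: int = 2) -> List[dict]:
    # Extract the distinct join keys once, preserving first-seen order.
    keys = list(dict.fromkeys(r[key_col] for r in rows))
    batches = [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]
    # Issue all batched lookups concurrently instead of one blocking call per row.
    results = await asyncio.gather(*(batch_get(b) for b in batches))
    dim = {k: v for part in results for k, v in part.items()}
    # Missing dimension rows surface as None, similar to a left outer join.
    return [{**r, "level": dim[r[key_col]]} for r in rows]


# Usage: four rows, three distinct keys, two batched lookups.
rows = [{"uid": "u1"}, {"uid": "u2"}, {"uid": "u1"}, {"uid": "u3"}]
joined = asyncio.run(dim_join(rows, "uid"))
```

Four input rows cost two round trips here rather than four, and the lookups overlap in time, which is the effect that batching and async I/O aim for on a large dimension table.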

9. Future Directions

Enhance Saber IDE with version control, task debugging, resource management, and SLA‑based operations.

Advance AI capabilities: experiment‑driven SQL pipelines, unified batch/stream SQL, model evaluation, and alerting.

Continue to improve real‑time feature engineering and multi‑feature composition.

In summary, the Saber platform unifies real‑time data ingestion, processing, and AI model training on top of Apache Flink, offering SQL‑centric development, DAG visual programming, and robust operational tooling to meet Bilibili's large‑scale streaming requirements.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
