
How Bilibili Tackles Massive Big‑Data Task Failures with AI Assistants

This article explains Bilibili's large‑scale big‑data platform architecture, the huge volume of offline and real‑time tasks it handles, common failure and slowdown causes, and why the company is exploring AI‑driven assistants to help engineers troubleshoot these issues efficiently.

DataFunTalk

Oct 6, 2025


Background Introduction

Bilibili is a video‑sharing platform that generates massive volumes of data; its big‑data platform underpins many business lines, including AI and commerce.

1. Overall Architecture and Scale

The platform combines a "five‑layer integrated" design with storage‑compute separation. The bottom layer is a distributed file system, and the middle layer provides intelligent scheduling. Compute engines such as Spark and Flink run alongside clients, real‑time streaming (Kafka), OLAP engines (ClickHouse), custom tools, and CI/CD pipelines.

The daily workload is large: about 270,000 offline tasks, roughly 20,000 ad‑hoc queries, and around 7,000 critical real‑time jobs. The support team receives thousands of inquiries each week, and each sub‑team spends about three person‑days on tickets, so dedicated staff are needed to answer questions about task failures and slowdowns.

2. User Issues

For offline computation, users mainly ask two questions: why a task failed and why it slowed down.

Why tasks fail

Kernel defects – upgrades without sufficient testing can cause large‑scale failures.

Dependency component problems – bugs or upgrades in heavily used components may break dependent tasks.

Data quality issues – corrupted or malformed data can trigger failures.

Other factors such as memory constraints.
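As a rough illustration of how the failure categories above could be turned into a first‑pass triage step, the sketch below matches log text against a small rule table. It is not Bilibili's implementation: the function name, category labels, and log patterns are all assumptions chosen for the example, and a cause such as a kernel defect would more plausibly be found by correlating failures with recent releases than by pattern matching.

```python
import re

# Hypothetical rule table: category labels and patterns are illustrative
# assumptions, not taken from the original article.
FAILURE_RULES = [
    ("memory", re.compile(r"OutOfMemoryError|exceeding memory", re.I)),
    ("data_quality", re.compile(r"malformed|corrupt|cannot parse", re.I)),
    ("dependency", re.compile(r"connection refused|ClassNotFoundException", re.I)),
]


def classify_failure(log_text: str) -> str:
    """Return the first category whose pattern appears in the log, else 'unknown'."""
    for category, pattern in FAILURE_RULES:
        if pattern.search(log_text):
            return category
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
    print(classify_failure("Malformed record at line 3"))
    print(classify_failure("task finished normally"))
```

Rules are checked in order, so the more specific categories should come first. Anything that falls through to "unknown" would still need a human or a more capable assistant.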

Why tasks slow down

Hardware aging – massive storage fleets experience wear, leading to slower I/O.

Resource scheduling pressure – high user volume stresses the scheduler, and mixed‑deployment policies cause resource contention.

Data distribution problems – data skew or inherent data issues degrade performance.
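Data skew, the last cause listed above, is one of the few slowdown causes that can be checked with simple arithmetic. A minimal sketch, assuming per‑task durations for a single stage are available as a list of seconds: compare the slowest task with the median task. The function names and the threshold of 5.0 are assumptions for illustration, not values from the article.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task duration to the median duration in one stage."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    slowest = max(task_durations_s)
    mid = median(task_durations_s)
    if mid <= 0:
        # Avoid dividing by zero when most tasks finish instantly.
        return float("inf") if slowest > 0 else 1.0
    return slowest / mid


def looks_skewed(task_durations_s, threshold=5.0):
    """Flag a stage whose slowest task runs far longer than a typical task."""
    return skew_ratio(task_durations_s) >= threshold


if __name__ == "__main__":
    print(looks_skewed([10, 10, 10, 100]))  # one straggler dominates the stage
    print(looks_skewed([10, 12, 11]))       # durations are roughly even
```

The median is used rather than the mean because a single straggler inflates the mean and hides the very imbalance being measured.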

Because the causes are numerous and complex, manual diagnosis is time‑consuming, prompting the exploration of intelligent assistance.

3. Need for AI‑Driven Help

User queries are often terse, containing just a problem description and a link or screenshot. Automating the analysis of such queries with large‑language‑model assistants can accelerate troubleshooting and reduce the operational burden.
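Before a language model can help, a terse query has to be turned into something structured. The sketch below shows one possible pre‑processing step under stated assumptions: it pulls links out of the message, guesses whether the user is asking about a failure or a slowdown from a few keywords, and assembles a prompt. The names `ParsedQuery`, `parse_query`, and `build_prompt`, the keyword lists, and the prompt wording are all hypothetical; the article does not describe how Bilibili's assistant handles this.

```python
import re
from dataclasses import dataclass, field

URL_RE = re.compile(r"https?://\S+")

# Hypothetical keyword lists used only to guess the user's intent.
FAILURE_WORDS = ("fail", "error", "crash")
SLOWDOWN_WORDS = ("slow", "stuck", "delay")


@dataclass
class ParsedQuery:
    description: str
    links: list = field(default_factory=list)
    intent: str = "unknown"  # "failure", "slowdown", or "unknown"


def parse_query(text: str) -> ParsedQuery:
    """Split a raw user message into description, links, and a guessed intent."""
    links = URL_RE.findall(text)
    description = URL_RE.sub("", text).strip()
    lowered = description.lower()
    if any(word in lowered for word in FAILURE_WORDS):
        intent = "failure"
    elif any(word in lowered for word in SLOWDOWN_WORDS):
        intent = "slowdown"
    else:
        intent = "unknown"
    return ParsedQuery(description, links, intent)


def build_prompt(query: ParsedQuery) -> str:
    """Assemble a plain-text prompt; the wording is an illustrative assumption."""
    link_text = ", ".join(query.links) if query.links else "none provided"
    return (
        f"A user reports a task problem (suspected type: {query.intent}).\n"
        f"Description: {query.description}\n"
        f"Links: {link_text}\n"
        "List the most likely causes and the evidence needed to confirm each."
    )


if __name__ == "__main__":
    q = parse_query("My task failed, please check https://example.com/job/123")
    print(build_prompt(q))
```

Screenshots are left out of the sketch; handling them would need an image‑to‑text step in front of `parse_query`.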


