
How Bilibili Tackles Massive Big‑Data Task Failures with AI Assistants

This article explains Bilibili's large‑scale big‑data platform architecture, the huge volume of offline and real‑time tasks it handles, common failure and slowdown causes, and why the company is exploring AI‑driven assistants to help engineers troubleshoot these issues efficiently.

DataFunTalk

Background Introduction

Bilibili is a video‑sharing platform that generates massive volumes of data; its big‑data platform underpins many business lines, including AI and commerce.

1. Overall Architecture and Scale

The platform follows a "five‑layer integrated" plus "storage‑compute separation" design. The bottom layer is a distributed file system; the middle layer provides intelligent scheduling; compute engines such as Spark and Flink run alongside clients, real‑time streams (Kafka), OLAP engines (ClickHouse), and various custom tools and CI/CD pipelines.

The daily workload is huge: about 270,000 offline tasks, roughly 20,000 ad‑hoc queries, and around 7,000 critical real‑time jobs. The support team receives thousands of inquiries each week, and each sub‑team spends about three person‑days handling tickets, so dedicated staff are needed to answer questions about task failures or slowdowns.

Bilibili big‑data platform overview

2. User Issues

For offline computation, users mainly ask two questions: why a task failed and why it slowed down.

Why tasks fail

Kernel defects – upgrades without sufficient testing can cause large‑scale failures.

Dependency component problems – bugs or upgrades in heavily used components may break dependent tasks.

Data quality issues – corrupted or malformed data can trigger failures.

Other factors such as memory constraints.
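
As an illustration of how the failure categories above might be told apart automatically, here is a minimal rule‑based sketch. The category names mirror the list above, but the log patterns and the `classify_failure` helper are hypothetical assumptions for this example, not rules taken from Bilibili's platform.

```python
import re

# Hypothetical patterns, one per failure category listed above.
# They are illustrative assumptions, not Bilibili's actual diagnostic rules.
FAILURE_RULES = [
    ("memory", re.compile(r"OutOfMemoryError|exceeding memory", re.I)),
    ("data_quality", re.compile(r"malformed|corrupt|cannot parse", re.I)),
    ("dependency", re.compile(r"ClassNotFoundException|NoSuchMethodError|connection refused", re.I)),
    ("kernel_defect", re.compile(r"internal error|assertion failed", re.I)),
]


def classify_failure(log_text: str) -> str:
    """Return the first category whose pattern matches the log, else 'unknown'."""
    for category, pattern in FAILURE_RULES:
        if pattern.search(log_text):
            return category
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
    print(classify_failure("Malformed record at offset 12"))
```

Rules are checked in order, so a log that matches several patterns is assigned to the earliest category. Fixed patterns like these only cover known signatures, which is one reason such rules tend not to scale to the variety of causes described here.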

Why tasks slow down

Hardware aging – massive storage fleets experience wear, leading to slower I/O.

Resource scheduling pressure – high user volume stresses the scheduler, and mixed‑deployment policies cause resource contention.

Data distribution problems – data skew or inherent data issues degrade performance.
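
Of the slowdown causes above, data skew is the easiest to illustrate in code. The sketch below compares the largest key group with the mean group size; the `skew_ratio` and `is_skewed` helpers and the default threshold of 5.0 are assumptions chosen for illustration, not values from the article.

```python
from collections import Counter


def skew_ratio(keys):
    """Ratio of the largest key group to the mean group size.

    A value of 1.0 means every key has the same number of records;
    larger values mean one key dominates. Returns 0.0 for empty input.
    """
    counts = Counter(keys)
    if not counts:
        return 0.0
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


def is_skewed(keys, threshold=5.0):
    """Flag a key column as skewed; the threshold is an arbitrary assumption."""
    return skew_ratio(keys) >= threshold


if __name__ == "__main__":
    # One hot key with 91 records, nine other keys with 1 record each.
    sample = ["hot"] * 91 + [f"k{i}" for i in range(9)]
    print(skew_ratio(sample), is_skewed(sample))
```

In practice such a check would run on a sample of the join or group‑by key rather than the full dataset, since counting every record can itself be expensive.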

Because the causes are numerous and complex, manual diagnosis is time‑consuming, which is why the team is exploring intelligent assistance.

Task failure and slowdown factors

3. Need for AI‑Driven Help

User queries are often terse, containing just a problem description and a link or screenshot. Automating the analysis of such queries with large‑language‑model assistants can accelerate troubleshooting and reduce the operational burden.
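
A first step in automating such queries could be to split a terse message into its description and its links before any model is involved. The sketch below is a minimal illustration under stated assumptions: `DiagnosisRequest`, `parse_query`, `build_prompt`, and the prompt wording are all hypothetical, and no specific LLM API is assumed or called.

```python
import re
from dataclasses import dataclass, field

# Matches http(s) URLs up to the next whitespace or closing bracket.
# Trailing punctuation such as a full stop is not stripped.
URL_RE = re.compile(r"https?://[^\s)\]>]+")


@dataclass
class DiagnosisRequest:
    description: str
    links: list = field(default_factory=list)


def parse_query(raw: str) -> DiagnosisRequest:
    """Separate the free-text description from any task links in a user query."""
    links = URL_RE.findall(raw)
    description = " ".join(URL_RE.sub("", raw).split())
    return DiagnosisRequest(description=description, links=links)


def build_prompt(req: DiagnosisRequest) -> str:
    """Assemble a plain-text prompt; the wording is an illustrative assumption."""
    lines = [
        "A user reported a problem with a big-data task.",
        f"Description: {req.description or '(none given)'}",
    ]
    lines.extend(f"Task link: {url}" for url in req.links)
    lines.append("State whether this looks like a failure or a slowdown, "
                 "and list the evidence needed to confirm the cause.")
    return "\n".join(lines)


if __name__ == "__main__":
    req = parse_query("Task failed, please check https://example.com/task/42 thanks")
    print(build_prompt(req))
```

Screenshots, which the text also mentions, are not handled here; they would need a separate step such as OCR or a multimodal model.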

AI assistant concept
Intelligent assistance workflow
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
