How Bilibili Tackles Massive Big‑Data Task Failures with AI Assistants
This article explains Bilibili's large‑scale big‑data platform architecture, the huge volume of offline and real‑time tasks it handles, common failure and slowdown causes, and why the company is exploring AI‑driven assistants to help engineers troubleshoot these issues efficiently.
Background Introduction
Bilibili is a video‑sharing platform that generates massive volumes of data. Its big‑data platform underpins many business lines, including AI and commerce.
1. Overall Architecture and Scale
The platform follows a "five‑layer integrated" plus "storage‑compute separation" design. The bottom layer is a distributed file system, and the middle layer provides intelligent scheduling. Compute engines such as Spark and Flink run alongside clients, real‑time streams (Kafka), OLAP engines (ClickHouse), and various custom tools and CI/CD pipelines.
The daily workload is huge: about 270,000 offline tasks, roughly 20,000 ad‑hoc queries, and around 7,000 critical real‑time jobs. The support team receives thousands of inquiries each week. Each sub‑team spends about three person‑days on tickets, so dedicated staff are needed to answer questions about task failures or slowdowns.
2. User Issues
For offline computation, users mainly ask two questions: why a task failed and why it slowed down.
Why tasks fail
Kernel defects – upgrades without sufficient testing can cause large‑scale failures.
Dependency component problems – bugs or upgrades in heavily used components may break dependent tasks.
Data quality issues – corrupted or malformed data can trigger failures.
Other factors such as memory constraints.
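As a rough illustration of how the failure causes above might be triaged automatically, the sketch below matches an error log against keyword patterns for three of the categories. The patterns and the function name classify_failure are assumptions made for this example, not Bilibili's actual rules. Kernel defects are left out because they usually cannot be recognized from a single log line and need to be correlated with a recent upgrade.

```python
import re

# Hypothetical keyword patterns for illustration only; a real platform
# would derive these from its own historical failure logs.
FAILURE_PATTERNS = {
    "data_quality": [r"malformed", r"corrupt", r"not a parquet file"],
    "memory": [r"OutOfMemoryError", r"exceeding memory limits", r"Container killed"],
    "dependency": [r"ClassNotFoundException", r"NoSuchMethodError", r"Connection refused"],
}


def classify_failure(log_text: str) -> str:
    """Return the first category whose pattern appears in the log, else 'unknown'."""
    for category, patterns in FAILURE_PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, log_text, flags=re.IGNORECASE):
                return category
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
```

A keyword pass like this only narrows the search. Logs that fall into "unknown" still need a person or a more capable assistant to look at them.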
Why tasks slow down
Hardware aging – massive storage fleets experience wear, leading to slower I/O.
Resource scheduling pressure – high user volume stresses the scheduler, and mixed‑deployment policies cause resource contention.
Data distribution problems – data skew or inherent data issues degrade performance.
Because the causes are numerous and complex, manual diagnosis is time‑consuming, prompting the exploration of intelligent assistance.
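Data skew, one of the slowdown causes listed above, is a case where a simple numeric check can support diagnosis. The sketch below compares the largest partition of a stage with the median partition. The function names and the threshold of 5.0 are assumptions chosen for illustration, not values from the source.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the median partition size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    mid = median(partition_sizes)
    if mid == 0:
        # All-zero input is treated as balanced; any non-zero maximum is extreme skew.
        return float("inf") if max(partition_sizes) > 0 else 1.0
    return max(partition_sizes) / mid


def is_skewed(partition_sizes, threshold=5.0):
    """Flag a stage as skewed when the ratio reaches the (assumed) threshold."""
    return skew_ratio(partition_sizes) >= threshold


if __name__ == "__main__":
    sizes = [100, 100, 100, 1000]  # record counts per partition (made-up numbers)
    print(skew_ratio(sizes), is_skewed(sizes))
```

The median is used rather than the mean so that one oversized partition does not raise the baseline it is compared against.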
3. Need for AI‑Driven Help
User queries are often terse, containing just a problem description and a link or screenshot. Automating the analysis of such queries with large‑language‑model assistants can accelerate troubleshooting and reduce the operational burden.
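To make the idea concrete, here is a minimal sketch of how a terse query might be split into a description and a link before being handed to an assistant. All names here (ParsedQuery, parse_query, build_prompt) and the prompt wording are hypothetical. The source does not describe Bilibili's actual pipeline, and no real model API is called.

```python
import re
from dataclasses import dataclass
from typing import Optional

URL_RE = re.compile(r"https?://\S+")


@dataclass
class ParsedQuery:
    description: str
    link: Optional[str]


def parse_query(message: str) -> ParsedQuery:
    """Separate the first URL in a user message from the free-text description."""
    match = URL_RE.search(message)
    link = match.group(0) if match else None
    description = URL_RE.sub("", message).strip()
    return ParsedQuery(description=description, link=link)


def build_prompt(query: ParsedQuery, log_excerpt: str) -> str:
    """Assemble a plain-text prompt; the wording is an illustrative assumption."""
    return (
        "A user reported a problem with a big-data task.\n"
        f"Description: {query.description}\n"
        f"Task link: {query.link or 'not provided'}\n"
        f"Log excerpt:\n{log_excerpt}\n"
        "Explain the most likely cause and suggest a next step."
    )


if __name__ == "__main__":
    q = parse_query("task failed, please check https://example.com/job/123")
    print(build_prompt(q, "java.lang.OutOfMemoryError: Java heap space"))
```

Screenshots, which the article also mentions, are outside the scope of this sketch. Handling them would need a separate step such as text extraction.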
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
