How Bilibili Uses LLMs to Diagnose Big Data Platform Issues
This article explains how Bilibili leverages a large‑language‑model‑driven assistant to diagnose and resolve failures and slowdowns in its massive big‑data platform, detailing the platform’s five‑layer architecture, common task issues, and the need for intelligent troubleshooting tools.
Background
This article shares Bilibili’s practice of building an intelligent assistant powered by large language models (LLMs) to help troubleshoot its massive big‑data platform.
1. Overall Architecture and Scale
Bilibili is a video‑sharing platform with huge data volumes. Its big‑data platform supports many business lines such as AI and commerce. The platform follows a “five‑layer integrated + storage‑compute separation” architecture: a distributed file system at the bottom; an intelligent scheduling layer in the middle; compute engines such as Spark and Flink; real‑time streams on Kafka and the ClickHouse OLAP engine; and, at the top, client tools and custom CI/CD pipelines.
The platform runs about 270,000 offline tasks, roughly 20,000 ad‑hoc queries, and 7,000 critical real‑time jobs daily. Support teams receive thousands of inquiries each week; each small team spends about three person‑days a week on tickets, so dedicated staff are needed to answer questions about task failures and slowdowns.
2. Users’ Problems
For offline computation, users mainly ask two questions: why a task fails and why it becomes slow.
Why tasks fail
Kernel defects – upgrades without sufficient testing can cause large‑scale failures.
Dependency component issues – bugs or upgrades in dependent services propagate failures.
Data quality problems – corrupted or invalid input data leads to failures.
Other causes, such as memory‑related errors (for example, out‑of‑memory failures).
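The failure categories above lend themselves to a first‑pass triage step before any deeper (or LLM‑assisted) analysis. Below is a minimal sketch of such a rule‑based log classifier; the rule table and patterns are illustrative assumptions, not Bilibili's actual rules.

```python
import re

# Hypothetical pattern table mapping log text to the coarse failure
# categories discussed above (illustrative patterns only).
FAILURE_RULES = [
    ("data quality", re.compile(r"corrupt|malformed|parse error", re.I)),
    ("memory", re.compile(r"OutOfMemory|GC overhead limit", re.I)),
    ("dependency", re.compile(r"connection refused|metastore.*timeout", re.I)),
    ("kernel defect", re.compile(r"NullPointerException|assertion failed", re.I)),
]

def classify_failure(log_excerpt: str) -> str:
    """Return a coarse failure category for an error-log excerpt,
    or "unknown" when no rule matches."""
    for category, pattern in FAILURE_RULES:
        if pattern.search(log_excerpt):
            return category
    return "unknown"
```

A heuristic like this can route the easy cases cheaply and reserve the intelligent assistant for logs that land in the "unknown" bucket.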
Why tasks become slow
Hardware aging – disk wear reduces read/write speed over time.
Resource scheduling pressure – massive user load and mixed‑deployment policies cause contention.
Data skew – uneven data distribution or problematic data slows processing.
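Of the slowdown causes above, data skew is the most mechanically detectable: if one partition is far larger than its peers, the task's straggler is usually processing it. A minimal sketch of a skew check over partition sizes (the threshold and metric are assumptions, not a platform standard):

```python
from statistics import median

def skew_ratio(partition_sizes: list[int]) -> float:
    """Ratio of the largest partition to the median partition size.

    A ratio well above 1 suggests data skew: one partition dominates
    and its task becomes the straggler.
    """
    if not partition_sizes:
        raise ValueError("no partitions given")
    med = median(partition_sizes)
    return max(partition_sizes) / med if med else float("inf")
```

For example, partitions of sizes `[100, 100, 100, 5000]` yield a ratio of 50, a strong skew signal, whereas uniform partitions yield a ratio near 1.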
Because diagnosing these causes is time‑consuming, Bilibili explores using intelligent methods (LLMs) to assist engineers.
Typical User Queries
Queries are usually terse: often just a one‑line problem description plus a link or a screenshot, with little of the context needed for diagnosis.
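Because user reports carry so little context, an LLM assistant must first enrich them with platform data before asking the model anything. Below is a minimal sketch of assembling such a prompt; the field names (`task_id`, `engine`) and the idea that the assistant has already fetched task metadata and a log tail are assumptions for illustration.

```python
def build_diagnosis_prompt(user_message: str, task_meta: dict, log_tail: str) -> str:
    """Assemble a context-rich diagnosis prompt from a terse user report.

    `task_meta` and `log_tail` stand in for context a real assistant
    would fetch from the platform (hypothetical fields, not an actual API).
    """
    return (
        "You are a big-data platform troubleshooting assistant.\n"
        f"User report: {user_message}\n"
        f"Task: {task_meta.get('task_id', 'unknown')}, "
        f"engine: {task_meta.get('engine', 'unknown')}\n"
        f"Recent log lines:\n{log_tail}\n"
        "Diagnose the likely cause (kernel defect, dependency issue, "
        "data quality, resource contention, or data skew) and suggest a fix."
    )
```

The design point is that the enrichment happens in code, deterministically, so the model only has to reason over evidence rather than guess at missing context.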
The excerpt is taken from the e‑book “A Plain‑spoken Large‑Model Handbook”.
DataFunTalk
