How Bilibili Leverages Large Language Models to Automate Big Data Task Troubleshooting
This article explains Bilibili's large‑scale data platform architecture, the common offline‑task failures and slowdowns users encounter, and how a large language model‑driven intelligent assistant is being built to automatically diagnose and resolve these engineering problems.
Background Introduction
This article shares Bilibili's experience building an intelligent assistant agent based on large language models.
1. Overall Architecture and Scale
Bilibili is a video‑sharing platform that handles massive volumes of data. Its big‑data platform supports many business lines, including AI and commerce.
The platform follows a "five‑layer integrated, storage‑compute separated" architecture. A distributed file system sits at the bottom and an intelligent scheduling layer in the middle, together with compute engines such as Spark and Flink and a set of client tools. The platform also includes real‑time data streams (Kafka), an OLAP engine (ClickHouse), custom tools, and a CI/CD platform.
2. User Problems
Users of the offline compute system mainly face two issues: why tasks fail and why tasks become slower.
Why tasks fail
Kernel defects, especially after kernel upgrades without sufficient testing.
Dependency component problems; many tasks depend on shared components that may contain bugs or change behavior after an upgrade.
Data quality issues that cause failures.
Other reasons such as memory problems.
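A first pass over these failure categories can be sketched as a keyword match on the task's error log. This is a minimal illustration, not Bilibili's implementation: the function name, the category labels, and the patterns are all assumptions chosen for the example. The Java exception names are standard ones, and the "exceeding memory limits" phrase is only an illustrative pattern. Kernel defects rarely have a fixed signature, so anything unmatched falls through to "unknown" for deeper investigation.

```python
import re

# Hypothetical patterns per failure category. The categories follow the list
# above; the patterns themselves are illustrative assumptions.
FAILURE_PATTERNS = {
    "memory": [r"OutOfMemoryError", r"exceeding memory limits"],
    "data_quality": [r"NumberFormatException", r"malformed record"],
    "dependency": [r"ClassNotFoundException", r"NoSuchMethodError"],
}


def classify_failure(log_text: str) -> str:
    """Return the first category whose pattern appears in the log, else 'unknown'."""
    for category, patterns in FAILURE_PATTERNS.items():
        if any(re.search(p, log_text, re.IGNORECASE) for p in patterns):
            return category
    # Kernel defects and other causes have no simple keyword signature here.
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
```

A rule table like this only covers well-known signatures; it is the kind of baseline that a language-model assistant would be expected to go beyond.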
Why tasks become slower
Hardware aging; large‑scale storage disks degrade over time, reducing read/write speed.
Resource scheduling pressure; mixed deployment across departments can cause contention during peak periods.
Data distribution problems, including data skew.
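Of the slowdown causes above, data skew is the easiest to illustrate in code. One common heuristic compares the slowest task in a stage with the median task: a large ratio suggests that a few tasks are processing most of the data. The sketch below assumes per-task durations (in seconds) are already available from the engine's metrics; the function names and the threshold of 5.0 are hypothetical choices, not values from the article.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task's duration to the median task duration."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    med = median(task_durations_s)
    slowest = max(task_durations_s)
    if med == 0:
        # Avoid division by zero when most tasks finish instantly.
        return float("inf") if slowest > 0 else 1.0
    return slowest / med


def looks_skewed(task_durations_s, threshold=5.0):
    """Flag a stage as skewed when the ratio reaches the (assumed) threshold."""
    return skew_ratio(task_durations_s) >= threshold


if __name__ == "__main__":
    durations = [10, 12, 11, 9, 120]  # one straggler task
    print(looks_skewed(durations))
```

A high ratio alone does not prove skew: a straggler on an aging disk or a contended node produces the same pattern, so in practice the ratio would be checked against per-task input sizes as well.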
Diagnosing these failures and slowdowns by hand is complex and time‑consuming, which motivated the team to explore intelligent methods for assisting with troubleshooting.
User Query Characteristics
Typical user queries are highly engineering‑focused, often consisting of a brief problem description plus a link or screenshot, with little additional context.
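Because a query is usually just a short description plus a link, an assistant first has to separate the two and notice when there is too little to work with. The following is a minimal sketch of that preprocessing step under stated assumptions: the `ParsedQuery` type, the `parse_query` function, and the five-word cutoff are hypothetical, and screenshots are not handled.

```python
import re
from dataclasses import dataclass, field

# Simple URL matcher; treats any non-whitespace run after http(s):// as the link.
URL_RE = re.compile(r"https?://\S+")


@dataclass
class ParsedQuery:
    description: str
    links: list = field(default_factory=list)
    needs_more_context: bool = False


def parse_query(text: str, min_words: int = 5) -> ParsedQuery:
    """Split a user query into free-text description and links."""
    links = URL_RE.findall(text)
    # Remove the links, then collapse leftover whitespace.
    description = " ".join(URL_RE.sub(" ", text).split())
    # Assumed rule: no link and a very short description means we should ask back.
    needs_more = not links and len(description.split()) < min_words
    return ParsedQuery(description, links, needs_more)


if __name__ == "__main__":
    print(parse_query("Why did my task fail? http://example.com/job/123"))
```

The extracted link is what lets the assistant fetch logs and metrics on its own, compensating for the context the user did not supply.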
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
