How Bilibili Uses Large Language Models to Solve Big Data Platform Issues
This article explains Bilibili's massive data platform architecture, the common offline‑task failures and slowdowns users encounter, and how the company applies a large‑language‑model‑driven intelligent assistant to diagnose and resolve these engineering problems efficiently.
Introduction
This article shares Bilibili's practice of building an intelligent assistant based on large language models.
Background
Bilibili is a video‑sharing platform with massive data volumes. Its big‑data platform supports many business lines, including AI and commerce.
Overall Architecture and Scale
The platform follows a “five‑layer integrated” plus “storage‑compute separation” design. The bottom layer is a distributed file system. The middle layer holds an intelligent scheduler and compute engines such as Spark and Flink, alongside clients, real‑time streaming (Kafka), OLAP engines (ClickHouse), custom tools, and CI/CD pipelines.
Every day the platform runs about 270,000 offline tasks, around 20,000 ad‑hoc queries, and roughly 7,000 critical real‑time jobs. The support team receives thousands of inquiries each week, and each sub‑team spends about three person‑days on tickets, so dedicated staff are needed to handle task failures and slowdowns.
User Problems
For offline computation, users mainly ask two questions: why a task failed and why it became slow.
Why Tasks Fail
Kernel defects – upgrades without sufficient testing can cause large‑scale failures.
Dependency issues – component upgrades or bugs in shared resources can break dependent tasks.
Data quality problems – corrupted or malformed data leads to failures.
Other causes – for example, memory errors.
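As an illustration of how the failure categories above might be triaged automatically, here is a minimal rule-based sketch. It is not Bilibili's implementation: the function name `classify_failure`, the category labels, and the keyword patterns are all assumptions, and real log formats differ by engine and version. Kernel defects are deliberately left out because they rarely show up as a single recognizable keyword, so anything unmatched falls through to `unknown` for a human to review.

```python
import re

# Hypothetical keyword rules, for illustration only.
# The exception names are standard JVM errors; the mapping to
# categories is an assumption, not taken from the article.
FAILURE_RULES = [
    ("memory", re.compile(r"OutOfMemoryError|exceeding memory limits", re.I)),
    ("dependency", re.compile(r"ClassNotFoundException|NoSuchMethodError", re.I)),
    ("data_quality", re.compile(r"malformed|corrupt|cannot parse", re.I)),
]


def classify_failure(log_text: str) -> str:
    """Return the first matching failure category, or 'unknown'."""
    for category, pattern in FAILURE_RULES:
        if pattern.search(log_text):
            return category
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
```

Rules like these only cover failures with a stable textual signature, which is one reason a purely rule-based approach tends to leave a long tail of cases for manual diagnosis.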
Why Tasks Slow Down
Hardware aging – disks in the large storage fleet wear out, reducing read/write speed.
Resource scheduling pressure – high user volume and mixed deployment cause contention.
Data skew – uneven data distribution or problematic data slows processing.
Diagnosing these causes is labor‑intensive, prompting the exploration of intelligent methods to assist.
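Of the slowdown causes listed, data skew is the easiest to show with a small calculation. The sketch below compares the largest partition with the median partition size. The helper names `skew_ratio` and `is_skewed` and the default threshold of 5.0 are assumptions chosen for illustration, not values from the article.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the median partition size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    mid = median(partition_sizes)
    largest = max(partition_sizes)
    if mid == 0:
        # Avoid division by zero: any non-empty partition counts as extreme skew.
        return float("inf") if largest > 0 else 1.0
    return largest / mid


def is_skewed(partition_sizes, threshold=5.0):
    """Flag a stage as skewed when the ratio reaches the (assumed) threshold."""
    return skew_ratio(partition_sizes) >= threshold


if __name__ == "__main__":
    sizes = [100, 100, 100, 1000]  # record counts per partition (made-up numbers)
    print(skew_ratio(sizes), is_skewed(sizes))
```

The median is used instead of the mean so that one oversized partition does not inflate the baseline it is compared against.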
Need for Intelligent Assistance
Users typically submit concise, engineering‑focused questions: often just a short problem description with a link or screenshot, and little supporting context.
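To make the missing-context problem concrete, the following sketch separates a terse question into its free text and any links, then notes what an assistant would still need to ask for. Everything here is hypothetical: the `ParsedQuestion` structure, the `parse_question` helper, the symptom keywords, and the example URL are assumptions for illustration and do not describe how Bilibili's assistant works.

```python
import re
from dataclasses import dataclass, field

# Simple URL matcher; good enough for a sketch, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
# Assumed symptom keywords, for illustration only.
SYMPTOM_PATTERN = re.compile(r"fail|error|exception|slow", re.I)


@dataclass
class ParsedQuestion:
    text: str
    links: list = field(default_factory=list)
    missing: list = field(default_factory=list)


def parse_question(message: str) -> ParsedQuestion:
    """Split a user message into text and links, and list missing context."""
    links = URL_PATTERN.findall(message)
    text = URL_PATTERN.sub("", message).strip()
    missing = []
    if not links:
        missing.append("task link")
    if not SYMPTOM_PATTERN.search(text):
        missing.append("symptom description")
    return ParsedQuestion(text=text, links=links, missing=missing)


if __name__ == "__main__":
    q = parse_question("Why did this task fail? https://example.com/task/123")
    print(q.links, q.missing)
```

A screenshot cannot be handled this way, which hints at why short, link-or-image questions are hard to answer without further enrichment.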
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
