How Bilibili Uses LLMs to Tame Massive Data Platform Failures
This article describes Bilibili's large‑scale data platform: its five‑layer, storage‑compute separated architecture, its daily workload of offline and real‑time tasks, and the common causes of task failures and slowdowns. It then explains why Bilibili is building an LLM‑powered intelligent assistant to help engineers troubleshoot more efficiently.
Background Introduction
1. Overall Architecture and Scale
Bilibili is a video‑sharing platform that generates massive volumes of data. Its big‑data platform supports many business lines, including AI and commerce.
The platform combines a “five‑layer integrated” design with storage‑compute separation. The bottom layer is a distributed file system, and the middle layer provides intelligent scheduling. Spark and Flink serve as the compute engines. Clients interact with the platform through real‑time streams (Kafka) and OLAP engines (ClickHouse), and custom tools and CI/CD platforms are also part of the stack.
The task volume is large: each day the platform runs about 270,000 offline tasks, roughly 20,000 ad‑hoc queries, and around 7,000 critical real‑time jobs. The support team receives thousands of inquiries every week, and each small team spends about three person‑days answering them, so dedicated staff are needed to handle questions about task failures and slowdowns.
2. User Problems
Users mainly ask two questions about offline computation: why a task fails and why it becomes slow.
(1) Why tasks fail
Kernel defects, especially after kernel upgrades without sufficient testing, can cause large‑scale failures.
Issues in dependent components: many tasks depend on one another, so a bug or an upgrade in one component can trigger failures elsewhere.
Data quality problems: corrupted or invalid input data can cause a task to fail.
Other causes, such as memory‑related issues.
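As a rough illustration of how the failure categories above could be used for a first triage pass, the sketch below matches log text against keyword patterns. The category names, the regular expressions, and the function `classify_failure` are all hypothetical and are not taken from Bilibili's system; real log signatures depend on the engines and versions in use.

```python
import re

# Hypothetical keyword patterns, one per failure category listed above.
# These are illustrative only; they are not Bilibili's actual rules.
FAILURE_PATTERNS = {
    "memory": re.compile(r"OutOfMemoryError|exceeding memory limits", re.I),
    "data_quality": re.compile(r"malformed|corrupt|invalid (record|input)", re.I),
    "dependency": re.compile(r"connection refused|service unavailable|timed? ?out", re.I),
    "kernel_defect": re.compile(r"internal error|assertion failed", re.I),
}


def classify_failure(log_text: str) -> str:
    """Return the first category whose pattern matches, or "unknown"."""
    for category, pattern in FAILURE_PATTERNS.items():
        if pattern.search(log_text):
            return category
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
    print(classify_failure("Malformed record found in input file"))
```

A rule table like this can only cover known signatures, which is one reason a more flexible, LLM‑driven approach may be attractive for the long tail of failures.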
(2) Why tasks become slow
Hardware aging: as storage scales out, disk wear can reduce read/write speeds.
Resource scheduling pressure: heavy user load and mixed deployment cause resource contention and task delays.
Data distribution issues: data skew or problematic datasets can degrade performance.
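Data skew, one of the slowdown causes above, can be illustrated with a simple check: compare the largest partition against the median partition size. This is a minimal sketch under assumed inputs; the function names and the threshold of 5.0 are hypothetical choices and do not come from the article.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the median partition size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    med = median(partition_sizes)
    if med == 0:
        # All-empty input is not skewed; otherwise the ratio is unbounded.
        return float("inf") if max(partition_sizes) > 0 else 1.0
    return max(partition_sizes) / med


def is_skewed(partition_sizes, threshold=5.0):
    """Flag a task whose largest partition dwarfs the typical one.

    The default threshold is an arbitrary illustrative value.
    """
    return skew_ratio(partition_sizes) >= threshold


if __name__ == "__main__":
    # Three partitions of 100 records and one of 1000: ratio is 10.0.
    print(skew_ratio([100, 100, 100, 1000]))
    print(is_skewed([90, 100, 110]))
```

The median is used instead of the mean so that a single oversized partition does not inflate the baseline it is compared against.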
Because diagnosing these failures and slowdowns is complex and time‑consuming, Bilibili is exploring intelligent, LLM‑driven methods to assist engineers.
Typical user queries are concise, often just a problem description with a link or a screenshot.
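Because such queries are mostly a short description plus a link, an assistant would likely need to separate the two before looking up task details. The sketch below shows one way to do that; the `ParsedQuery` type, the `parse_query` function, and the example URL are hypothetical and are not described in the source.

```python
import re
from dataclasses import dataclass, field

# Simple URL matcher; it stops at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


@dataclass
class ParsedQuery:
    description: str
    links: list = field(default_factory=list)


def parse_query(text: str) -> ParsedQuery:
    """Split a user query into free-text description and any links."""
    links = URL_RE.findall(text)
    # Remove the links, then collapse leftover whitespace.
    description = " ".join(URL_RE.sub(" ", text).split())
    return ParsedQuery(description=description, links=links)


if __name__ == "__main__":
    query = parse_query("My task failed, please check https://example.com/job/123")
    print(query.description)
    print(query.links)
```

Screenshots would need separate handling, which this sketch does not attempt.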
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
