How Bilibili Uses Large Language Models to Tackle Big Data Platform Issues
This article describes how Bilibili uses a large-language-model (LLM) assistant to diagnose and resolve offline task failures and slowdowns on its five-layer, storage-compute-separated big data platform, improving operational efficiency for the thousands of support queries the team handles.
Background Introduction
Bilibili is a video‑sharing platform with massive data volumes, and its big data platform supports many critical business lines such as AI and commerce.
The platform follows a “five-layer integrated,” “storage-compute separated” architecture: a distributed file system at the bottom; an intelligent scheduling layer; compute engines such as Spark and Flink; client tools; plus real-time streams (Kafka), an OLAP engine (ClickHouse), and custom CI/CD tooling.
Task volume is huge: about 270,000 offline tasks, roughly 20,000 ad-hoc queries, and 7,000 important real-time jobs run each day. The support team receives thousands of queries weekly, and each sub-team spends about three person-days per week on troubleshooting, requiring dedicated staff to handle task failures and slowdowns.
1. Overall Architecture and Scale
The architecture described above emphasizes the layered design and the massive scale of daily computation the platform must support.
2. User Problems
Users mainly encounter two issues with offline computation: task failures and task slowdowns.
Why tasks fail
Kernel defects – upgrades without sufficient testing can cause large‑scale failures.
Dependency component issues – bugs or upgrades in dependent services propagate failures.
Data quality problems – corrupted or invalid input data leads to failures.
Other factors – for example, memory constraints.
Why tasks become slow
Hardware aging – large storage volumes cause disk read/write speed degradation over time.
Resource scheduling pressure – high user load and mixed‑deployment mechanisms cause contention.
Data distribution issues – data skew or problematic data sets slow processing.
Because diagnosing these causes is time‑consuming, Bilibili explores intelligent, LLM‑based assistance to help users quickly identify and resolve failures or slowdowns.
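As a rough illustration of what such assistance might look like, the sketch below maps raw error text onto the cause categories listed above as a cheap first pass before escalating to an LLM. The signature patterns and category names are illustrative assumptions, not Bilibili's actual rules:

```python
import re

# Hypothetical first-pass triage rules: (pattern, cause category).
# First match wins; anything unmatched is handed to the LLM diagnosis.
TRIAGE_RULES = [
    (r"OutOfMemoryError|Container killed.*memory", "memory constraint"),
    (r"FileNotFoundException|corrupt|malformed", "data quality"),
    (r"Connection refused|Timeout.*(HDFS|Hive|Kafka)", "dependency component"),
    (r"data skew|straggler", "data distribution"),
]

def triage(error_text: str) -> str:
    """Return a coarse cause category for a task-failure log snippet."""
    for pattern, category in TRIAGE_RULES:
        if re.search(pattern, error_text, flags=re.IGNORECASE):
            return category
    return "unknown (escalate to LLM diagnosis)"
```

A cheap rule layer like this keeps the common, unambiguous cases out of the LLM entirely, reserving model calls for the long tail of genuinely hard diagnoses.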
Practical Insight
User queries are typically concise, often just a problem description with a link or screenshot, making automated assistance especially valuable.
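Because a link is often the only machine-readable part of such a message, an assistant's first step is plausibly to pull a task identifier out of it so logs can be fetched automatically. A minimal sketch, where the URL shape is an invented assumption rather than Bilibili's real scheduler URL:

```python
import re
from typing import Optional

# Assumed link shape: http://<scheduler-host>/task/<numeric id>
TASK_LINK = re.compile(r"https?://\S*?/task/(\d+)")

def extract_task_id(message: str) -> Optional[str]:
    """Return the task ID embedded in a terse user message, if any."""
    m = TASK_LINK.search(message)
    return m.group(1) if m else None
```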
DataFunSummit