Bilibili’s AI Assistant: Using Large Language Models to Tackle Big Data Ops
This article explains how Bilibili's massive video platform built a five‑layer big‑data infrastructure with storage‑compute separation. It then describes an intelligent assistant, driven by a large language model, that automatically diagnoses and resolves frequent offline task failures and slowdowns, addressing the most common user questions about task reliability and performance.
Background Introduction
Bilibili is a video‑sharing platform with massive data volumes. Its big‑data platform must support numerous business lines, including AI and commercial applications.
The platform follows a "five‑layer integrated" architecture with storage‑compute separation:
A distributed file system at the bottom.
An intelligent scheduling layer in the middle.
Compute engines such as Spark and Flink, alongside real‑time data streams via Kafka and OLAP engines like ClickHouse.
Client tools for users.
Custom tooling and CI/CD pipelines.
The daily workload is huge: about 270,000 offline tasks, roughly 20,000 ad‑hoc queries, and around 7,000 critical real‑time jobs. The support team receives thousands of inquiries each week, with each sub‑team spending roughly three person‑days on tickets and requiring dedicated staff to answer questions about task failures and slowdowns.
1. Overall Architecture and Scale
The platform processes an enormous amount of data, and its components are tightly coupled, making troubleshooting complex.
2. User Problems
Users mainly encounter two issues with offline computation: task failures and task slowdowns.
(1) Why do tasks fail?
Kernel defects: upgrades without sufficient testing can cause large‑scale failures.
Dependency component issues: bugs or upgrades in dependent services can break tasks.
Data quality problems: corrupt or malformed data leads to failures.
Other factors such as memory errors may also contribute.
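To illustrate how failure causes like these might be triaged automatically as a first pass, here is a minimal sketch. The log patterns and category names are hypothetical examples, not Bilibili's actual rules; a production system would draw on far richer signals (metrics, lineage, task history).

```python
import re

# Hypothetical log-signature rules mapping error-log text to the
# failure categories discussed above (illustrative patterns only).
FAILURE_RULES = [
    (r"OutOfMemoryError|Container killed .* memory", "memory error"),
    (r"FileNotFoundException|corrupt|Malformed", "data quality problem"),
    (r"Connection refused|NoSuchMethodError", "dependency component issue"),
    (r"NullPointerException in (scheduler|engine)", "possible kernel defect"),
]

def triage_failure(log_text: str) -> str:
    """Return a first-guess failure category for an offline task log."""
    for pattern, category in FAILURE_RULES:
        if re.search(pattern, log_text):
            return category
    return "unknown (needs manual investigation)"
```

A rule table like this cannot cover everything, which is exactly why an LLM layer is attractive: the rules handle the routine signatures, and the model handles the long tail.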
(2) Why do tasks become slow?
Hardware aging: large storage volumes experience degraded read/write speeds over time.
Resource scheduling pressure: high user volume stresses the scheduler, and mixed‑deployment mechanisms cause resource contention across departments.
Data skew or inherent data problems cause performance degradation.
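Data skew in particular is easy to quantify: if one partition is far larger than its peers, the task processing it dominates the job's runtime. A minimal sketch of such a check (the function name and threshold semantics are illustrative, not the platform's actual metric):

```python
from statistics import median

def skew_ratio(partition_sizes: list[int]) -> float:
    """Ratio of the largest partition to the median partition size.

    A ratio far above 1 suggests data skew: one task handles far more
    data than its peers and becomes the job's straggler.
    """
    if not partition_sizes:
        raise ValueError("no partitions")
    med = median(partition_sizes)
    return max(partition_sizes) / med if med else float("inf")
```

For example, partitions of sizes [100, 110, 95, 4000] yield a ratio of roughly 38, a strong hint that the slow task is skew‑bound rather than resource‑starved.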
Because the causes of failures and slowdowns are numerous and complex, manual investigation is time‑consuming. The team therefore explored an intelligent, LLM‑powered assistant that helps diagnose and resolve these issues automatically.
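At its core, such an assistant gathers task context, assembles a diagnostic prompt, and hands it to a model. The sketch below shows that flow under stated assumptions: the field names are hypothetical, and `call_llm` is a placeholder for whatever model endpoint the platform uses, not a real API.

```python
def build_diagnosis_prompt(task_name: str, error_log: str,
                           runtime_stats: dict) -> str:
    """Assemble a diagnostic prompt from task context (hypothetical fields)."""
    return (
        "You are a big-data operations assistant.\n"
        f"Task: {task_name}\n"
        f"Runtime stats: {runtime_stats}\n"
        f"Error log (truncated):\n{error_log[:2000]}\n"
        "Classify the likely cause (kernel defect, dependency issue, "
        "data quality, resource contention, data skew) and suggest a fix."
    )

def diagnose(task_name: str, error_log: str, runtime_stats: dict, call_llm):
    # call_llm is injected: any callable that takes a prompt string and
    # returns the model's answer (placeholder for the real model client).
    return call_llm(build_diagnosis_prompt(task_name, error_log, runtime_stats))
```

Injecting the model client as a callable keeps the diagnostic logic testable without a live model, which matters when the assistant sits in an on‑call path.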
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
