Bilibili’s AI‑Powered Assistant: Solving Big Data Task Failures with LLMs
This article describes how Bilibili built an intelligent assistant, driven by large language models, that helps engineers diagnose and resolve failures in its large‑scale offline and real‑time data processing. It covers the platform’s five‑layer architecture, the common causes of task failures and slowdowns, and why AI‑assisted troubleshooting is needed.
This article shares Bilibili's practice of building an intelligent assistant powered by large language models (LLMs) to help engineers troubleshoot the company's massive big‑data platform.
The platform follows a "five‑layer integrated + storage‑compute separation" architecture. The bottom layer is a distributed file system. The middle layer provides intelligent scheduling and hosts compute engines such as Spark and Flink. The platform also includes clients, real‑time streams (Kafka), OLAP engines (ClickHouse), custom tools, and a CI/CD platform.
Every day the platform runs about 270,000 offline tasks, around 20,000 ad‑hoc queries, and roughly 7,000 critical real‑time jobs. The support team receives thousands of inquiries each week, and each small team spends about three person‑days on tickets, so dedicated staff are needed to answer questions about task failures and slowdowns.
Users mainly ask two questions about offline computation: why a task fails and why it becomes slow.
Task failures are often caused by kernel defects (e.g., kernel upgrades without sufficient testing), issues in dependent components (bugs or upgrades in shared resources), data quality problems, or other factors such as memory errors.
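As a rough illustration of how the failure categories above could be used for a first triage pass, the following Python sketch matches log text against keyword patterns. The category names, the patterns, and the `classify_failure` function are all hypothetical and chosen for illustration; they are not taken from Bilibili's system, and real logs would need far richer rules.

```python
import re
from typing import Dict, List

# Hypothetical keyword patterns for each failure category named in the text.
# These are illustrative assumptions, not rules from Bilibili's assistant.
FAILURE_PATTERNS: Dict[str, List[str]] = {
    "kernel_defect": [r"NullPointerException", r"internal error"],
    "dependency_issue": [r"connection refused", r"timed out|timeout"],
    "data_quality": [r"malformed", r"schema mismatch", r"cannot parse"],
    "memory": [r"OutOfMemoryError", r"GC overhead limit exceeded"],
}


def classify_failure(log_text: str) -> str:
    """Return the first category whose pattern matches the log, else 'unknown'."""
    for category, patterns in FAILURE_PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, log_text, flags=re.IGNORECASE):
                return category
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("java.lang.OutOfMemoryError: Java heap space"))
    print(classify_failure("Connection refused: metastore-host:9083"))
```

A rule table like this is easy to audit, but it only catches symptoms someone has already seen, which is one reason to pair it with a model that can reason over unfamiliar logs.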
Task slowdowns stem from hardware aging (disk wear affecting read/write speed), resource scheduling pressure under high load and mixed deployment across departments, and data distribution problems like data skew.
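Data skew, the last cause above, is the easiest of these to check numerically. The sketch below assumes per‑partition row counts are already available and flags a task as skewed when the largest partition is several times the median. The function names and the threshold of 5.0 are assumptions made for illustration, not values from the article.

```python
from statistics import median
from typing import Sequence


def skew_ratio(partition_rows: Sequence[int]) -> float:
    """Ratio of the largest partition to the median partition size."""
    if not partition_rows:
        raise ValueError("need at least one partition")
    mid = median(partition_rows)
    if mid == 0:
        # All-empty input is not skewed; any non-empty partition is extreme skew.
        return float("inf") if max(partition_rows) > 0 else 1.0
    return max(partition_rows) / mid


def is_skewed(partition_rows: Sequence[int], threshold: float = 5.0) -> bool:
    """Flag skew when the ratio reaches the (assumed) threshold."""
    return skew_ratio(partition_rows) >= threshold


if __name__ == "__main__":
    print(is_skewed([100, 100, 100, 1000]))  # one partition dominates
    print(is_skewed([100, 110, 90, 105]))    # roughly even partitions
```

The median is used rather than the mean because a single oversized partition pulls the mean upward and hides the very skew being measured.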
Because diagnosing these issues is time‑consuming and complex, Bilibili explores using AI‑driven assistance to automatically analyze failure symptoms and suggest solutions.
Typical user queries are concise, often just a problem description with a link or a screenshot, highlighting the need for an engineering‑focused, intelligent troubleshooting assistant.
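Because such queries carry so little context, an assistant would likely need to separate the description from any links and then attach diagnostic material before asking a model anything. The sketch below shows one possible shape for that step. `ParsedQuery`, `parse_query`, `build_prompt`, the example URL, and the prompt wording are all hypothetical, and no real LLM API is called.

```python
import re
from dataclasses import dataclass
from typing import List

URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class ParsedQuery:
    description: str
    links: List[str]


def parse_query(raw: str) -> ParsedQuery:
    """Split a terse user message into free text and any task links."""
    links = URL_PATTERN.findall(raw)
    # Remove the links, then collapse leftover whitespace.
    description = " ".join(URL_PATTERN.sub("", raw).split())
    return ParsedQuery(description=description, links=links)


def build_prompt(query: ParsedQuery, log_excerpt: str) -> str:
    """Assemble a troubleshooting prompt (hypothetical wording)."""
    link_text = "\n".join(query.links) if query.links else "(none provided)"
    return (
        "You are helping diagnose a big-data task problem.\n"
        f"User description: {query.description}\n"
        f"Task links:\n{link_text}\n"
        f"Log excerpt:\n{log_excerpt}\n"
        "Explain the likely cause and suggest a fix."
    )


if __name__ == "__main__":
    q = parse_query("task failed, please check https://example.com/task/123")
    print(build_prompt(q, "java.lang.OutOfMemoryError: Java heap space"))
```

In a real deployment the log excerpt would come from fetching the linked task's records; here it is passed in directly to keep the example self‑contained.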
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
