Bilibili’s AI Assistant: Using Large Language Models to Tackle Massive Data Tasks
This article explains how Bilibili leverages a large‑language‑model‑based intelligent agent to diagnose and resolve failures and slowdowns in its massive big‑data platform, detailing the platform architecture, workload scale, common user issues, and the need for automated assistance.
Introduction
This article shares Bilibili's experience building an intelligent agent assistant, powered by large language models, that helps users troubleshoot problems on its large-scale data platform.
Background
Bilibili is a video‑sharing platform with massive data volumes, and its big‑data platform underpins many businesses such as AI and commerce. The platform follows a “five‑layer integrated, storage‑compute separated” architecture: a distributed file system at the bottom; an intelligent scheduling layer in the middle; compute engines such as Spark and Flink; real‑time streams on Kafka; OLAP engines such as ClickHouse; client tools; and custom CI/CD pipelines.
Scale and Challenges
The platform runs about 270,000 offline tasks, 20,000 ad‑hoc queries, and 7,000 real‑time jobs daily. Support volume is also high: teams field thousands of inquiries per week, and each team spends roughly three person‑days per week on troubleshooting.
User Issues
Users mainly encounter two problems with offline jobs: failures and performance degradation.
Why tasks fail
Kernel defects – upgrades without sufficient testing can cause large‑scale failures.
Dependency component issues – bugs or upgrades in shared components propagate failures.
Data quality problems – corrupted or malformed data leads to failures.
Other factors – miscellaneous causes such as memory errors.
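A first triage step over these failure causes can be sketched as simple pattern matching on task logs. The patterns and category names below are illustrative assumptions for the sketch, not Bilibili's actual diagnostic rules:

```python
import re

# Illustrative mapping from log patterns to the failure causes above.
# Both the regexes and the category labels are assumptions.
FAILURE_PATTERNS = [
    (r"NullPointerException|assertion failed", "kernel defect"),
    (r"ClassNotFoundException|NoSuchMethodError", "dependency component issue"),
    (r"malformed|corrupt(ed)?\s+(block|record|file)", "data quality problem"),
    (r"OutOfMemoryError|Container killed.*memory", "memory error"),
]

def classify_failure(log_text: str) -> str:
    """Return the first matching failure category, or 'unknown'."""
    for pattern, category in FAILURE_PATTERNS:
        if re.search(pattern, log_text, flags=re.IGNORECASE):
            return category
    return "unknown"
```

In practice such rules only cover recurring, well-understood errors; anything falling into `unknown` is where model-assisted diagnosis would take over.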
Why tasks become slow
Hardware aging – large storage arrays degrade over time, reducing I/O speed.
Resource scheduling pressure – massive user load and mixed‑deployment policies cause contention.
Data skew or distribution problems – uneven data partitions slow processing.
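Of the slowdown causes above, data skew is the most mechanical to detect: one partition is far larger than its peers. A minimal sketch, assuming partition sizes are already collected and using an illustrative 5x threshold (not a value from the article):

```python
from statistics import median

def detect_skew(partition_sizes, threshold=5.0):
    """Flag a stage as skewed when the largest partition is much
    bigger than the median one. The 5x default is an assumption."""
    if not partition_sizes:
        return False
    med = median(partition_sizes)
    if med == 0:
        return max(partition_sizes) > 0
    return max(partition_sizes) / med >= threshold
```

Comparing the maximum against the median (rather than the mean) keeps the check robust when a single huge partition would otherwise drag the average up.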
Need for Intelligent Assistance
Diagnosing these failures and slowdowns is labor‑intensive and time‑consuming, prompting the exploration of AI‑driven tools to automate root‑cause analysis and provide actionable guidance.
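The overall shape of such an assistant can be sketched as a small loop: gather context for the user's report, then ask a model for a root cause and a fix. Everything here is a hypothetical skeleton; the prompt format, the `root_cause|suggestion` response convention, and the function names are assumptions, not Bilibili's implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Diagnosis:
    root_cause: str
    suggestion: str

def diagnose(query: str,
             fetch_logs: Callable[[str], str],
             llm: Callable[[str], str]) -> Diagnosis:
    """Collect task logs for the reported problem, then ask the model
    for a structured answer. The prompt and response format are
    illustrative assumptions."""
    logs = fetch_logs(query)
    prompt = (f"User report: {query}\n"
              f"Task logs:\n{logs}\n"
              "Answer as: root_cause|suggestion")
    root_cause, _, suggestion = llm(prompt).partition("|")
    return Diagnosis(root_cause.strip(), suggestion.strip())
```

Keeping `fetch_logs` and `llm` as injected callables makes the loop testable with stubs and lets the context-gathering side grow (metrics, lineage, past tickets) without touching the core flow.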
User Query Characteristics
Typical queries are concise and engineering‑focused, often consisting of a brief problem description plus a link or screenshot.
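Because queries arrive as terse text plus a link, a useful preprocessing step is to pull out the machine-usable parts before any diagnosis. A minimal sketch; the `task_<digits>` ID shape is a made-up convention for illustration:

```python
import re

def extract_hints(query: str) -> dict:
    """Extract links and task IDs from a terse user query.
    The task-ID pattern 'task_<digits>' is an assumed convention."""
    return {
        "links": re.findall(r"https?://\S+", query),
        "task_ids": re.findall(r"task_\d+", query),
    }
```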
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.